1. Introduction
With the rapid advancement of artificial intelligence, robotics, and computer vision technologies, service robots have gradually transitioned from laboratory environments to practical applications. Currently, they are widely deployed in indoor public spaces such as hospitals, schools, libraries, hotels, and shopping malls to perform tasks including patrols, guidance, cleaning, and security. This has not only significantly enhanced service efficiency but also effectively reduced labor costs [1,2,3,4]. To reliably execute tasks in these complex, dynamic environments, robots must possess robust environmental perception, mapping, and autonomous navigation capabilities. Among these, mapping serves as a core module, providing fundamental environmental representation for robot localization and navigation.
Currently, indoor mobile robots primarily rely on metric maps represented by occupancy grids. Such maps focus on describing the geometric structure of the environment, containing only spatial information about obstacles and navigable areas, and lack semantic descriptions of entity attributes within the environment [5]. However, in practical service scenarios, natural language remains the most intuitive means of interaction between humans and robots. Traditional metric maps struggle to support advanced task comprehension and interaction based on semantic concepts, thereby limiting the intelligence level of service robots [6]. For instance, in public indoor areas like hospitals and shopping malls, robots frequently interact with humans. Relying solely on geometric information clearly fails to meet practical demands. Consequently, developing methods to construct semantic maps that integrate spatial structure and semantic information has become a research hotspot [7].
Semantic mapping builds upon geometric mapping by attaching semantic labels to entities in the environment, further integrating external knowledge bases and reasoning mechanisms to achieve advanced semantic understanding [8]. The typical semantic mapping process comprises two key components: spatial mapping and semantic extraction. Regarding spatial mapping, Simultaneous Localization and Mapping (SLAM) technology has been widely applied for pose estimation and map construction in unknown environments [9]. The core challenge in constructing semantic maps lies in extracting semantic information about environmental entities [10]. Currently, semantic extraction primarily falls into two categories: scene-oriented and object-oriented approaches. Scene-oriented methods focus on classifying and identifying functional areas such as rooms and corridors. Due to its low sensitivity to lighting conditions and high stability, LiDAR data is frequently employed for such tasks. Based on sensor type, semantic mapping methods can be categorized into those using 2D LiDAR [11,12,13] and 3D LiDAR [14,15,16]. However, LiDAR provides limited texture information, making it difficult to distinguish objects with similar geometric features but different semantic meanings. In contrast, visual sensors offer rich texture and color details, making them more suitable for object-oriented recognition (e.g., identifying doors, trash cans). Although semantic information can be supplemented through geometric model matching [17,18] or human–machine interaction [19,20], purely vision-based semantic SLAM still faces challenges such as sparse mapping, high computational overhead, and poor environmental adaptability. While integrating inertial measurement units (IMUs) [21] and odometers [22] can enhance stability, practical application remains limited in public indoor environments with drastic lighting variations.
To address the limitations of single-sensor systems, Martins et al. proposed a solution integrating visual and laser SLAM, leveraging deep learning to extract semantic information and fuse it into geometric maps, thereby significantly enhancing map accuracy [23]. However, such methods typically require substantial computational resources, making deployment on computationally constrained public service robots challenging. Furthermore, during mapping, perspective shifts and pose drift can cause the system to repeatedly detect and label the same object at different times and locations, leading to semantic redundancy in the map [24]. With the advancement of deep learning, semantic acquisition methods based on object detection [25] and instance segmentation [26] have gradually become mainstream. Algorithms represented by the YOLO series are widely applied in semantic SLAM. However, standard YOLO models feature complex architectures and large parameter counts, posing challenges for direct deployment on mobile robot platforms [27]. To address this issue, lightweight networks such as MobileNet [28,29], ShuffleNet [30,31], EfficientNet [32], and FasterNet [33] have been proposed, enabling real-time perception for embedded systems. For instance, Zhang F et al. achieved model lightweighting by replacing the YOLO backbone with MobileNetV3 [34], while Zhang D et al. introduced FasterNet to enhance YOLOv8n-seg and reduce resource consumption [35].
To address the application requirements of service robots in indoor public areas, this paper proposes a low-resource semantic information extraction method based on improved instance segmentation. It also designs a continuous-frame dynamic constraint approach to resolve multi-view redundant labeling issues in semantic mapping. The main contributions are as follows:
Dataset Construction: Two specialized datasets were constructed for common objects in public indoor areas. This involved combining self-collected data with public datasets and applying comprehensive data augmentation techniques.
Synergistic Architectural Optimization for Edge Perception: We designed the lightweight instance segmentation model, MS-YOLO, specifically optimized for edge devices. By reconstructing the backbone with MobileNetV4 [36] and innovatively integrating the SHViT [37] module into the C2f neck to form a complementary “local-global” feature extractor, this architecture resolves the conflict between high-resolution perception and limited computational budgets. This synergy achieves a significant 32.2% reduction in FLOPs without compromising segmentation accuracy.
Application-Driven Strategy for Semantic Landmark Registration: To overcome the pervasive multi-view redundant labeling problem in semantic SLAM, we developed a computationally efficient Continuous-Frame Dynamic Constraint method. This method tightly couples TF coordinate transformations with the spatiotemporal consistency verification of depth point clouds, ensuring the uniqueness and stability of semantic landmarks in global maps without relying on computationally heavy tracking algorithms.
End-to-End System-Level Implementation and Validation: We successfully integrated the optimized perception and mapping modules into a complete ROS 2 framework and deployed it on a low-power edge platform. Real-world testing in public indoor scenarios demonstrates that our system achieves reliable, real-time semantic map generation, providing a highly practical and verifiable engineering solution for service robots.
2. Datasets
Public indoor areas (such as hospitals, laboratory buildings, and shopping malls) typically contain a series of representative objects. These objects include both everyday service facilities—such as doors, rubbish bins, and benches—and fire safety equipment installed to ensure security in high-traffic environments. These objects hold significant semantic value for service robots performing navigation, interaction, and assistance tasks. To ensure the constructed dataset comprehensively reflects the semantic characteristics of public indoor scenes, this study selected four object categories—doors, rubbish bins, benches, and fire extinguishers—as the target classes for instance segmentation in the first phase.
It is worth noting that while mainstream indoor datasets such as NYUv2 and ScanNet provide massive amounts of data, their class distributions skew heavily towards domestic and residential objects (e.g., beds, sofas, and desks). They lack critical public safety and functional landmarks essential for service robots in public areas. Because no publicly available dataset covers all the aforementioned categories simultaneously, the first-phase experimental dataset (i.e., Indoor1) was constructed using a multi-source fusion approach. Specifically, instance segmentation data for the four target categories (doors, rubbish bins, benches, and fire extinguishers) were obtained from the Roboflow platform (accessible online at: https://roboflow.com, accessed on 4 December 2025). The dataset comprises 767 images of doors, 202 images of rubbish bins, 459 images of benches, and 489 images of fire extinguishers, totaling 1917 images. All data were divided into training, validation, and test sets in a 7:2:1 ratio to ensure reliable model training and evaluation. Partial examples of the constructed indoor common object dataset are shown in Figure 1.
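The 7:2:1 partition described above can be sketched as follows. This is a minimal illustration and not the authors' script: the file names, the fixed random seed, and the choice to give any remainder to the test set are assumptions, and the source does not state whether the split was stratified by class.

```python
import random

def split_dataset(items, ratios=(7, 2, 1), seed=0):
    """Shuffle items reproducibly and split them by integer ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Hypothetical file names standing in for the 1917 Indoor1 images.
image_names = [f"img_{i:04d}.jpg" for i in range(1917)]
train, val, test = split_dataset(image_names)
print(len(train), len(val), len(test))  # 1341 383 193
```

With integer arithmetic the three subsets always cover every image exactly once, so no sample is lost to rounding.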
To further validate the feasibility of the proposed method in real-world scenarios while evaluating the model’s generalization capability and stability in practical applications, this study constructed a self-collected dataset (Indoor2) in the second phase. Data collection was conducted at the Third Experimental Building of Chongqing University’s Huxi Campus, yielding 1905 raw images covering five target categories: doors, rubbish bins, fire extinguishers, fire hydrants, and evacuation equipment boxes. To enhance the performance, robustness, and generalization capability of the deep learning model, various data augmentation techniques were applied to the raw images. These included brightness adjustments, random cropping, arbitrary angle rotation, and mirror flipping. The effects of data augmentation are illustrated in Figure 2. Following these processes, the dataset size expanded to 5000 images, significantly increasing data diversity while preserving the original category distribution characteristics.
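Two of the augmentations listed above, brightness adjustment and mirror flipping, can be illustrated with a small pure-Python sketch. It is not the authors' pipeline: the grayscale nested-list image, the brightness factor, and the pixel-centre flip convention are assumptions chosen to keep the example self-contained. The point it shows is that geometric augmentations for instance segmentation must transform the polygon labels together with the pixels.

```python
def adjust_brightness(image, factor):
    """Scale every pixel of a grayscale image (list of rows) and clip to 0..255."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]

def mirror_flip(image, polygon):
    """Flip the image left-right and mirror the polygon x coordinates to match."""
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    mirrored = [(width - 1 - x, y) for x, y in polygon]
    return flipped, mirrored

# Tiny stand-in image (3 rows x 4 columns) and a polygon in (x, y) pixel coordinates.
image = [[10, 50, 100, 200],
         [20, 60, 110, 210],
         [30, 70, 120, 220]]
polygon = [(0, 0), (2, 0), (2, 2), (0, 2)]

brighter = adjust_brightness(image, 1.5)
flipped, mirrored = mirror_flip(image, polygon)
print(brighter[0])  # [15, 75, 150, 255]
print(mirrored)     # [(3, 0), (1, 0), (1, 2), (3, 2)]
```

Applying `mirror_flip` twice returns the original image and polygon, which is a convenient sanity check for any label-aware augmentation.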
During the annotation phase of the self-collected dataset, this study employed the Labelme tool for polygonal annotation of images to precisely delineate target contours. This annotation method effectively captures shape features of objects, aiding instance segmentation models in learning more detailed boundary information. To ensure rigorous annotation quality and mitigate subjective bias, a cross-review mechanism was implemented. The dataset was independently annotated by two researchers. Subsequently, a third senior researcher conducted consistency checks and resolved inter-annotator disagreements through Intersection-over-Union (IoU) evaluations, ensuring high-fidelity ground truth generation. An annotation schematic is shown in Figure 3a. Annotation files were initially saved in JSON format and converted to TXT format prior to model training to meet the input requirements of the YOLO instance segmentation network. Figure 3b presents an example of a model training result. Ultimately, this high-quality annotated dataset provides a robust training foundation for the model in instance segmentation tasks within public indoor areas. The complete dataset and the source code for the proposed system are publicly available at: https://huggingface.co/datasets/lwarwick480-create/indoor (accessed on 21 March 2026).
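The JSON-to-TXT conversion mentioned above can be sketched as below. The function assumes the usual Labelme fields (`imageWidth`, `imageHeight`, `shapes`, `label`, `points`, `shape_type`) and the YOLO segmentation label layout of one line per instance: a class index followed by polygon coordinates normalized to [0, 1]. The class mapping and the example record are hypothetical and are not taken from the released dataset.

```python
def labelme_to_yolo_seg(annotation, class_ids):
    """Convert one parsed Labelme annotation (a dict) into YOLO segmentation label lines."""
    width = annotation["imageWidth"]
    height = annotation["imageHeight"]
    lines = []
    for shape in annotation["shapes"]:
        if shape.get("shape_type", "polygon") != "polygon":
            continue  # only polygon annotations are converted in this sketch
        coords = []
        for x, y in shape["points"]:
            coords.append(f"{x / width:.6f}")   # normalize x by image width
            coords.append(f"{y / height:.6f}")  # normalize y by image height
        lines.append(" ".join([str(class_ids[shape["label"]])] + coords))
    return lines

# Hypothetical class mapping and a minimal Labelme-style record for illustration.
class_ids = {"door": 0, "rubbish_bin": 1}
example = {
    "imageWidth": 640,
    "imageHeight": 480,
    "shapes": [
        {"label": "door", "shape_type": "polygon",
         "points": [[64, 48], [320, 48], [320, 432], [64, 432]]},
    ],
}
label_lines = labelme_to_yolo_seg(example, class_ids)
print(label_lines[0])
# 0 0.100000 0.100000 0.500000 0.100000 0.500000 0.900000 0.100000 0.900000
```

In practice each image's lines would be written to a TXT file sharing the image's base name; file handling is omitted here.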
4. Experimental Design and Analysis of Results
4.1. Experimental Environment and Parameters
Hardware Configuration: Processor is Intel Corporation (Santa Clara, CA, USA) Xeon Platinum 8481C, 25 cores; Graphics card is NVIDIA Corporation (Santa Clara, CA, USA) RTX 4090, 24 GB × 1 VRAM. Software Configuration: Programming language is Python 3.8.10; Deep learning framework: PyTorch 2.0.0 (CUDA 11.8).
Parameter Settings: Training hyperparameters were empirically tuned to balance model convergence and computational efficiency. The input image size was chosen to ensure high-resolution feature extraction within GPU memory limits. Training was fixed at 100 epochs, which was experimentally observed to be sufficient for full feature learning while preventing overfitting. A batch size of 16 was chosen to optimize training speed and memory utilization on the RTX 4090. The initial learning rate was set to 0.01 to escape local optima early on, with a cosine annealing strategy providing smooth decay in later stages. Other parameters followed the YOLOv8 default settings.
The above environment and parameter settings ensure efficient experimental operation with hardware acceleration, providing a stable foundation for model convergence and performance evaluation.
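The cosine annealing schedule referred to in the parameter settings can be sketched as follows, using the stated initial learning rate of 0.01 and 100 epochs. The final learning-rate fraction (`final_fraction`) is an assumption introduced for illustration; the source does not report the value used, and warm-up is ignored.

```python
import math

def cosine_lr(epoch, total_epochs=100, lr0=0.01, final_fraction=0.01):
    """Cosine-annealed learning rate decaying from lr0 to lr0 * final_fraction."""
    lr_min = lr0 * final_fraction  # assumed floor of the schedule
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr0 - lr_min) * (1.0 + math.cos(math.pi * progress))

schedule = [cosine_lr(e) for e in range(101)]
print(f"{schedule[0]:.5f} {schedule[50]:.5f} {schedule[100]:.5f}")
```

The rate changes slowly near the start and the end of training and fastest around the midpoint, which is the "smooth decay" property the text relies on.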
4.2. Evaluation Indicators
To comprehensively evaluate the performance of the MS-YOLO instance segmentation model proposed in this paper, Precision, Recall, Mean Average Precision (mAP), and Giga Floating-point Operations (GFLOPs) were selected as the primary evaluation metrics.
Precision measures the proportion of samples predicted as positive by the model that are actually positive, reflecting its ability to avoid false positives during detection. Higher precision indicates a lower false positive rate. Its calculation formula is shown in (7):
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{7}$$
Among these, $TP$ denotes the number of targets correctly detected, while $FP$ represents the number of targets incorrectly detected.
Recall measures the proportion of all actual positive samples correctly detected by the model, reflecting its ability to minimize missed detections. A higher Recall indicates a higher target detection rate for the model. Its calculation formula is shown in (8):
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{8}$$
Among these, $FN$ denotes the actual number of targets that were not detected.
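As a small worked example of the two definitions, the sketch below computes Precision and Recall from raw counts. The counts are illustrative and are not taken from the paper's experiments; returning 0.0 for an empty denominator is a convention chosen here.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Illustrative counts: 90 correct detections, 10 false alarms, 30 missed targets.
p, r = precision_recall(tp=90, fp=10, fn=30)
print(p, r)  # 0.9 0.75
```

The example shows why both metrics are reported: a detector can have few false alarms (high precision) while still missing a quarter of the targets (lower recall).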
mAP is the most commonly used comprehensive metric in object detection and instance segmentation tasks, reflecting the model’s overall detection accuracy at different Recall levels. First, the average precision (AP) is calculated for a single category. AP is computed based on the area under the Precision–Recall (P-R) curve, as shown in Formula (9):
$$AP = \int_{0}^{1} P(R)\,\mathrm{d}R \tag{9}$$
Then, the mAP is calculated by averaging the AP values across all categories:
$$mAP = \frac{1}{n}\sum_{i=1}^{n} AP_i \tag{10}$$
Here, $n$ denotes the number of classes. This paper employs two common forms: mAP@0.5 (mAP calculated at an IoU threshold of 0.5) and mAP@[0.5:0.95] (mAP calculated at IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05, then averaged).
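The area-under-curve definition of AP and the class average for mAP can be sketched as follows. The sketch uses all-point interpolation with a monotone precision envelope, which is one common convention; evaluation toolkits may sample the curve differently, so values can deviate slightly from those reported by a training framework. The precision-recall points are illustrative.

```python
def average_precision(recalls, precisions):
    """Area under the precision-recall curve using all-point interpolation."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision monotonically non-increasing when read from left to right.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas wherever recall increases.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

def mean_average_precision(ap_values):
    """Average the per-class AP values."""
    return sum(ap_values) / len(ap_values)

# Illustrative curve: precision 1.0 up to recall 0.5, then 0.5 up to recall 1.0.
ap = average_precision(recalls=[0.5, 1.0], precisions=[1.0, 0.5])
print(ap)                                     # 0.75
print(mean_average_precision([0.75, 0.25]))  # 0.5
```

For mAP@[0.5:0.95], the same computation would be repeated at each IoU threshold from 0.5 to 0.95 and the results averaged.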
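Intersection over Union (IoU) is defined as the ratio of the intersection area between the predicted bounding box $B_{p}$ and the ground truth bounding box $B_{gt}$ to the union area:
$$IoU = \frac{\mathrm{Area}(B_{p} \cap B_{gt})}{\mathrm{Area}(B_{p} \cup B_{gt})} \tag{11}$$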
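The IoU definition can be illustrated for both axis-aligned boxes and segmentation masks. In this sketch, boxes are given as `(x1, y1, x2, y2)` corner coordinates and masks as sets of pixel coordinates; both representations are assumptions made for a dependency-free example.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mask_iou(mask_a, mask_b):
    """IoU of two masks represented as sets of (row, col) pixel coordinates."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 0.0

print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, about 0.143
print(mask_iou({(0, 0), (0, 1), (1, 0), (1, 1)}, {(1, 1), (1, 2)}))  # 0.2
```

The mask variant is the one that matters for the segmentation metrics, since a predicted contour is matched to the ground truth by pixel overlap rather than by box overlap.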
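GFLOPs denote the number of floating-point operations, in billions, required for one forward pass of the model, serving as a key metric for measuring computational complexity; a lower value generally indicates faster inference. In deep learning, GFLOPs reflect the computational overhead during model inference. For a single convolutional layer, with one multiply-accumulate counted as two operations, the value is calculated as shown in (12):
$$\mathrm{GFLOPs} = \frac{2 \times H \times W \times C_{in} \times C_{out} \times K_{h} \times K_{w}}{10^{9}} \tag{12}$$
Here, $H$ and $W$ denote the height and width of the output feature map, respectively, while $C_{in}$ and $C_{out}$ represent the input and output channel counts, respectively. $K_{h}$ and $K_{w}$ denote the height and width of the convolutional kernel, respectively.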
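The per-layer operation count can be sketched from the quantities defined above. Whether one multiply-accumulate is counted as one or two operations differs between tools, so the sketch exposes that choice as the hypothetical parameter `ops_per_mac`; bias and activation costs are ignored. The example layer (a 3 × 3 convolution from 3 to 16 channels with a 320 × 320 output) is illustrative and is not claimed to match any specific MS-YOLO layer.

```python
def conv_flops(h_out, w_out, c_in, c_out, k_h, k_w, ops_per_mac=2):
    """Operation count of one standard convolution layer (bias and activation ignored)."""
    macs = h_out * w_out * c_in * c_out * k_h * k_w
    return ops_per_mac * macs

def to_gflops(flops):
    """Convert a raw operation count to billions of operations."""
    return flops / 1e9

layer_flops = conv_flops(h_out=320, w_out=320, c_in=3, c_out=16, k_h=3, k_w=3)
print(layer_flops, to_gflops(layer_flops))  # 88473600 0.0884736
```

A model-level figure is the sum of such per-layer counts, which is why replacing the backbone changes the total far more than modifying a few neck blocks.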
In the experiments conducted in this paper, the Precision, Recall, and mAP metrics were obtained from the results.csv file generated after training. The GFLOPs values were computed automatically by the training framework when the model was built for training and inference.
4.3. Experimental Demonstration
To validate the effectiveness and superiority of the proposed MS-YOLO instance segmentation model in public indoor scenarios, this paper conducts systematic comparative experiments between it and the original YOLOv8n-seg model on the fusion dataset Indoor1 and the self-collected dataset Indoor2. The comparison primarily covers three aspects: detection performance, segmentation performance, and model complexity. The experimental results are shown in Table 2, Table 3, and Table 4, respectively.
From the perspective of detection performance (Table 2), on the Indoor2 dataset, MS-YOLO outperformed YOLOv8n-seg across all four metrics: Precision, Recall, mAP@0.5, and mAP@[0.5:0.95]. Specifically, Precision increased from 0.975 to 0.983, Recall rose to 0.972, and mAP@0.5 reached 0.989, indicating the improved model possesses stronger object recognition and localization capabilities in self-built real-world scenario data. On the Indoor1 fusion dataset, MS-YOLO demonstrates even more significant performance gains, particularly in Recall and high-threshold mAP metrics, achieving improvements of 7.0% and 8.9%, respectively. This indicates that the proposed lightweight backbone architecture and enhanced feature extraction module exhibit superior robustness and generalization capabilities under complex data distribution conditions.
The comparison of instance segmentation performance (Table 3) demonstrates that MS-YOLO achieves higher segmentation accuracy than the original model on both datasets. On the Indoor2 dataset, MS-YOLO achieves an mAP@0.5 of 0.987, with segmentation precision and recall improving to 0.981 and 0.970, respectively, indicating its ability to more accurately delineate object boundary information. On the Indoor1 dataset, MS-YOLO demonstrates even more pronounced advantages in segmentation tasks, with mAP@[0.5:0.95] improving from 0.691 to 0.782. This indicates that the improved model significantly enhances its ability to fit instance contours under multi-scale objects and complex background conditions.
As shown in the model complexity comparison results (Table 4), MS-YOLO achieves significant lightweighting while maintaining or even improving segmentation accuracy. Compared to YOLOv8n-seg, MS-YOLO reduces the number of parameters from 3.259 million to 2.876 million, decreasing model size by approximately 10.3%. FLOPs are substantially lowered from 12.1 G to 8.2 G, representing a reduction of 32.2%. These results demonstrate that the proposed optimization strategy effectively reduces computational overhead without compromising model performance, making it particularly suitable for deployment on resource-constrained edge platforms such as Jetson Orin NX.
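The relative reductions quoted above follow from a one-line formula, sketched here with the figures given in the text. Applied to the FLOPs values it reproduces the stated 32.2%. Applied to the parameter counts it gives about 11.8%, so the 10.3% figure presumably refers to a different size measure, such as the stored weight file; that reading is an assumption, since the source does not say which quantity "model size" denotes.

```python
def relative_reduction(baseline, improved):
    """Percentage reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline * 100.0

flops_drop = relative_reduction(12.1, 8.2)       # GFLOPs: YOLOv8n-seg vs. MS-YOLO
params_drop = relative_reduction(3.259, 2.876)   # parameters in millions
print(round(flops_drop, 1), round(params_drop, 1))  # 32.2 11.8
```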
Based on experimental results, MS-YOLO outperforms the baseline YOLOv8n-seg model in detection accuracy, segmentation performance, and computational efficiency, providing a more efficient and reliable perception foundation for subsequent semantic mapping tasks of service robots in public indoor areas.
To further justify our hyperparameter settings, particularly the selection of 100 training epochs, we analyzed the training convergence process of MS-YOLO. As illustrated by the Mask evaluation indicator curves in Figure 10a, metrics including precision, recall, and mAP increase rapidly during the early stages of training, indicating that the model swiftly learns the key features required to distinguish target objects from complex indoor backgrounds. Simultaneously, the validation loss curves (Figure 10b) reveal a consistent and steady decline across bounding box (Box_loss), segmentation (Seg_loss), classification (Cls_loss), and distribution focal loss (Dfl_loss). As training approaches 100 epochs, all loss values plateau and stabilize. This steady convergence indicates that the model has learned the dataset’s features without exhibiting obvious signs of overfitting or underfitting, supporting the choice of the 100-epoch configuration.
4.4. Ablation Experiments
To evaluate the independent and synergistic contributions of the MobileNetV4 backbone and the C2f_SHViTBlock module to model performance, ablation experiments were conducted on the Indoor1 dataset. These experiments compared instance segmentation metrics (Precision, Recall, mAP@0.5) and model complexity variations across different structural combinations. The results are shown in Table 5.
Without incorporating MobileNetV4 and C2f_SHViTBlock, the model corresponds to the original YOLOv8n-seg, achieving Precision, Recall, and mAP@0.5 of 0.930, 0.854, and 0.893, respectively. Its parameter count and FLOPs were 3.259 M and 12.1 G. After replacing only the backbone network with MobileNetV4, the model’s parameters decreased to 3.147 M, and FLOPs were significantly reduced to 8.3 G, resulting in a computational cost reduction of approximately 31.4%. On this basis, Precision and mAP@0.5 improved to 0.942 and 0.916, respectively, indicating that the lightweight backbone maintains relatively stable feature representation capabilities while significantly reducing computational complexity.
When introducing the SHViTBlock only in the C2f module, the model parameters further decreased to 2.954 million. Precision and Recall improved to 0.955 and 0.871, respectively, with mAP@0.5 reaching 0.917. This demonstrates that the SHViTBlock effectively enhances feature modeling capabilities, particularly in expressing target semantic information more fully in complex backgrounds. However, with the backbone structure unchanged, FLOPs only slightly decreased to 11.4 G, indicating relatively limited computational compression.
When both the MobileNetV4 backbone and C2f_SHViTBlock are integrated, the model achieves an optimal balance between performance and complexity. Precision, Recall, and mAP@0.5 improve to 0.969, 0.888, and 0.923, respectively, while the number of parameters decreases to 2.876 M and FLOPs drop to 8.2 G. These results demonstrate that MobileNetV4 primarily handles significant computational compression, while C2f_SHViTBlock effectively compensates and stabilizes model accuracy on this foundation. The two components form a complementary relationship, aligning with the edge deployment requirements of resource-constrained scenarios such as public indoor service robots.
4.5. Model Comparison Experiments
To comprehensively evaluate the overall accuracy and computational complexity of different models, we conducted instance segmentation performance experiments using a diverse set of baseline models. In addition to YOLOv5n-seg and YOLOv11n-seg, we introduced YOLOv3-tiny-seg as an earlier lightweight baseline, YOLACT as a representative real-time one-stage model, and Mask R-CNN as a classic high-precision two-stage baseline. The experimental results (as shown in Table 6) demonstrate that MS-YOLO achieves the best results across all segmentation accuracy metrics, highlighting its advantages in complex indoor scene instance segmentation tasks.
Specifically, MS-YOLO outperforms both the YOLO series comparison models and YOLACT in Precision, Recall, mAP@0.5, and mAP@[0.5:0.95] metrics. Notably, even when compared to the heavyweight Mask R-CNN model, which typically excels in fine boundary delineation, MS-YOLO achieves slightly better performance, improving the mAP@0.5 from 0.905 (Mask R-CNN) to 0.923 and the more stringent mAP@[0.5:0.95] metric from 0.778 to 0.782. This demonstrates MS-YOLO’s superior stability and generalization capabilities in target boundary delineation and fine-grained segmentation.
In terms of computational complexity, MS-YOLO achieves only 8.2 G FLOPs. This is not only significantly lower than other lightweight models such as YOLOv11n-seg (9.9 G), YOLOv5n-seg (11.0 G), and YOLOv3-tiny-seg (32.7 G), but also represents a massive reduction in computational overhead compared to YOLACT (68.4 G) and Mask R-CNN (190.0 G). This reduction in computational demand is accompanied by higher segmentation accuracy. This demonstrates that the proposed MS-YOLO strikes an exceptional balance between accuracy and efficiency, making it highly suitable for deployment in practical scenarios with limited computational resources.
Regarding the significance of the differences observed in Table 2, Table 3, Table 4, Table 5 and Table 6, it should be noted that in deep learning tasks, performance metrics are evaluated deterministically over large-scale validation sets. The consistent improvements achieved by MS-YOLO across multiple metrics and two distinct data distributions indicate robust performance gains. Furthermore, from an engineering perspective, achieving superior segmentation accuracy while securing a 32.2% reduction in FLOPs constitutes a highly significant practical advantage for real-time deployment on resource-constrained robotic platforms.
4.6. Semantic Mapping Results
Previous experiments have demonstrated that the MS-YOLO instance segmentation model proposed in this paper exhibits superior deployment adaptability on computationally constrained service robot platforms, providing a reliable perceptual foundation for edge-side semantic mapping. To validate the feasibility of our proposed low-computational-overhead, human–machine interaction-oriented semantic mapping method, we completed full-system deployment and real-world testing on edge devices. The hardware platform utilizes the Jetson Orin NX Super (NVIDIA, Santa Clara, CA, USA) as the computational unit (see Figure 11a), integrated with an Ackermann chassis (Wheeltec, Dongguan, China) and an Astra Pro Plus depth camera to form the mobile robot platform for experimental use (see Figure 11b). The software system is built upon ROS 2.
The experiment selected three indoor scenes for mapping trials, as shown in Figure 12, to cover diverse spatial configurations and object arrangements. It is crucial to clarify that these mapping trials were conducted online in real physical environments. The sensor data used to construct occupancy grid maps (Figure 13) and semantic maps (Figure 14) were collected in real-time during the robot’s navigation. This online mapping process is completely independent of the static image datasets described in Section 2, which were utilized solely for the offline training of the MS-YOLO model. Results demonstrate that within each scene, the system successfully constructs an occupancy grid map while simultaneously detecting, integrating, and updating semantic landmarks, ultimately generating a complete semantic map as shown in Figure 14. The spatial positions of target categories align with geometric structures such as walls and passageways, indicating the solution possesses strong edge-side implementability and environmental adaptability.
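The general idea of registering each landmark only once, despite repeated detections across frames, can be illustrated with a deliberately simplified sketch. It is not the paper's Continuous-Frame Dynamic Constraint method: the class names, the merge radius, the consecutive-frame threshold, and the running-mean position update are all assumptions, and detections are assumed to be already transformed into the map frame, whereas the actual system relies on TF transformations and depth point cloud consistency checks.

```python
import math
from dataclasses import dataclass

@dataclass
class Landmark:
    label: str
    x: float
    y: float
    hits: int = 0

class LandmarkRegistry:
    """Hypothetical sketch: confirm a landmark after consistent consecutive detections."""

    def __init__(self, merge_radius=0.5, min_consecutive=5):
        self.merge_radius = merge_radius        # metres; assumed value
        self.min_consecutive = min_consecutive  # frames; assumed value
        self.candidates = []  # pending observations not yet on the map
        self.landmarks = []   # confirmed, unique landmarks

    def _nearest(self, pool, label, x, y):
        """Closest entry of the same class within the merge radius, or None."""
        best, best_d = None, self.merge_radius
        for lm in pool:
            d = math.hypot(lm.x - x, lm.y - y)
            if lm.label == label and d <= best_d:
                best, best_d = lm, d
        return best

    @staticmethod
    def _absorb(lm, x, y):
        """Fold a new observation into the running mean position."""
        lm.hits += 1
        lm.x += (x - lm.x) / lm.hits
        lm.y += (y - lm.y) / lm.hits

    def update(self, detections):
        """Process one frame of (label, x, y) detections expressed in the map frame."""
        seen = set()
        for label, x, y in detections:
            known = self._nearest(self.landmarks, label, x, y)
            if known is not None:
                self._absorb(known, x, y)  # re-observation: refine, add nothing new
                continue
            cand = self._nearest(self.candidates, label, x, y)
            if cand is None:
                cand = Landmark(label, x, y)
                self.candidates.append(cand)
            elif id(cand) in seen:
                continue  # count each candidate at most once per frame
            self._absorb(cand, x, y)
            seen.add(id(cand))
        pending = []
        for cand in self.candidates:
            if id(cand) not in seen:
                continue  # a missed frame breaks the consecutive-frame requirement
            if cand.hits >= self.min_consecutive:
                self.landmarks.append(cand)
            else:
                pending.append(cand)
        self.candidates = pending

registry = LandmarkRegistry(merge_radius=0.5, min_consecutive=3)
frames = [
    [("door", 2.00, 1.00)],
    [("door", 2.05, 0.98)],
    [("door", 1.97, 1.03), ("fire_extinguisher", 4.0, 0.5)],  # extinguisher seen once only
    [("door", 2.02, 1.01)],
]
for frame in frames:
    registry.update(frame)
print([(lm.label, lm.hits) for lm in registry.landmarks])  # [('door', 4)]
```

In this toy run, five raw detections collapse into a single door landmark, and the one-off fire extinguisher observation is discarded as transient.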
While Figure 14 visually confirms the successful construction of semantic maps, quantitative evaluation is essential to fully validate the system’s efficiency and map quality. However, as highlighted by recent studies [10], traditional mapping evaluation metrics such as Absolute Trajectory Error (ATE) and Relative Pose Error (RPE) are primarily centered around the pose estimation of geometric SLAM systems. Since mature SLAM algorithms can readily achieve centimeter-level accuracy in typical indoor environments, further evaluating trajectory precision offers limited practical value for service robot navigation. Currently, the academic community lacks a universally recognized evaluation standard specifically for semantic mapping.
Therefore, rather than relying on geometric trajectory metrics, we focus on the practical effectiveness of our Continuous-Frame Dynamic Constraint algorithm in eliminating semantic redundancy. To quantitatively evaluate this, we define the Redundancy Elimination Rate (RER):
$$\mathrm{RER} = \frac{N_{raw} - N_{unique}}{N_{raw}} \times 100\%$$
where $N_{raw}$ is the number of raw semantic bounding boxes continuously detected, and $N_{unique}$ is the final number of unique landmarks registered on the map. The evaluation results across the three scenes are presented in Table 7.
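The RER computation can be sketched as below, assuming it is the share of raw detections that do not end up as separate map landmarks, which is consistent with the two quantities defined in the text. The counts in the example are hypothetical and are not the values of Table 7.

```python
def redundancy_elimination_rate(n_raw, n_unique):
    """Percentage of raw detections removed as redundant before map registration."""
    if n_raw <= 0:
        raise ValueError("n_raw must be positive")
    return (n_raw - n_unique) / n_raw * 100.0

# Hypothetical scene: 400 raw detections collapse into 8 unique landmarks.
rer = redundancy_elimination_rate(n_raw=400, n_unique=8)
print(f"{rer:.1f}%")  # 98.0%
```

Under this definition the rate grows with the number of frames in which each landmark is re-observed, so longer trajectories naturally yield higher values for the same set of landmarks.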
As shown in Table 7, our algorithm effectively filters out transient multi-view redundancies, consistently achieving an RER of over 97%. This compression keeps the map concise without discarding landmark information. Furthermore, the end-to-end system maintains an inference speed of approximately 38 FPS on the Jetson Orin NX Super. Since 30 FPS is generally sufficient for real-time robotic applications, our system balances low map data overhead with real-time perception.
Beyond quantitative metrics, comparing the occupancy grid maps (Figure 13) with our semantic maps (Figure 14) further highlights the system’s advantages for human–computer interaction. While traditional grid maps primarily express geometric passability, our semantic maps explicitly integrate key object categories, locations, and instance boundaries. This rich and highly readable representation directly supports semantic-based task planning and complex instruction execution for service robots.
5. Discussion
The experimental results systematically demonstrate that the proposed framework addresses the dual challenges of high computational overhead and semantic map redundancy on edge devices. By reconstructing the backbone with MobileNetV4 and introducing the SHViT module, MS-YOLO achieved a 32.2% reduction in FLOPs compared to the baseline while maintaining a high mAP@0.5 of 0.989. Furthermore, the continuous-frame dynamic constraint method effectively eliminated transient “ghosting” observations, achieving a redundancy elimination rate (RER) of over 97%. This indicates that deploying high-resolution semantic perception and mapping on resource-constrained service robot platforms (e.g., Jetson Orin NX) is not only theoretically feasible but also robust in engineering practice.
Despite these promising results, this study involves several intentional engineering trade-offs and limitations. First, regarding landmark representation, projecting 3D point clusters into 2D minimum bounding rectangles inherently discards vertical geometric details. While sufficient for planar robot navigation, it limits applications requiring full 3D manipulation. Second, the current system is optimized for static functional landmarks (e.g., doors, fire equipment). In highly complex and dynamic scenes—such as those with extremely dense crowds causing prolonged, severe occlusions—the semantic extraction process may be interrupted. Currently, the system relies on downstream local planners (e.g., TEB or DWA) to handle transient dynamic pedestrians, which is not reflected in the global semantic map.
To address these identified limitations, our future research will focus on two main directions. First, we plan to enhance the model’s robustness against severe occlusions by introducing temporal tracking mechanisms and multi-sensor fusion (e.g., integrating LiDAR intensity features with visual semantics). Second, we aim to deeply integrate the generated semantic maps into the robot’s active SLAM and autonomous decision-making systems, exploring how semantic priors can optimize global path planning and complex human–robot interaction tasks in crowded public environments.
6. Conclusions
This paper addresses the dual challenges of high computational overhead and semantic redundancy faced by existing semantic mapping methods on computationally constrained platforms. We propose an efficient semantic mapping system based on an improved lightweight instance segmentation network (MS-YOLO) and a Continuous-Frame Dynamic Constraint strategy.
The core innovation lies in achieving real-time, robust semantic environment representation on resource-constrained edge computing devices. Methodologically, MS-YOLO significantly reduces computational complexity (FLOPs and parameter counts) while maintaining high segmentation accuracy for typical indoor objects, outperforming the baseline YOLOv8n-seg model. Simultaneously, the proposed dynamic constraint method based on depth point clouds effectively filters transient observation noise, eliminating “ghosting” phenomena and ensuring the global uniqueness of semantic landmarks.
End-to-end deployment testing on a self-built hybrid dataset and the Jetson Orin NX platform successfully validated the system’s engineering applicability. The framework ensures stable operation in complex public indoor environments, generating highly readable semantic maps that achieve an excellent balance between real-time performance and spatial accuracy. Ultimately, this lightweight mapping solution provides a robust perceptual foundation for advanced robot navigation and human–robot interaction tasks in practical service domains.