4.1. Segmentation Mask Evaluation
The test set results, summarized in Table 4, provide a comparison of four encoder–decoder-based architectures (U-Net [10], DeepLabv3+ [9], FPN [11], and SegFormer [8]), each paired with different encoder backbones (MobileNetV4, EfficientNet-B2, and ResNet-34). Results are grouped into Loss and Accuracy Metrics. The objective of the comparison is to assess each model and encoder combination on its semantic segmentation quality.
Across all evaluated architectures, EfficientNet-B2 consistently outperformed the other encoders, achieving higher overlap-based metrics such as Dice and IoU with low overall loss. For example, U-Net with EfficientNet-B2 attained a Dice score of 0.8392, an IoU of 0.6555, a test loss of 0.3608, a focal loss of 0.0240, and an F1-score of 0.7050. Similar improvements were observed for DeepLabv3+ and FPN when using the same encoder, confirming its reliability across different model architectures. The SegFormer architecture, when combined with EfficientNet-B2, delivered accuracy comparable to traditional CNN-based models such as FPN and DeepLabv3+, achieving a Dice score of 0.8376 and an IoU of 0.6459. In contrast, all models using MobileNetV4 consistently underperformed in accuracy metrics. For instance, U-Net with MobileNetV4 reached only 0.7919 Dice, 0.6014 IoU, and had the highest test loss.
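For reference, the overlap metrics reported in Table 4 follow their standard definitions. The sketch below is a minimal illustration assuming per-class binary masks and macro averaging over classes; the function names and the averaging scheme are illustrative assumptions and may differ from the exact evaluation protocol used here.

```python
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

def iou_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """IoU = |A ∩ B| / |A ∪ B| for binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (intersection + eps) / (union + eps)

def mean_dice_iou(pred_labels: np.ndarray, gt_labels: np.ndarray, num_classes: int):
    """Macro-averaged Dice and IoU over class indices (averaging scheme assumed)."""
    dices, ious = [], []
    for c in range(num_classes):
        dices.append(dice_score(pred_labels == c, gt_labels == c))
        ious.append(iou_score(pred_labels == c, gt_labels == c))
    return float(np.mean(dices)), float(np.mean(ious))
```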
When evaluated on downsampled images of 480 × 304 pixels (Table 5), the U-Net with a ResNet-34 encoder achieves the highest overlap metrics, with a Dice score of 0.7766 and an IoU of 0.6193, outperforming DeepLabv3+ with EfficientNet-B2 (Dice score of 0.7554, IoU of 0.6012).
Both U-Net (ResNet-34) and DeepLabv3+ (EfficientNet-B2) trained on SAM2-generated masks underperform compared to using SegFormer-B5-derived labels. For instance, U-Net’s Dice (0.6913) and IoU (0.5543) are roughly 15 points lower than with SegFormer-B5 masks. This semantic gap arises because SAM2’s class-agnostic approach lacks explicit category guidance; its masks may group adjacent regions without regard for class boundaries, whereas SegFormer-B5, pre-trained on Cityscapes, produces semantically coherent outlines aligned with object classes. Consequently, downstream models trained with SAM2 labels learn from noisier, semantically ambiguous annotations, leading to the degraded segmentation performance seen in Table 6.
In our study, to better approximate real-world conditions and assess the robustness of the models, we included test scenarios that simulate partial information loss caused by adverse weather, such as rain, as well as interference during data transmission leading to packet-loss artifacts. We applied transforms.RandomErasing, parameterized by probability, scale, and ratio, to the test set to replace 2–33% of each image with random noise. Under this corruption (Table 7), the total loss of U-Net (ResNet-34) increases from 0.355 to 0.844 and its Dice score drops from 0.845 to 0.776, while the total loss of DeepLabv3+ (EfficientNet-B2) rises from 0.330 to 0.699 and its Dice score falls from 0.838 to 0.765, indicating slightly better resilience of DeepLabv3+. Moreover, these results indicate that, even under partial information loss during inference, the evaluated models maintain basic stability.
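The sketch below shows how such a corruption can be reproduced with torchvision. Only the scale range follows the stated 2–33% coverage; the probability, the aspect-ratio range, and the use of Compose with ToTensor are assumptions made for illustration rather than the exact settings used in our pipeline.

```python
import torch
from torchvision import transforms

# Test-time corruption: RandomErasing replaces a rectangular region covering
# 2-33% of the image with random noise. The probability p and the ratio range
# below are assumed (torchvision defaults); only the scale range reflects the
# 2-33% coverage stated in the text.
corrupt = transforms.Compose([
    transforms.ToTensor(),            # RandomErasing operates on tensor images
    transforms.RandomErasing(
        p=1.0,                        # assumption: always corrupt test images
        scale=(0.02, 0.33),           # erased area as a fraction of the image
        ratio=(0.3, 3.3),             # assumption: default aspect-ratio range
        value="random",               # fill the erased region with random noise
    ),
])

# Usage: apply to a PIL image before feeding it to the model.
# corrupted = corrupt(pil_image)
```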
4.2. Computational Requirements Comparison
Efficient deployment of semantic segmentation models in real-world robotic systems requires consideration of inference latency and computational resource usage. This section presents a detailed evaluation of latency, memory footprint, and GPU utilization across a range of encoder–decoder architectures tested at two different input resolutions.
Table 8 reports the inference performance of four state-of-the-art semantic segmentation architectures (U-Net, DeepLabv3+, FPN, SegFormer), each combined with three ImageNet-pre-trained encoders (MobileNetV4, ResNet-34, EfficientNet-B2). All models were benchmarked over 1000 forward passes using input images at a resolution of 960 × 608 pixels, with selected architectures additionally evaluated at the reduced resolution of 480 × 304. The table summarizes the following metrics: latency statistics (minimum, maximum, and mean inference time, and standard deviation), all expressed in milliseconds; the GPU memory required for parameter loading, in megabytes (MB); and the peak GPU utilization percentage during inference. Inference was measured by running each FP32 ONNX model on the GPU via the ONNX Runtime CUDA execution provider. To ensure experimental consistency, all evaluations were performed on a single computer equipped with an NVIDIA RTX 4070 GPU.
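A minimal sketch of such a latency benchmark with the ONNX Runtime CUDA execution provider is shown below; the model file name, warm-up count, and input layout are illustrative assumptions rather than our exact benchmarking script.

```python
import time
import numpy as np
import onnxruntime as ort

MODEL_PATH = "unet_resnet34_fp32.onnx"   # illustrative file name
INPUT_SHAPE = (1, 3, 608, 960)           # assumed NCHW layout for 960 x 608 input
N_WARMUP, N_RUNS = 50, 1000

session = ort.InferenceSession(MODEL_PATH, providers=["CUDAExecutionProvider"])
input_name = session.get_inputs()[0].name
dummy = np.random.rand(*INPUT_SHAPE).astype(np.float32)

# Warm-up passes so CUDA kernels and memory pools are initialized before timing.
for _ in range(N_WARMUP):
    session.run(None, {input_name: dummy})

latencies_ms = []
for _ in range(N_RUNS):
    start = time.perf_counter()
    session.run(None, {input_name: dummy})
    latencies_ms.append((time.perf_counter() - start) * 1000.0)

lat = np.array(latencies_ms)
print(f"min={lat.min():.2f} ms  max={lat.max():.2f} ms  "
      f"mean={lat.mean():.2f} ms  std={lat.std():.2f} ms")
```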
Among the evaluated configurations processing full-resolution images, DeepLabv3+ with MobileNetV4 delivered the shortest average inference time, combined with a tiny GPU memory footprint (Table 8). This combination is ideal for applications where short inference time and minimal resource usage are critical.
In our experiments, MobileNetV4 outperformed the other encoders in terms of processing speed across all evaluated models: its mean inference latency stayed below 9 ms for every model combination. It also exhibited minimal peak GPU utilization in all model–encoder configurations except when integrated with FPN. EfficientNet-B2, despite requiring an intermediate amount of memory for parameter loading, produced the longest mean inference time in three of the model variants.
When ranking architectures by overall efficiency, DeepLabv3+ with MobileNetV4 ranks first, closely followed by FPN + MobileNetV4. When paired with MobileNetV4, FPN achieves a low average inference time while requiring only a small amount of GPU memory. This configuration is preferable when the improved boundary detail offered by multi-scale feature fusion outweighs the slight increase in latency. For systems where the GPU must be shared with other processes, SegFormer paired with MobileNetV4 presents a strong alternative: despite a higher average latency of 8.59 ms, it requires only 24 MB of memory and reaches a peak utilization of approximately 47%.
Reducing the input image resolution by a factor of four led to substantial efficiency gains, with average latency decreasing by more than 2.5 times and maximum GPU usage dropping by 16 percentage points for both tested architectures. The U-Net with ResNet-34 configuration achieved both the shortest single inference time and the lowest mean latency.
4.3. Semantic OctoMaps Validation
To compare the quality of segmentation in 3D space, we processed recorded sequences from rosbag files through the ROS2 node while employing the complete set of previously trained models. We set the OctoMap resolution to 25 cm, enabling the creation of a large yet relatively precise representation of the environment. For analysis, we obtained a point cloud formed by points located at the centers of occupied voxels in the map. Approximately 16% of points were removed in both scenarios because they were not assigned a color during the mapping process: these points kept the default pure white color because their voxels did not receive the required number of measurements (in our implementation, a point must fall inside a voxel in two separate scans). To assign a point to a particular class, its color values must not differ by more than 5 units in any channel from the reference color values listed in Table 1. We define non-classified points (NC) as those which, despite having assigned color values, could not be matched to any semantic class because they did not meet the color-difference criterion. This situation stems from color blending in the OctoMap when a particular spatial region was repeatedly annotated with different labels; in such cases, there is insufficient confidence to assign the point to any semantic class. To establish ground truth data, scene elements were manually annotated with cuboids.
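A simplified sketch of this color-based class assignment is given below; the reference palette is a placeholder for the actual colors listed in Table 1, and the function name is ours.

```python
# Placeholder palette: the actual RGB reference colors are listed in Table 1.
REFERENCE_COLORS = {
    "Road":     (128, 64, 128),
    "Building": (70, 70, 70),
    # ... remaining classes from Table 1
}

def classify_voxel(color, tolerance=5):
    """Map a voxel-center RGB color to a semantic class.

    Returns the class name, 'NC' when no reference color lies within the
    per-channel tolerance (color blending between labels), or None when the
    voxel kept the default white color (too few measurements)."""
    r, g, b = (int(c) for c in color)
    if (r, g, b) == (255, 255, 255):
        return None  # uncolored voxel, dropped from evaluation (~16% of points)
    for name, (rr, rg, rb) in REFERENCE_COLORS.items():
        if abs(r - rr) <= tolerance and abs(g - rg) <= tolerance and abs(b - rb) <= tolerance:
            return name
    return "NC"
```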
We split the experiment into several phases. First, we evaluated segmentation performance for all model–encoder combinations at the default image resolution. Subsequently, we extended the analysis with a comparison using downscaled images and maps created by models trained on masks produced by Grounding DINO and SAM2. The results are presented in Table 9, Table 10, and Table 11, respectively. These tables report the following metrics: accuracy (ACC), weighted F1-score (wF1), where the weights are directly proportional to the fraction of points of each class in the complete map, and the percentage of unclassified voxels (NC). The index f appended to a metric name indicates that the value was computed for the subset of points remaining after filtering out those not assigned to any class. Finally, we compared the F1-scores achieved for individual classes by the top-performing models for each encoder type across all three experimental setups. These metrics are visualized in the graphs shown in Figure 6, Figure 7 and Figure 8. In Figure 9 and Figure 10, we present manual ground truth labels alongside OctoMaps generated using selected segmentation approaches.
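Assuming per-point predicted and ground-truth class labels for the colored voxel centers (with "NC" marking unclassified points, as in the sketch above), the map-level metrics can be approximated as follows. This is a simplified sketch: scikit-learn's weighted F1 weights classes by their support in the evaluated point set, which slightly simplifies the definition above, and the function name is ours.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def map_metrics(y_true, y_pred):
    """ACC, wF1, and NC for a semantic OctoMap, plus the filtered variants
    ACC_f and wF1_f computed after removing points labelled 'NC'.
    y_true / y_pred are arrays of class names over colored voxel centers."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)

    nc_mask = (y_pred == "NC")
    nc_rate = nc_mask.mean()          # fraction of unclassified points

    acc = accuracy_score(y_true, y_pred)
    wf1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)

    # Filtered metrics: only points that received a semantic class.
    acc_f = accuracy_score(y_true[~nc_mask], y_pred[~nc_mask])
    wf1_f = f1_score(y_true[~nc_mask], y_pred[~nc_mask],
                     average="weighted", zero_division=0)
    return {"ACC": acc, "wF1": wf1, "NC": nc_rate, "ACC_f": acc_f, "wF1_f": wf1_f}
```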
As shown in Table 9, FPN coupled with the lightweight MobileNetV4 backbone achieves the highest accuracy, filtered accuracy, and filtered weighted F1-score on the validation scene, together with a low unclassified-voxel rate. It also generalizes well to the test scene, indicating that hierarchical feature integration with a lightweight encoder can outperform other configurations in both accuracy and efficiency. On the other hand, DeepLabv3+ paired with EfficientNet-B2 not only achieves the lowest unclassified-voxel rate on the test split but also maintains competitive overall accuracy and weighted F1-score. This configuration therefore demonstrates an optimal balance between reducing unclassified areas and maintaining high segmentation accuracy. Furthermore, U-Net with a MobileNetV4 backbone delivers strong performance, particularly on the test scene, where it achieves the highest accuracy among all combinations and the highest filtered weighted F1-score, while recording competitive values for all metrics on the validation scene (Table 9). In the test scenario, its relatively low unclassified-voxel rate indicates that U-Net + MobileNetV4 produces one of the most complete semantic maps.
As reported in Table 10, at the lower resolution U-Net with ResNet-34 attained the lowest unclassified-voxel rate on the test split, producing the most complete occupancy map, which is crucial for robotic scenarios that depend on continuous environment reconstruction. It thus provides the most complete semantic coverage even at reduced resolution, with only marginal losses in overall accuracy and weighted F1-score. In practice, downscaling can significantly reduce computational cost while sacrificing only a few percentage points of filtered accuracy and weighted F1-score.
In Table 11, we show that masks generated by Grounding DINO and SAM2 in their current implementation demonstrate inadequate semantic precision, leading to an increase in the unclassified-voxel rate of approximately 20–30% and a reduction in the accuracy metrics of roughly 10–15%. For instance, on the validation scene, both DeepLabv3+ with EfficientNet-B2 and U-Net with ResNet-34 record markedly lower accuracy and weighted F1-scores, together with higher unclassified-voxel rates, than their counterparts trained on SegFormer-B5-derived masks. These results demonstrate that Grounding DINO + SAM2 masks lack the precision necessary for reliable semantic mapping.
In Figure 6, Figure 7 and Figure 8, we compare the per-class F1-scores of the top-performing models across the validation and test scenes. The model based on DeepLabv3+ with EfficientNet-B2 achieves high F1-scores of 0.91 for the Road class, 0.87 for Building, and 0.76 for Tree, demonstrating consistently strong performance across both scenes. The U-Net with MobileNetV4, despite its lightweight architecture, also delivers stable performance: on the test scene it attains F1-scores of 0.91 for Road, 0.87 for Building, and 0.76 for Tree, matching DeepLabv3+ with EfficientNet-B2, while on the validation scene it maintains strong results with scores of 0.92, 0.88, and 0.76 for the same classes. The U-Net model with a ResNet-34 encoder shows slightly lower performance across most categories, reaching F1-scores of 0.86 for Road, 0.79 for Building, and 0.75 for Tree. In contrast, the Traffic objects and Grass classes remain the most challenging for all evaluated models, with F1-scores consistently falling below 0.3, particularly when models are trained on SAM2-derived annotations. This observation indicates that segmentation models tend to underperform on objects that are small in spatial extent or infrequent within the training distribution.
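The per-class F1-scores visualized in these figures can be obtained, for example, with scikit-learn; the sketch below is illustrative, and the class list is a placeholder for the full palette defined in Table 1.

```python
from sklearn.metrics import f1_score

# Placeholder class list; the complete set of classes is defined in Table 1.
CLASS_NAMES = ["Road", "Building", "Tree", "Grass", "Traffic objects"]

def per_class_f1(y_true, y_pred):
    """Return a dict mapping each class name to its F1-score,
    computed over the classified voxel-center points."""
    scores = f1_score(y_true, y_pred, labels=CLASS_NAMES,
                      average=None, zero_division=0)
    return dict(zip(CLASS_NAMES, scores))
```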