1. Introduction
Rapid advancements in 3D scanning technologies, including LiDAR and RGB-D depth cameras, have significantly increased interest in the intelligent processing of point cloud data. Point cloud semantic segmentation, a fundamental task in 3D scene understanding and analysis, aims to accurately assign a semantic label to each point in 3D space. This technology is crucial for applications such as autonomous driving and intelligent robotics and has played an important role in their rapid development [1]. Unlike 2D images, 3D point clouds are unordered, sparse, and unstructured. These properties reduce the efficiency of traditional convolutional neural networks (CNNs) in processing such data [2].
Traditional point cloud semantic segmentation relies on manually extracting neighborhood geometric features, followed by machine learning classification. Recently, deep learning has made significant strides in image segmentation and natural language processing, and these techniques have been adapted to point cloud processing. This adaptation has led to notable improvements in semantic segmentation for small-scale point clouds. However, large-scale point clouds often consist of millions or even billions of points, which significantly increases scene complexity. This complexity has slowed the development of segmentation technologies for large-scale scenes [3,4].
Deep learning-based point cloud semantic segmentation methods are primarily categorized into projection-based [5,6,7,8], voxel-based [9,10,11,12], point-based [13,14,15,16,17,18,19,20,21,22,23], and related fusion methods [24,25,26,27]. Projection-based methods segment point clouds by converting them into 2D images, utilizing advanced 2D image segmentation algorithms. Voxel-based methods transform point clouds into voxel grids, similar to image pixels, and apply convolutional networks for feature extraction. However, projecting point clouds into 2D images inevitably leads to the loss of 3D spatial information. As a result, the intrinsic three-dimensional characteristics cannot be fully exploited. Similarly, voxelization causes detail loss due to quantization and increases computational complexity, reducing algorithm performance. In contrast, point-based methods directly use raw point clouds, preserving their fine-grained geometric structures and spatial relationships. Despite this advantage, they suffer from low processing efficiency, high computational costs, and a limited ability to extract global features.
PointNet [13], the first method for direct point-cloud processing, applies shared multilayer perceptrons to each point and aggregates global context with symmetric pooling functions. Because it processes points independently, PointNet cannot capture the local structure needed for fine-grained detail. PointNet++ [14] addressed this by introducing multi-level feature extraction and adaptive density sampling, which improved learning efficiency and robustness, but it still did not model relationships among neighboring points. RandLA-Net [15] introduced a simple, efficient design based on random sampling and local feature aggregation that markedly enhanced scalability for large-scale point clouds.
Zhao et al. observed that self-attention mechanisms are highly compatible with 3D vision tasks and were the first to introduce self-attention into point-cloud processing by proposing the Point Transformer [18] architecture. The framework leverages self-attention to capture local and global context, employs vector attention (VA) within local kNN neighborhoods for information exchange, and uses learnable relative positional encodings to model spatial relationships. Consequently, it achieved strong performance in point-cloud semantic segmentation. However, the Point Transformer has several limitations. First, the parameter count of vector attention increases rapidly with channel dimensionality, making the model prone to overfitting. Second, kNN search and relative positional encoding are computationally expensive, which reduces efficiency. Finally, the overall framework is difficult to scale. To address these issues, Wu et al. proposed Grouped Vector Attention (GVA) in the improved Point Transformer V2 [19] network, replacing VA with group-shared weights to reduce parameters and mitigate overfitting. Nevertheless, Point Transformer V2 still depended on kNN search and complex positional encoding, which constrained efficiency and receptive field expansion. Point Transformer V3 [20] further improved efficiency by using serialized neighborhoods based on space-filling curves instead of kNN queries, simplifying attention interactions, and replacing relative positional encoding with enhanced conditional positional encoding (xCPE), thereby accelerating inference and reducing memory consumption.
Fan et al. proposed SCF-Net [21], which learns spatial contextual features to capture correlations between points and thereby improves segmentation performance. Liu et al. introduced DG-Net [2], which fuses geometric structure with semantic features and models long-range dependencies, enabling better segmentation of objects with complex geometries. NeiEA-Net [28] optimizes local neighborhoods in 3D Euclidean space through two modules, Neighborhood Feature Enhancement (NFE) and Neighborhood Feature Aggregation (NFA), effectively learning local details in point clouds. LACV-Net [16] addresses local perception ambiguity and insufficient global features in large-scale point cloud semantic segmentation through three components: the Local Adaptive Feature Augmentation (LAFA) module, an aggregation loss function, and the Comprehensive Vector of Locally Aggregated Descriptors (C-VLAD) module. However, its segmentation accuracy is still limited in scenes with complex scale variations.
Recent methods have advanced modeling of inter-point dependencies and long-range context, substantially improving large-scale point cloud semantic segmentation. However, explicit modeling of geometric perturbations, such as point position offsets and variations in surface normals, remains limited in many existing approaches. Consequently, accurately characterizing complex geometric boundaries (e.g., wall corners and object edges) and extracting discriminative features for small structures (e.g., thin poles and fences) are still challenging. In addition, downsampling often removes fine details, making it difficult to reconcile global context with preservation of local structure.
Existing attention-based point-cloud networks also have notable limitations. First, attention mechanisms are often applied along a single dimension and thus optimize only one aspect of feature representation, which is inadequate for the multi-dimensional feature expression required in complex scenes. For example, the attention mechanism in RandLA-Net [15] focuses solely on spatial local feature aggregation and cannot suppress redundant channel-wise information. Although NeiEA-Net [28] enhances neighborhood spatial features, it does not perform adaptive weighting in the channel dimension. Second, attention deployment is imbalanced: most networks concentrate attention modules in the encoder for local feature extraction, while the decoder relies mainly on simple skip connections for feature fusion and lacks effective attention guidance. Consequently, important features compressed during encoding are easily corrupted by redundant information during decoding, which limits segmentation accuracy. Finally, large-scale point-cloud datasets often exhibit severe class imbalance, and existing models generally have insufficient capacity to learn categories that contain few sample points.
To address these challenges, this paper further optimizes the LACV-Net [16] architecture. During the encoding stage, offset information is extracted separately from spatial coordinates and feature channels, and contextual information is fused via max pooling and weighted summation. This symmetric interaction strategy alleviates semantic ambiguity among neighboring points and strengthens local contextual representation. Meanwhile, a Spatial Attention Mechanism (SAM) is introduced to generate attention maps using 1 × 1 convolution and transposed convolution, followed by Sigmoid activation to weight local features and emphasize key spatial regions. In the decoding stage, a Channel Attention Mechanism (CAM) is incorporated; it learns channel-wise weights through a shared MLP applied to globally average-pooled and max-pooled features, enhancing feature expressiveness. The symmetric deployment of spatial and channel attention mechanisms addresses the dimensional and structural limitations of existing attention-based networks. Moreover, Lovász-Softmax Loss [29] is introduced as an auxiliary optimization target alongside the original loss. This complementary mechanism enhances the model’s ability to segment small targets, class boundaries, and large object regions. It overcomes the limitations of using a single loss for region-level optimization, thereby improving segmentation accuracy. Key innovations include the following:
- (a) A Spatial-Feature Dynamic Aggregation (SFDA) module is constructed. Offset information is extracted through symmetric interaction in two dimensions (spatial position and feature channel), enhancing the local context representation of point clouds.
- (b) An Adaptive Attention Fusion Mechanism (AAFM) module is designed. Spatial Attention Mechanism (SAM) and Channel Attention Mechanism (CAM) are deployed symmetrically in the encoder and decoder, respectively. This builds symmetric saliency perception across spatial and channel dimensions, improving the model’s feature expression capability.
- (c) An IoU-oriented loss function collaborative optimization framework is established. Lovász-Softmax Loss is introduced as an auxiliary optimization objective. Through a complementary mechanism, the model’s ability to segment small targets, class boundaries, and complete regions of large objects is enhanced, overcoming the limitations of a single loss in region-level optimization.
2. Methodology
The LSSCC-Net point cloud semantic segmentation network introduced in this paper builds on the LACV-Net architecture [16]. LACV-Net employs an encoder–decoder framework, utilizing the Local Adaptive Feature Augmentation (LAFA) module to reduce local perception ambiguity. It integrates global features with the Comprehensive Vector of Locally Aggregated Descriptors (C-VLAD) module and improves model convergence with an aggregation loss function, performing well in large-scale point cloud segmentation tasks. To address the limitations of LACV-Net in local structure modeling, feature selection, and loss optimization, three key improvements are introduced while preserving the original encoder–decoder architecture. First, a Spatial-Feature Dynamic Aggregation (SFDA) module is introduced during the encoding stage to enhance robustness to spatial structure variations through symmetric interaction modeling. Second, the Adaptive Attention Fusion Mechanism (AAFM) module is incorporated, adding the Spatial Attention Mechanism (SAM) in the encoding stage to emphasize salient regions and the Channel Attention Mechanism (CAM) in the decoding stage to enhance feature expression. Third, Lovász-Softmax Loss is employed as an auxiliary optimization objective to improve segmentation of small objects, category boundaries, and large object regions. This section details LSSCC-Net and its enhanced modules; the overall architecture is depicted in Figure 1.
LSSCC-Net comprises an encoder, a decoder, a classifier, and a C-VLAD (Comprehensive Vector of Locally Aggregated Descriptors) module that connects the encoder and decoder. In the encoding phase, five layers are implemented. Each layer processes input features using the Local Adaptive Feature Augmentation (LAFA) and SFDA modules. After merging outputs from these modules, the Spatial Attention Mechanism (SAM) applies feature weighting to emphasize critical spatial features. The encoder progressively reduces the number of points through downsampling (N → N/4 → N/16 → N/64 → N/256) while increasing the channel size (16 → 64 → 128 → 256 → 512). These downsampled features serve as input for the next encoder layer. Positioned between the encoder and decoder, the C-VLAD module aggregates local features from all preceding encoding layers. The decoder restores the point cloud count through upsampling. Each decoding layer incorporates the Channel Attention Mechanism (CAM) and a Multi-Layer Perceptron (MLP), using skip connections to transfer features between the encoder and decoder, thus preserving local detail. Finally, the classifier employs three fully connected layers and a Dropout layer to predict semantic labels, completing the semantic segmentation process.
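The resolution and width schedule of the encoder can be tabulated with a few lines of Python. This is only an illustrative sketch: the helper name `encoder_schedule` is hypothetical, and pairing the i-th point count with the i-th channel size is an assumed reading of the two sequences given above. The example input size of 30,720 points is the training setting reported in the experiments.

```python
# Hypothetical helper that tabulates the encoder schedule described in the text.
# Assumption: the i-th point count (N / 4**i) is paired with the i-th channel size.
def encoder_schedule(num_points, ratio=4, channels=(16, 64, 128, 256, 512)):
    """Return a list of (points, channels) pairs, one per encoder stage."""
    schedule = []
    for level, width in enumerate(channels):
        schedule.append((num_points // ratio ** level, width))
    return schedule


schedule = encoder_schedule(30720)
for level, (points, width) in enumerate(schedule):
    print(f"stage {level}: {points} points x {width} channels")
```

With 30,720 input points, the deepest stage under this reading holds 120 points with 512 channels, which illustrates how aggressively fine detail is compressed before the decoder restores resolution.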
2.1. Spatial-Feature Dynamic Aggregation Module
The original model aggregates neighborhood information primarily using local relative coordinates and simple attention pooling. This design limits its ability to handle geometric perturbations and feature variations between points, thereby hindering accurate characterization of complex boundaries and fine-grained structures. To overcome these limitations, the network presented here incorporates an SFDA module into the encoding stage, enhancing local structure perception by modeling symmetric interactions between spatial positions and feature channels (Figure 2). The central idea of the SFDA module is to extract offset information through symmetric interactions along two dimensions, spatial location and feature channel, in order to enrich local contextual representations of point clouds and reduce semantic ambiguity among neighboring points (for example, classification confusion arising from densely distributed points near boundaries and the difficulty of identifying local geometric structures of small objects).
This module first establishes spatial and feature offsets from local coordinate differences. It then encodes and fuses these offsets to create enhanced features capable of perturbation perception. Specifically, the K-nearest neighbor (KNN) algorithm based on Euclidean distance is first used to find the neighboring points of each center point, together with the feature information of the center point and of its neighbors. The absolute position of the center point and the relative positions of its neighboring points are then concatenated into local contextual information. Correspondingly, this construction yields the local spatial contextual information from the coordinates and the local feature contextual information from the features.
Relying solely on a fixed structure and neighborhood constraints in the three-dimensional space of point cloud data often results in poor generalization and feature redundancy. To address this, the module learns offsets symmetrically from both spatial and feature dimensions to enhance local contextual point information. This process involves two main steps. The first is spatial offset learning: based on the rich feature information of the neighborhood, a spatial offset is learned through a Multi-Layer Perceptron (MLP), and the coordinates of the neighboring points are adjusted by this offset to obtain the enhanced local spatial contextual information. The second is feature offset learning: utilizing the enhanced local spatial contextual information, a feature offset is learned, and the features of the neighboring points are adjusted by this offset. The enhanced local feature contextual information is then formed from the feature of the center point and the offset-adjusted features of its neighboring points. Subsequently, the enhanced spatial and feature contextual information are integrated using an MLP to produce the local spatial-feature contextual information.
The integration of the SFDA module into each encoding layer allows for the step-by-step processing of point cloud data at varying resolutions, effectively capturing multi-scale information. Each SFDA module receives spatial coordinates and feature information as inputs, producing enriched feature representations through enhanced symmetric information interaction and aggregation in both spatial and feature dimensions.
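The two-step offset learning described in this subsection can be sketched in NumPy as follows. This is a minimal illustration under several assumptions rather than the authors' implementation: the function names are hypothetical, the weights are random stand-ins for trained parameters, each MLP is reduced to a single linear layer, neighbor search is brute force, and the final aggregation uses max pooling only.

```python
import numpy as np


def knn_indices(xyz, k):
    """Brute-force Euclidean kNN; returns (N, k) indices (each point includes itself)."""
    d2 = np.sum((xyz[:, None, :] - xyz[None, :, :]) ** 2, axis=-1)
    return np.argsort(d2, axis=1)[:, :k]


def local_context(center, neighbors):
    """Concatenate the center value with each neighbor's value relative to the center."""
    tiled = np.broadcast_to(center[:, None, :], neighbors.shape)
    return np.concatenate([tiled, neighbors - tiled], axis=-1)


def sfda_sketch(xyz, feats, k=16, out_dim=32, seed=0):
    rng = np.random.default_rng(seed)            # random weights stand in for trained ones
    d = feats.shape[1]
    w_xyz = rng.normal(scale=0.1, size=(2 * d, 3))    # feature context -> spatial offset
    w_feat = rng.normal(scale=0.1, size=(6, d))       # spatial context -> feature offset
    w_fuse = rng.normal(scale=0.1, size=(6 + 2 * d, out_dim))

    idx = knn_indices(xyz, k)
    nb_xyz, nb_feat = xyz[idx], feats[idx]            # (N, k, 3) and (N, k, d)

    # Step 1: spatial offset learned from the local feature context.
    ctx_feat = local_context(feats, nb_feat)          # (N, k, 2d)
    nb_xyz_shifted = nb_xyz + ctx_feat @ w_xyz
    ctx_xyz_enh = local_context(xyz, nb_xyz_shifted)  # (N, k, 6)

    # Step 2: feature offset learned from the enhanced spatial context.
    nb_feat_shifted = nb_feat + ctx_xyz_enh @ w_feat
    ctx_feat_enh = local_context(feats, nb_feat_shifted)  # (N, k, 2d)

    # Step 3: fuse both contexts (one ReLU layer), then max-pool over the k neighbors.
    fused = np.concatenate([ctx_xyz_enh, ctx_feat_enh], axis=-1) @ w_fuse
    return np.maximum(fused, 0.0).max(axis=1)         # (N, out_dim)


points = np.random.default_rng(1).random((64, 3))
features = np.random.default_rng(2).random((64, 8))
out = sfda_sketch(points, features)
print(out.shape)
```

The ordering of the two steps mirrors the text: features drive the coordinate adjustment first, and the adjusted coordinates then drive the feature adjustment, so each dimension informs the other.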
2.2. Adaptive Attention Fusion Mechanism Module
Existing networks often struggle to automatically identify key spatial regions or informative feature dimensions in point cloud data, which limits their ability to capture salient information. An AAFM module is proposed in this study to overcome this limitation. The module symmetrically integrates a Spatial Attention Mechanism (SAM) in the encoder and a Channel Attention Mechanism (CAM) in the decoder, forming a saliency perception module for both spatial and channel dimensions. The encoder’s primary role is feature extraction. Spatial Attention Mechanism (SAM) enhances this by generating an attention map using 1 × 1 convolution and transposed convolution. It then applies the Sigmoid function to activate and weight local features, enabling the model to focus on crucial spatial regions and extract more representative features. Conversely, the decoder’s main task is feature recovery. Channel Attention Mechanism (CAM) contributes by learning channel weights through a shared Multi-Layer Perceptron (MLP) after performing global average pooling and max pooling. This process helps adjust the weights of different channels, thereby recovering more accurate semantic information. By applying the Spatial Attention Mechanism (SAM) and the Channel Attention Mechanism (CAM) to the encoder and decoder, respectively, the module leverages complementary strengths through symmetric collaboration. This approach optimizes feature extraction and recovery, enhancing the model’s performance and generalization ability.
2.2.1. Spatial Attention Mechanism
During the collection of 3D point cloud data, various sources of noise and irrelevant information often arise, degrading point cloud semantic segmentation performance. To address this issue, the proposed network incorporates the Spatial Attention Mechanism (SAM) during the encoding stage. This mechanism enhances feature extraction by concentrating on key regions and filtering out redundant information, thereby helping the model capture essential spatial structure features. As shown in Figure 3, SAM applies a 1 × 1 transposed convolution to the feature produced by the LAFA and SFDA modules, generating an intermediate feature map with the same number of channels as the input feature. The Sigmoid activation function is then applied to this intermediate feature map to map its values to the range [0, 1], yielding the spatial attention map. Subsequently, the spatial attention map is multiplied element-wise with the input feature to obtain the feature weighted by SAM.
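The three operations of the spatial attention branch (channel-preserving 1 × 1 transposed convolution, Sigmoid, element-wise re-weighting) can be illustrated with a short NumPy sketch. It assumes a stride of 1, in which case a 1 × 1 transposed convolution reduces to a per-point linear map; the function name, the (N, C) feature layout, and the random weights are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def spatial_attention(feats, weight, bias):
    """feats: (N, C) point features; weight: (C, C); bias: (C,).

    Assumption: with stride 1, a 1x1 (transposed) convolution acts as a
    per-point linear map, so it is written here as a matrix product.
    """
    intermediate = feats @ weight + bias   # same number of channels as the input
    attention = sigmoid(intermediate)      # attention map with values in (0, 1)
    return attention * feats, attention    # element-wise re-weighting of the input


rng = np.random.default_rng(0)
x = rng.normal(size=(1024, 64))
w = rng.normal(scale=0.1, size=(64, 64))   # random stand-in for learned weights
weighted, attn = spatial_attention(x, w, np.zeros(64))
print(weighted.shape, float(attn.min()), float(attn.max()))
```

Because every attention value lies between 0 and 1, the mechanism can only attenuate a feature entry, never amplify it; emphasis on key regions is therefore relative to the suppressed ones.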
2.2.2. Channel Attention Mechanism
Processing point cloud data in large-scale scenes often suffers from inter-class ambiguity. Objects with similar shapes and structures can be challenging to label correctly with semantic tags. To address this, the network incorporates the Channel Attention Mechanism (CAM) during the decoding stage. CAM enhances the network’s feature expression capability by adjusting channel weights and learning inter-channel correlations to highlight crucial channel information. As shown in Figure 4, CAM performs global average pooling and global max pooling on the input feature, compressing the feature information of each channel into a scalar to obtain an average-pooled feature vector and a max-pooled feature vector. Two fully connected operations, separated by an activation function and sharing the same weight matrices for both branches, are applied to each pooled vector to obtain two intermediate feature vectors. The two intermediate vectors are then added, and an activation function is applied to the sum to obtain the channel attention map. Subsequently, the channel attention map is multiplied element-wise with the input feature to obtain the feature enhanced by CAM.
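A compact NumPy sketch of the channel attention branch is given below. The pooling, shared two-layer MLP, summation, and re-weighting follow the description above; the specific choices of ReLU inside the MLP, Sigmoid as the final activation, and a channel reduction ratio of 4 follow common channel-attention designs and are assumptions here, as are the function name and random weights.

```python
import numpy as np


def channel_attention(feats, w0, w1):
    """feats: (N, C); w0: (C, C // r); w1: (C // r, C).

    Assumptions: ReLU between the two fully connected layers and a Sigmoid
    on the summed branches; the reduction ratio r is also illustrative.
    """
    avg_vec = feats.mean(axis=0)              # global average pooling -> (C,)
    max_vec = feats.max(axis=0)               # global max pooling -> (C,)

    def shared_mlp(vec):
        # The same two weight matrices are used for both pooled vectors.
        return np.maximum(vec @ w0, 0.0) @ w1

    summed = shared_mlp(avg_vec) + shared_mlp(max_vec)
    attention = 1.0 / (1.0 + np.exp(-summed)) # one weight per channel
    return feats * attention, attention       # broadcast over all points


rng = np.random.default_rng(0)
x = rng.normal(size=(256, 32))
w0 = rng.normal(scale=0.1, size=(32, 8))      # random stand-ins for learned weights
w1 = rng.normal(scale=0.1, size=(8, 32))
enhanced, attn = channel_attention(x, w0, w1)
print(enhanced.shape, attn.shape)
```

Unlike the spatial branch, the attention here is a single vector of length C shared by all points, so it rescales whole channels rather than individual point features.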
2.3. Constructing an IoU-Oriented Loss Function Collaborative Optimization Framework
An IoU-oriented optimization strategy is adopted, incorporating Lovász-Softmax Loss as an auxiliary component to form a collaborative optimization framework. This framework combines the aggregation loss with the Lovász-Softmax Loss. Through this complementary mechanism, the model improves its segmentation of small objects, category boundaries, and large object regions, overcoming the limitations associated with using a single loss in region-level optimization.
The Lovász-Softmax Loss reformulates the optimization of the Jaccard index, also known as Intersection over Union (IoU), as a continuous problem suitable for gradient-based training. IoU is a crucial metric for evaluating point cloud semantic segmentation, effectively reflecting segmentation quality. However, IoU is a discrete, non-differentiable function of the predicted labels, which complicates direct optimization within neural networks. By applying the Lovász extension to sorted errors, the Lovász-Softmax Loss replaces the IoU loss with a convex, piecewise-linear surrogate that can be optimized through (sub)gradient descent.
The Lovász-Softmax Loss utilizes the Lovász extension to prioritize misclassified samples based on their potential to enhance the IoU. In multi-class semantic segmentation, it optimizes the IoU loss for each class and averages these losses to obtain the overall loss. The Lovász-Softmax Loss is represented as
$$\mathcal{L}_{\mathrm{Lov}} = \frac{1}{C} \sum_{i=1}^{C} \overline{\Delta_{J_i}}\bigl(m(i)\bigr)$$
where $m(i)$ denotes the vector of point-wise errors for the $i$-th class among the $C$ classes, and $\overline{\Delta_{J_i}}$ denotes the Lovász extension of the Jaccard index. The final total loss combines the aggregation loss of the baseline network with the Lovász-Softmax Loss.
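The per-class procedure (compute point-wise errors, sort them in decreasing order, weight them by the gradient of the Lovász extension, and average over classes) can be sketched in NumPy as follows. This follows the commonly published formulation of the Lovász-Softmax loss; averaging only over classes present in the labels is one of its standard variants and is an assumption with respect to this paper, as are the function names.

```python
import numpy as np


def lovasz_grad(gt_sorted):
    """Gradient of the Lovász extension of the Jaccard loss.

    gt_sorted: 0/1 foreground labels ordered by decreasing prediction error.
    """
    total_fg = gt_sorted.sum()
    intersection = total_fg - np.cumsum(gt_sorted)
    union = total_fg + np.cumsum(1.0 - gt_sorted)
    jaccard = 1.0 - intersection / union
    jaccard[1:] = jaccard[1:] - jaccard[:-1]   # successive differences
    return jaccard


def lovasz_softmax(probs, labels):
    """probs: (P, C) softmax outputs; labels: (P,) integer class ids.

    Assumption: the mean is taken over classes that occur in `labels`.
    """
    losses = []
    for c in np.unique(labels):
        fg = (labels == c).astype(float)
        errors = np.abs(fg - probs[:, c])      # point-wise errors for class c
        order = np.argsort(-errors)            # largest errors first
        losses.append(np.dot(errors[order], lovasz_grad(fg[order])))
    return float(np.mean(losses))


labels = np.array([0, 0, 1, 1])
perfect = np.eye(2)[labels]          # one-hot predictions equal to the labels
inverted = np.eye(2)[1 - labels]     # every point assigned to the wrong class
print(lovasz_softmax(perfect, labels), lovasz_softmax(inverted, labels))
```

In this toy case the loss is 0 for the perfect prediction and 1 for the fully inverted one, matching the range of the Jaccard loss (1 − IoU) that the surrogate extends.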
3. Experiments and Analysis
To comprehensively evaluate the performance of the proposed LSSCC-Net architecture for large-scale point cloud semantic segmentation, experiments were conducted on two widely used benchmark datasets: the outdoor Toronto3D [30] and the indoor S3DIS [31]. Ablation experiments were also conducted on modified network modules to validate the contributions of the proposed components.
All experiments were conducted on a hardware platform featuring an Intel Core i7-14700KF CPU and an NVIDIA GeForce RTX 4070 Ti SUPER GPU (16 GB VRAM), running Ubuntu 20.04. The model was constructed and trained using TensorFlow 2.4 within a Python 3.8 environment. Point clouds were visualized primarily with CloudCompare (version 2.13.2). For the Toronto3D and S3DIS datasets, grid sampling resolutions were 0.06 m and 0.04 m, respectively. Hardware limitations necessitated hyperparameter settings that differ from those of the baseline network: the number of input points for training was 30,720, with a batch size of 4, over 100 epochs. The Adam optimizer was used with an initial learning rate of 0.01, and the number of nearest neighbors K was set to 16.
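Grid sampling at a fixed resolution can be illustrated with a short NumPy sketch. The text does not state how each occupied cell is reduced to a single point, so keeping the centroid is an assumption here, and the function name `grid_subsample` is hypothetical; only coordinates are handled, whereas colors and labels would need analogous treatment.

```python
import numpy as np


def grid_subsample(xyz, cell):
    """Keep one point (the centroid) per occupied cubic cell of edge `cell` metres."""
    voxel = np.floor(xyz / cell).astype(np.int64)           # integer cell index per point
    _, inverse = np.unique(voxel, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)                            # guard against NumPy-version shape differences
    counts = np.bincount(inverse)
    sums = np.zeros((counts.size, 3))
    np.add.at(sums, inverse, xyz)                            # accumulate coordinates per cell
    return sums / counts[:, None]


cloud = np.array([[0.00, 0.00, 0.00],
                  [0.02, 0.02, 0.02],    # same 0.06 m cell as the first point
                  [0.10, 0.00, 0.00]])   # falls in a neighbouring cell
sub = grid_subsample(cloud, cell=0.06)   # 0.06 m is the Toronto3D resolution in the text
print(sub)
```

A finer cell (0.04 m for S3DIS) keeps more points per unit volume, which is consistent with indoor scenes containing smaller objects than street scenes.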
To quantitatively evaluate semantic segmentation performance, three metrics were adopted: overall accuracy (OA), class-wise Intersection-over-Union (IoU), and mean Intersection-over-Union (mIoU). The formulas for these metrics are outlined below:
$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
$$\mathrm{mIoU} = \frac{1}{C} \sum_{i=1}^{C} \mathrm{IoU}_i$$
In this context, $TP$ represents true positives, indicating positive samples correctly predicted by the model. $TN$ stands for true negatives, referring to negative samples accurately identified as negative. $FP$ denotes false positives, where negative samples are incorrectly predicted as positive. $FN$ signifies false negatives, where positive samples are mistakenly identified as negative. $C$ indicates the number of categories in the dataset.
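In practice, all three metrics can be derived from a single confusion matrix, as in the following NumPy sketch. The function name is illustrative, and two conventions are assumed: OA is computed in its multi-class form (correctly labelled points divided by all points), and a class that appears in neither the labels nor the predictions receives an IoU of 0 and still counts toward the mean.

```python
import numpy as np


def segmentation_metrics(y_true, y_pred, num_classes):
    """Return (OA, per-class IoU, mIoU) from integer label arrays."""
    cm = np.bincount(num_classes * y_true + y_pred,
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm).astype(float)       # rows: ground truth, columns: prediction
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    denom = tp + fp + fn
    iou = np.divide(tp, denom, out=np.zeros_like(tp), where=denom > 0)
    oa = tp.sum() / cm.sum()
    return oa, iou, iou.mean()


y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
oa, iou, miou = segmentation_metrics(y_true, y_pred, num_classes=3)
print(round(oa, 4), np.round(iou, 4), round(miou, 4))
# OA = 4/6, IoU = [1/3, 2/3, 1/2], mIoU = 0.5
```

The toy example shows why mIoU is the stricter metric: two thirds of the points are correct, yet the mean IoU is only 0.5 because each error penalizes two classes at once.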
3.1. Analysis of Point Cloud Semantic Segmentation Results
3.1.1. Semantic Segmentation for Outdoor Scenes
The Toronto3D outdoor laser point cloud dataset was collected on Toronto roads using a vehicle-mounted mobile laser scanning (MLS) system. It features various urban street scenes along about 1 km of roadway, divided into four regions, each approximately 250 m long. The dataset comprises over 78.3 million points, each with XYZ coordinates, RGB color information, and a manually annotated label from 8 semantic categories, including building, road, tree, and car. In this study, we adhered to the official training and test set division: Region 2 was the test set, while the other three regions served as the training set.
As shown in Table 1, LSSCC-Net achieves the best overall performance among the compared methods in terms of Overall Accuracy (OA, 97.7%) and mean Intersection over Union (mIoU, 83.6%). Specifically, it attains the best IoU values for three categories, namely “road”, “road mark”, and “fence”, which are 0.4%, 3.6%, and 3.9% higher than those of the second-ranked algorithm, respectively. Under identical experimental settings, LSSCC-Net improves OA by 0.4% and mIoU by 2.1% compared with the baseline LACV-Net. Furthermore, for six categories (road, road mark, natural, building, pole, and fence), the IoU values of the improved LSSCC-Net are 0.4%, 3.6%, 0.7%, 0.6%, 3.6%, and 8.1% higher than those of the baseline LACV-Net, respectively. These results demonstrate the effectiveness of the proposed improvements.
To effectively compare the segmentation performance of various methods on the Toronto3D dataset, Figure 5 illustrates the overall scene segmentation visualizations for RandLA-Net, LACV-Net, and LSSCC-Net. Additionally, Figure 6 displays the local semantic segmentation results for the same methods on the Toronto3D dataset.
Figure 5 demonstrates that LSSCC-Net (ours) more accurately characterizes complex boundaries and detailed structures, significantly enhancing segmentation capabilities for small objects, class boundaries, and large object regions. This improvement can be mainly attributed to the Spatial-Feature Dynamic Aggregation (SFDA) module in LSSCC-Net. By jointly modeling local spatial features, it strengthens local structural perception, improving segmentation of dense, confusing points near boundaries and small objects with complex geometries. Additionally, the model optimizes feature extraction and restoration through an Adaptive Attention Fusion Mechanism (AAFM), enhancing performance and generalization. A–D denote representative local regions, whose enlarged visualizations are shown in Figure 6.
Figure 6 illustrates that LSSCC-Net outperforms LACV-Net and RandLA-Net in distinguishing small objects and accurately segmenting boundaries in the three local detailed regions. The red boxes highlight areas where LSSCC-Net provides more precise segmentation and clearer boundary delineation than the baseline LACV-Net. The experimental results show that LSSCC-Net offers detailed segmentation of boundaries between road and road mark, as well as accurate segmentation of junctions between util.line and pole. Additionally, it excels in segmenting small objects that are often confused with other categories, such as pole bases and fence.
3.1.2. Semantic Segmentation for Indoor Scenes
The S3DIS (Stanford Large-Scale 3D Indoor Spaces) dataset, released by Stanford University, is a widely used resource for indoor scene understanding. It comprises six areas, each containing 11 room types, totaling 272 rooms. Scenes range from 0.5 to 2.5 million points, with an overall count of about 290 million points. Each point includes XYZ coordinates, RGB color data, and manually annotated labels, covering 13 semantic categories like ceiling, floor, wall, beam, and column.
We employed two evaluation methods to assess the proposed network: using Area 5 as an independent test set and conducting 6-fold cross-validation. Area 5 was chosen due to its unique characteristics, such as distinctive room layouts, object distributions, and point cloud densities, which make the segmentation results more representative. In the 6-fold cross-validation, each of the six large indoor areas in the dataset served as the test set once, with the remaining five areas used for training. This process was repeated six times, and the results were averaged to provide a comprehensive evaluation metric.
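The 6-fold protocol amounts to a leave-one-area-out loop, sketched below in Python. The area identifiers and the function name are illustrative assumptions; the sketch only enumerates the splits and does not reproduce any training or reported result.

```python
# Illustrative area identifiers; the actual directory names may differ.
AREAS = ["Area_1", "Area_2", "Area_3", "Area_4", "Area_5", "Area_6"]


def six_fold_splits(areas):
    """Return (train_areas, test_area) pairs, holding out each area exactly once."""
    return [([a for a in areas if a != held_out], held_out) for held_out in areas]


splits = six_fold_splits(AREAS)
for train_areas, test_area in splits:
    print(f"test on {test_area}, train on {', '.join(train_areas)}")

# The Area 5 protocol corresponds to the single split whose held-out area is Area_5.
area5_split = next(s for s in splits if s[1] == "Area_5")
```

Each fold trains a separate model, and the per-fold metrics are then averaged as described above, so the cross-validated score reflects six independent train/test runs rather than one.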
Table 2 shows that, relative to other recent top-performing methods, LSSCC-Net attains slightly lower overall accuracy (OA) than LGGCM and DG-Net, and its mean IoU (mIoU) is marginally below DG-Net. Nevertheless, the proposed network achieves the highest IoU in three categories—column, window, and board—exceeding the second-best method by 2.8%, 0.2%, and 1.5%, respectively, and it ranks second in IoU for an additional four categories. Under identical experimental settings and hyperparameters, the improved LSSCC-Net matches the baseline LACV-Net in OA (88.3%) while improving mIoU to 65.2%, a gain of 0.9%. Compared with LACV-Net, LSSCC-Net increases IoU in eight categories: ceiling (+0.1%), floor (+0.7%), wall (+0.5%), column (+8.9%), window (+0.2%), door (+5.2%), table (+0.3%), and board (+5.4%). These gains validate the effectiveness of the proposed improvements.
Notably, compared with the Toronto3D dataset, the overall performance improvement of LSSCC-Net on the S3DIS dataset was relatively limited. This difference may be attributed to intrinsic variations in point-cloud scene characteristics. A category-level analysis showed that the proposed method produced more pronounced IoU gains for slender or geometrically complex classes, such as columns, windows, and boards, indicating an advantage in modeling complex geometric structures. Furthermore, the SFDA module proved more effective at characterizing objects with clear boundaries or elongated shapes, while its contribution to large planar surfaces (e.g., floor and ceiling) was relatively limited. Overall, experimental results across both datasets suggested that LSSCC-Net had greater benefits in scenes with complex geometry, imprecise boundaries, and a high proportion of small-scale objects. This trait made it particularly effective in outdoor urban environments such as Toronto3D, whereas performance gains were comparatively modest in indoor scenes dominated by regular planar structures.
Table 3 presents the quantitative evaluation results from a 6-fold cross-validation on the S3DIS dataset. Using the experimental setup and hyperparameters specified in this study, the enhanced LSSCC-Net method showed increases of 0.4% in OA (reaching 88.5%) and 0.8% in mIoU (achieving 70.9%) compared to the baseline LACV-Net. These results suggest that the improved method exhibits robust performance and adaptability across various point cloud scenarios.
Figure 7 illustrates the segmentation outcomes of RandLA-Net, LACV-Net, and LSSCC-Net on Area 5 of the S3DIS dataset. Red boxes highlight areas where LSSCC-Net surpasses LACV-Net in segmentation accuracy. The results show that LSSCC-Net excels in differentiating walls from columns, accurately segments similar categories, and enhances boundary segmentation between distinct objects. The performance advantage primarily stems from LSSCC-Net’s inclusion of an additional SFDA module during the encoding phase. This module extracts offset information from both spatial positions and feature channels, facilitating interaction between these dimensions. Consequently, it effectively resolves semantic ambiguity in neighboring points.
3.2. Ablation Experiments
To analyze the contributions of individual components in LSSCC-Net, an ablation study was conducted focusing on three key components: the SFDA module, the AAFM, and the Lovász-Softmax Loss. These experiments were carried out using the Toronto3D dataset, adhering to the study’s specific experimental environment and hyperparameters. In line with previous experiments, Region 2 served as the test set, while the other three regions were used for training.
Table 4 displays the results of the ablation experiments, with bold numbers indicating the best performance for each metric. The table uses the following notations:
A: Removal of the SFDA module. In the full model, features in each encoding layer are processed separately by the Local Adaptive Feature Augmentation (LAFA) module and the SFDA module and then combined. Without the SFDA module, features are processed solely by the LAFA module and passed directly to the Spatial Attention Module (SAM).
B: Removal of the AAFM. The Spatial Attention Module (SAM) in the encoding layers and the Channel Attention Module (CAM) in the decoding layers are both eliminated, so feature extraction and restoration are no longer refined by attention.
C: Removal of the Lovász-Softmax Loss, retaining only the aggregation loss function of the baseline network.
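The three ablated variants can be summarized as configuration switches. The sketch below is purely illustrative: `AblationConfig`, `VARIANTS`, and `describe` are hypothetical names introduced here, and the component lists only restate the description above rather than the actual network code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AblationConfig:
    """Hypothetical switches for the three components under study."""
    use_sfda: bool = True     # SFDA branch alongside LAFA in each encoding layer
    use_aafm: bool = True     # SAM in the encoder plus CAM in the decoder
    use_lovasz: bool = True   # Lovász-Softmax as an auxiliary loss

VARIANTS = {
    "A": AblationConfig(use_sfda=False),
    "B": AblationConfig(use_aafm=False),
    "C": AblationConfig(use_lovasz=False),
    "LSSCC-Net": AblationConfig(),
}

def describe(cfg):
    """List the components that remain active for a given configuration."""
    encoder = ["LAFA"]
    if cfg.use_sfda:
        encoder.append("SFDA")
    if cfg.use_aafm:
        encoder.append("SAM")
    decoder = ["CAM"] if cfg.use_aafm else []
    losses = ["aggregation"]
    if cfg.use_lovasz:
        losses.append("lovasz_softmax")
    return {"encoder": encoder, "decoder": decoder, "losses": losses}
```

Each variant disables exactly one component, so any difference from the full model in Table 4 can be attributed to that component alone.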
Table 4 reveals several key observations. When the SFDA module is removed from LSSCC-Net (Model A), the mean Intersection over Union (mIoU) decreases by 0.2%. Specifically, the IoU for road mark and fence drops by 1.4% and 3.6%, respectively, while the IoU for building and pole increases by 1.6% and 1.7%. This result suggests that the SFDA module enhances local contextual representation, particularly for objects with ambiguous boundaries and complex local geometries, at a small cost for some other classes. Similarly, removing the AAFM (Model B) leads to a 0.4% decline in mIoU. The IoU for road mark decreases by 0.9%, and that for fence drops by 4.1%. These findings underscore the importance of the AAFM in improving segmentation accuracy for these classes. The AAFM enhances the model’s ability to identify key regions by integrating the Spatial Attention Module (SAM) in the encoder and the Channel Attention Module (CAM) in the decoder, which jointly optimize feature extraction and restoration. Finally, Model C, which excludes the Lovász-Softmax Loss and retains only the baseline network’s aggregation loss function, shows a 0.9% decrease in mIoU relative to LSSCC-Net, the largest drop among the three variants. Specifically, the IoU values for road mark, utility line, pole, and fence decrease by 2.8%, 1.2%, 1.7%, and 2.3%, respectively. This indicates that incorporating the Lovász-Softmax Loss as an auxiliary loss function complements the original aggregation loss function, thereby enhancing the model’s segmentation capability for small targets and class boundaries.
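To make the role of the auxiliary loss concrete, the following is a minimal NumPy sketch of the Lovász-Softmax loss in its standard formulation (the Lovász extension of the Jaccard loss, averaged over the classes present in the batch). It computes the forward value only. The cross-entropy term is a stand-in for the baseline aggregation loss, and `aux_weight` is a hypothetical weighting factor; neither is taken from this study's implementation.

```python
import numpy as np

def lovasz_grad(gt_sorted):
    """Gradient of the Lovász extension of the Jaccard loss.
    gt_sorted: binary ground truth ordered by decreasing prediction error."""
    gts = gt_sorted.sum()
    intersection = gts - np.cumsum(gt_sorted)
    union = gts + np.cumsum(1.0 - gt_sorted)
    jaccard = 1.0 - intersection / union
    jaccard[1:] = jaccard[1:] - jaccard[:-1]   # discrete differences
    return jaccard

def lovasz_softmax(probs, labels):
    """probs: (N, C) softmax outputs; labels: (N,) integer class labels."""
    losses = []
    for c in range(probs.shape[1]):
        fg = (labels == c).astype(float)
        if fg.sum() == 0:          # skip classes absent from this batch
            continue
        errors = np.abs(fg - probs[:, c])
        order = np.argsort(-errors, kind="stable")
        losses.append(float(np.dot(errors[order], lovasz_grad(fg[order]))))
    return float(np.mean(losses))

def cross_entropy(probs, labels, eps=1e-12):
    """Stand-in for the baseline loss (an assumption, not the paper's loss)."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(np.clip(picked, eps, 1.0))))

def combined_loss(probs, labels, aux_weight=1.0):
    """Baseline loss plus a weighted Lovász-Softmax auxiliary term."""
    return cross_entropy(probs, labels) + aux_weight * lovasz_softmax(probs, labels)
```

Because each class contributes equally to the average regardless of how many points it contains, this term gives small classes such as road mark or pole more influence than a point-wise loss does, which is consistent with the per-class changes observed for Model C. In actual training, the same computation would be written in an automatic-differentiation framework so that gradients can flow through the sorted errors.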
3.3. Efficiency Analysis
We evaluated the computational efficiency of the proposed model on the Toronto3D dataset using per-batch training time and total inference time as efficiency metrics, while OA and mIoU served as performance indicators. All experiments used a batch size of 4. The results are summarized in Table 5.
Table 5 shows that LSSCC-Net achieves the best segmentation performance at the cost of reduced efficiency. Its per-batch training time was 496.82 ms and its total inference time was 69.66 s, both longer than those of the baseline LACV-Net (322.37 ms and 58.67 s, respectively). The efficiency gap mainly results from the structural enhancements in LSSCC-Net: dual-dimensional offset learning in the Spatial–Feature Dynamic Aggregation (SFDA) module and the symmetric deployment of attention mechanisms in the Adaptive Attention Fusion Mechanism (AAFM) increase computational complexity, thereby raising both training and inference costs. Notably, although LSSCC-Net incurs higher per-batch computation time than LACV-Net, it converges in fewer training epochs. On the Toronto3D dataset, LSSCC-Net reached its optimal mIoU within 62 epochs, a 37.4% reduction in training epochs compared with LACV-Net (99 epochs). Similarly, on the S3DIS dataset, LSSCC-Net converged in 58 epochs, a 31.0% reduction relative to LACV-Net (84 epochs).
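The trade-off between slower batches and fewer epochs can be checked with simple arithmetic on the figures quoted above. The sketch below is a back-of-the-envelope estimate, not a measured result: it assumes that both models process the same number of batches per epoch and that training stops at the epoch where the best mIoU is reached. All variable names are illustrative.

```python
def relative_change(new, old):
    """Signed relative change of `new` with respect to `old`."""
    return (new - old) / old

# Figures quoted in the text (Toronto3D timings; epochs for both datasets).
batch_ms = {"LACV-Net": 322.37, "LSSCC-Net": 496.82}
infer_s = {"LACV-Net": 58.67, "LSSCC-Net": 69.66}
epochs_toronto3d = {"LACV-Net": 99, "LSSCC-Net": 62}
epochs_s3dis = {"LACV-Net": 84, "LSSCC-Net": 58}

batch_overhead = relative_change(batch_ms["LSSCC-Net"], batch_ms["LACV-Net"])
infer_overhead = relative_change(infer_s["LSSCC-Net"], infer_s["LACV-Net"])
epoch_change_toronto3d = relative_change(
    epochs_toronto3d["LSSCC-Net"], epochs_toronto3d["LACV-Net"])
epoch_change_s3dis = relative_change(
    epochs_s3dis["LSSCC-Net"], epochs_s3dis["LACV-Net"])

# Estimated ratio of total training time on Toronto3D, assuming an equal
# number of batches per epoch for both models (an assumption, see above).
training_cost_ratio = (
    batch_ms["LSSCC-Net"] * epochs_toronto3d["LSSCC-Net"]
) / (batch_ms["LACV-Net"] * epochs_toronto3d["LACV-Net"])
```

Under these assumptions, the per-batch time is roughly 54% higher and the inference time about 19% higher, while the estimated total training time on Toronto3D comes out at roughly 0.97 of the baseline. In other words, the faster convergence approximately offsets the slower batches during training, whereas the inference overhead remains.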
5. Conclusions
To tackle semantic segmentation in large-scale point cloud scenes, this study enhanced the LACV-Net baseline and introduced the LSSCC-Net architecture. The key contributions are as follows. First, the Spatial–Feature Dynamic Aggregation (SFDA) module was added during the encoding phase. This module extracts offset information through symmetric interaction between spatial positions and feature channels, enhancing the local contextual representation of point clouds and improving robustness to variations in spatial structures. Second, an Adaptive Attention Fusion Mechanism (AAFM) was developed. During the encoding phase, a Spatial Attention Module (SAM) was integrated to emphasize significant regions. In the decoding phase, a Channel Attention Module (CAM) was incorporated to strengthen feature expression. This design combines the strengths of both attention mechanisms to optimize feature extraction and restoration, thereby enhancing overall performance and generalization. Finally, the Lovász-Softmax Loss was introduced as an auxiliary optimization objective alongside the baseline aggregation loss, yielding an IoU-oriented joint loss. This loss design enhances the model’s ability to segment small objects, delineate class boundaries, and capture the complete regions of large objects.
Experiments were conducted on the outdoor point cloud dataset Toronto3D and the indoor dataset S3DIS to assess the proposed model. For Toronto3D, using Region 2 as the test set, LSSCC-Net achieved an overall accuracy (OA) of 97.7% and a mean Intersection-over-Union (mIoU) of 83.6%. For S3DIS, with Area 5 as the test set, it achieved an OA of 88.3% and an mIoU of 65.2%. Quantitative comparisons with recent state-of-the-art methods indicate that LSSCC-Net is effective in segmenting small-scale objects with complex geometric structures and produces more accurate boundary delineation across different object categories, leading to improved overall segmentation performance.