1. Introduction
Airborne LiDAR is widely used for large-scale 3D mapping and urban modelling. LiDAR systems generate high-density point cloud data, which represents the geometric structure of the ground and objects on it with high accuracy. LiDAR data has been extensively utilised in urban areas across a wide range of applications, including the classification of land cover and use, the generation of digital elevation models (DEM) and digital surface models (DSM), the extraction of buildings and roads, the creation of 3D city models, forestry analyses, and disaster management [
1,
2,
3,
4,
5,
6].
In the context of LiDAR data analysis, semantic segmentation is the automated assignment of each point in the point cloud to a class (e.g., road, building, vehicle, pedestrian, tree). This supports both geometric and semantic understanding of the real-world 3D scene. Accurate classification is particularly important in applications such as urban modelling, infrastructure monitoring, and autonomous driving [
7,
8,
9]. Nevertheless, the semantic segmentation of LiDAR point clouds remains challenging because of their irregular structure, varying point densities, noise, and the complex spatial relationships between objects [
8,
10,
11].
Traditional approaches to point cloud classification rely on manually designed geometric features and classical machine learning algorithms. These methods frequently require extensive feature engineering and may struggle to capture complex spatial patterns in large-scale datasets [
8,
12,
13].
In recent years, deep learning methods have substantially advanced point cloud analysis. Deep learning-based approaches fall into three main groups: point-based, voxel-based, and projection-based (in which the 3D point cloud is converted into a 2D range image) [
7,
14,
15]. PointNet is a deep learning architecture that accepts raw three-dimensional point clouds directly as input and operates in a permutation-invariant manner, i.e., it is unaffected by the order of the points. The network first processes each point independently and then combines the extracted features into a single global feature using a symmetric function, max pooling, to perform tasks such as object classification and segmentation [
16]. PointNet++ is a hierarchical neural network architecture developed to address the limitations of PointNet in capturing local structures. It partitions the point cloud into overlapping local regions, defined by the distances between points in the underlying metric space, and extracts features from these regions at progressively larger scales [
17]. These architectures underpin a large number of point cloud semantic segmentation studies. Voxel-based approaches (e.g., VoxNet, 3D ShapeNets) convert the point cloud into a regular 3D grid and apply 3D convolutions; while these approaches effectively capture global context and volumetric structure, memory and computational costs grow cubically with grid resolution, and much of that cost is spent on empty voxels [
18,
19,
20,
21]. Projection-based methods, by contrast, project the 3D points onto 2D range images so that 2D CNN architectures, including pre-trained ones, can be reused. 3D-MiniNet and several other LiDAR networks reach real-time speeds for autonomous driving in this way. However, some spatial accuracy and depth detail may be lost during projection [
14,
22,
23,
24].
Distinguishing buildings, ground, and vegetation in airborne LiDAR point clouds is challenging because of both geometric similarities and limitations in data quality. The ground and building roofs frequently appear as wide, flat, or gently sloping surfaces, which leads to building–ground confusion in methods that rely exclusively on simple geometric characteristics such as elevation or planarity [
25,
26]. In areas of dense vegetation, ground points become sparse or disappear entirely because LiDAR pulses are blocked by the canopy, and the canopy surface itself can appear quasi-planar, resembling a flat surface [
27,
28]. In particular, low shrubs, grassy areas, and vegetation in close proximity to the ground approach the ground in terms of height and roughness, further complicating classification [
28,
29].
In complex urban areas, where modern development and natural vegetation coexist, the overlap of all three classes within the same area serves to further blur class boundaries. The problem is exacerbated by sparse point clouds, irregular sampling, and objects at different scales [
29,
30,
31]. Consequently, recent studies have concentrated on the simultaneous separation of buildings, ground, and vegetation using multi-feature spaces (height, shape, texture, density), adaptive neighborhood selection, and advanced machine learning/deep learning models (Random Forest, XGBoost, and deep neural networks). Nevertheless, errors and ambiguities remain significant, particularly in areas with dense vegetation, shading, and complex roof–tree interactions [
12,
29,
30,
31,
32,
33].
3D LiDAR sensors provide a detailed point cloud representation of the environment across a wide range of fields, from autonomous driving to smart city infrastructure. The direct processing of this data in 3D space is challenging due to variations in density and scale, noise, irregular sampling, and large data volumes. Consequently, in recent years, 3D deep learning methods operating on point clouds—particularly semantic segmentation models—have played a key role in the automatic understanding of complex scenes. Nevertheless, the efficient processing of large-scale open-area scenes and data-related issues such as class imbalance continue to limit the performance of existing methods [
34,
35].
The objective of this study is to propose an approach to the multi-class semantic segmentation of aerial LiDAR point clouds, with particular attention to the building class. To this end, a four-class segmentation problem (Unclassified, Vegetation, Ground, and Building) was defined using the Oregon LiDAR dataset obtained via the OpenTopography platform.
The main contributions of this study can be summarized as follows:
Different deep learning architectures, including baseline architectures (PointNet, PointNet++ MSG, and VoxelNet Lite) as well as recent state-of-the-art methods such as Point Transformer, KPConv, and RandLA-Net, were comparatively evaluated on the same dataset for point cloud segmentation.
A transformer-based feature fusion approach named PointNet2 MSG Transformer, based on the PointNet++ MSG architecture, which enables multi-scale feature extraction, has been proposed.
The effects of different training configurations, loss functions, ensemble learning, and test-time augmentation methods on model performance were analysed.
The experimental results demonstrate that the proposed approach achieves an mIoU of 51.74% and an accuracy of 61.50% on the test dataset.
The primary challenges in point cloud segmentation stem from the irregular and sparse data structure, variations in density, noise, and geometric uncertainty. The principal causes of class-specific performance loss are imbalanced and mutually similar classes, small but important objects, inconsistent labelling schemes, and labour-intensive labelling. Addressing these challenges requires sampling techniques that preserve class balance, networks that make effective use of contextual information, and high-quality, standardised labels. The present study proposes a Transformer-enhanced PointNet++ MSG architecture that enables adaptive multi-scale feature fusion. Additionally, model ensembles and test-time augmentation (TTA) were employed to enhance prediction stability, particularly in challenging boundary regions. The proposed approach was evaluated for multi-class LiDAR segmentation with particular attention to the building class. Unlike conventional PointNet++-based approaches that rely on fixed multi-scale feature concatenation, this study proposes a scale-aware transformer fusion mechanism that dynamically learns the importance of features at different scales. This enables improved representation of complex structures such as vegetation and building boundaries, particularly under class imbalance conditions.
The remainder of the paper is organized as follows:
Section 2 reviews relevant studies in the literature on LiDAR point cloud segmentation and deep learning-based approaches.
Section 3 provides a detailed discussion of the dataset used, data pre-processing, sampling strategies, and the proposed PointNet2 MSG Transformer architecture.
Section 4 presents the experimental results and performance analyses, while
Section 5 discusses the findings.
Section 6 summarizes the general conclusions drawn from the study and outlines future work.
3. Materials and Methods
3.1. Data Source and Dataset Structure
The LiDAR dataset used in this study was generated as part of the Oregon Department of Geology and Mineral Industries (DOGAMI) LiDAR Program and was obtained via the OpenTopography platform [
65]. The study area covers the region surrounding Oregon State University (
Figure 1).
The raw LiDAR point cloud data was divided into 100 m × 100 m square tiles using the CloudCompare v2.14.alpha software program. This process ensured that the data was divided into smaller, more manageable units for training purposes. A total of 580 tiles were obtained. Of these, 480 were allocated for training and 100 for testing (
Figure 2).
During model training, 20% of the training data was set aside for validation purposes (val_split = 0.2). As a result, 384 out of 480 training tiles were used for training the model, while 96 were used for validation. The 100 tiles that made up the test data were not included in the training process but were used solely for the final performance evaluation.
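The split arithmetic described above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `split_tiles` is hypothetical, and only the tile counts and `val_split = 0.2` come from the text.

```python
def split_tiles(n_train_tiles: int, val_split: float) -> tuple[int, int]:
    """Return (n_train, n_val) after holding out a validation fraction."""
    n_val = round(n_train_tiles * val_split)
    return n_train_tiles - n_val, n_val


n_total, n_test = 580, 100                       # tile counts from the text
n_train, n_val = split_tiles(n_total - n_test, val_split=0.2)
print(n_train, n_val, n_test)
```

The three subsets are disjoint by construction, so the 100 test tiles never influence training or model selection.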
For each point in the dataset, there are X, Y, and Z coordinates, red, green, and blue color values, intensity information, and a class label. Therefore, the data contains geometric and radiometric properties.
Four classes have been defined for semantic segmentation. The classes are as follows: 0 is ‘Unclassified’, 1 is ‘Vegetation’, 2 is ‘Ground’ and 3 is ‘Building’.
Table 1 shows the distribution of points across these classes in the dataset.
According to Table 1, the Vegetation class makes up 57.21% of the dataset, while the Ground class makes up 27.89%. Meanwhile, the Building and Unclassified classes each account for around 7%. This indicates that the class distribution in the dataset is imbalanced. These proportions were taken into account during model training and performance evaluation.
This class imbalance is significant: vegetation accounts for more than half of all points, while the Building and Unclassified classes each represent less than 10%. The imbalance directly affects the learning process and motivates the use of specialized loss functions and evaluation strategies in this study.
3.2. Fixed Point Sampling Strategy
PointNet and PointNet++ based architectures are designed to operate on fixed-size input tensors. Therefore, it is not possible to feed raw LiDAR point clouds, which contain varying numbers of points, directly into the model. To ensure that the model can accept inputs of the same size for each sample, all point cloud segments have been converted to contain a fixed number of points.
In LiDAR scenes, however, there is no single ‘correct’ number of points. The typical range varies depending on the task and the scale of the object or scene. For example, 1024–2048 points are commonly selected for small-scale studies with simple objects, whereas 2048–4096 points are chosen for scenes containing more complex objects [
66,
67]. In this study, each point cloud sample contained 4096 points during the training phase. This sampling size is commonly used in PointNet++-based semantic segmentation studies and provides sufficient geometric detail for hierarchical feature extraction [
68].
Two cases were considered when fixing the number of points. If a point cloud segment contained 4096 or more points, random subsampling was applied [
69,
70], and 4096 points were selected at random and fed into the model as input. If a segment contained fewer than 4096 points, it was padded by resampling (duplicating) points from the existing set until the input size reached 4096. This ensures that every sample in the dataset can be included in the training process [
71,
72].
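The two cases can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the code used in the study: the function name `sample_fixed` is hypothetical, subsampling is assumed to be without replacement, and padding is assumed to draw duplicates uniformly at random.

```python
import numpy as np


def sample_fixed(points: np.ndarray, n_points: int = 4096, rng=None) -> np.ndarray:
    """Return exactly n_points rows from a non-empty (N, F) array."""
    rng = np.random.default_rng() if rng is None else rng
    n = points.shape[0]
    if n >= n_points:
        # Enough points: random subsampling without replacement.
        idx = rng.choice(n, size=n_points, replace=False)
    else:
        # Too few points: keep every original point, then pad with duplicates.
        extra = rng.choice(n, size=n_points - n, replace=True)
        idx = np.concatenate([np.arange(n), extra])
    return points[idx]
```

Keeping all original points in the padded case means no information is discarded from sparse segments; only duplicates are added.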
In the model input, each point is represented by a feature vector consisting of seven attributes. These attributes are the X, Y, and Z coordinates, the R, G, and B color values, and the LiDAR intensity data. Thus, the model can utilize both the point’s spatial location and its radiometric properties.
In order to reduce computational costs during the inference phase, the point clouds were downsampled to 2048 points. The model then generates class predictions for these points, which are subsequently interpolated onto the original point cloud. For this process, the nearest neighbor interpolation method was used. For each original point, the nearest point in the downsampled set was identified, and its prediction was assigned to the corresponding original point. Thus, class predictions were obtained for the entire point cloud [
73]. Transferring predictions from a sparse subset back to the full-resolution point cloud in this way is common in PointNet++-based pipelines and keeps the computational cost of inference low.
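A brute-force sketch of this nearest-neighbour label transfer is given below. The function name `propagate_labels` and the chunk size are hypothetical, and a spatial index such as a k-d tree would normally replace the dense distance matrix for large tiles.

```python
import numpy as np


def propagate_labels(full_xyz, sub_xyz, sub_labels, chunk=1024):
    """Assign each original point the label of its nearest downsampled point."""
    full_xyz = np.asarray(full_xyz, dtype=float)
    sub_xyz = np.asarray(sub_xyz, dtype=float)
    sub_labels = np.asarray(sub_labels)
    out = np.empty(len(full_xyz), dtype=sub_labels.dtype)
    # Process the original cloud in chunks to bound the distance-matrix size.
    for start in range(0, len(full_xyz), chunk):
        block = full_xyz[start:start + chunk]
        d2 = ((block[:, None, :] - sub_xyz[None, :, :]) ** 2).sum(axis=-1)
        out[start:start + chunk] = sub_labels[d2.argmin(axis=1)]
    return out
```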
3.3. Compared Deep Learning Models
This study applied a total of four different model architectures to evaluate the performance of semantic segmentation on LiDAR point clouds. The models in question are PointNet, PointNet++ MSG, VoxelNet Lite and the PointNet2 MSG Transformer model developed in this study (
Table 2). These models were selected for comparison because they employ different methods of representing point clouds. PointNet uses a direct point-based approach, PointNet++ uses an advanced point-based approach that hierarchically models local geometric relationships, and VoxelNet Lite uses a voxel-based representation method. Therefore, these models provide a suitable basis on which to evaluate the performance of the proposed architecture against different data representation strategies.
PointNet is one of the first deep learning architectures designed to process point cloud data directly. The model treats the point cloud as an unordered set of points, performing feature extraction using shared multilayer perceptron layers for each point. These point features are then combined via a symmetric pooling operation to obtain a global feature vector. Although this approach is simple and computationally efficient, it can only represent local geometric relationships to a limited extent [
16].
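The effect of the symmetric pooling step can be illustrated with a toy example. The sketch below assumes a single shared linear layer with random, untrained weights (`W` is hypothetical) and shows that max pooling yields the same global feature regardless of point order.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(16, 7))     # 16 points with 7 attributes each
W = rng.normal(size=(7, 32))          # hypothetical shared per-point weights


def global_feature(pts):
    per_point = np.maximum(pts @ W, 0.0)   # shared layer applied to every point
    return per_point.max(axis=0)           # symmetric max pooling over points


shuffled = points[rng.permutation(len(points))]
same = np.allclose(global_feature(points), global_feature(shuffled))
print(same)
```

Because the maximum over a set does not depend on the order of its elements, the pooled vector is identical for any permutation of the input.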
PointNet++ is an enhanced version of the PointNet architecture that uses a hierarchical structure to capture local geometric features more effectively. The model extracts features at different scales by progressively downsampling the point cloud through Set Abstraction layers. This study used the Multi-Scale Grouping (MSG) variant of the PointNet++ architecture. The MSG approach enables multi-scale feature learning by forming neighborhood groups at different radii [
17].
The third architecture used for comparison is VoxelNet Lite. VoxelNet-based approaches convert the point cloud into a regular voxel grid structure, performing feature extraction using 3D convolution operations. This enables the spatial structure of the point cloud to be modeled with a more regular representation. In this study, VoxelNet Lite—a lighter version of the VoxelNet architecture—was used to reduce computational costs [
21].
Together, these three baselines cover direct point-based, hierarchical point-based, and voxel-based representations.
3.4. Proposed Model: PointNet2 MSG Transformer
This study presents a PointNet++ MSG-based model architecture developed to enhance the performance of semantic segmentation on LiDAR point clouds. This model combines multi-scale feature extraction with a transformer-based feature fusion mechanism and is named the ‘PointNet2 MSG Transformer’ (
Figure 3).
The PointNet++ architecture is designed to perform hierarchical feature extraction from point clouds. It learns geometric features at different scales by progressively subsampling the point cloud through Set Abstraction (SA) layers. Each SA layer selects a specific number of points and extracts local features from the regions surrounding them. The Farthest Point Sampling (FPS) method is used during this process to select sample points that represent the spatial distribution of the point cloud.
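Farthest Point Sampling can be sketched in a few lines of NumPy. This is a minimal illustrative version; the function name and the choice of the first point as the seed are assumptions.

```python
import numpy as np


def farthest_point_sampling(xyz, n_samples, start=0):
    """Greedily pick indices so each new point is farthest from those chosen."""
    xyz = np.asarray(xyz, dtype=float)
    selected = np.empty(n_samples, dtype=np.int64)
    min_d2 = np.full(xyz.shape[0], np.inf)   # squared distance to nearest chosen point
    current = start
    for i in range(n_samples):
        selected[i] = current
        d2 = ((xyz - xyz[current]) ** 2).sum(axis=1)
        min_d2 = np.minimum(min_d2, d2)
        current = int(min_d2.argmax())       # farthest remaining point
    return selected
```

Compared with uniform random sampling, this greedy rule spreads the selected centroids over the whole extent of the point cloud, including sparsely sampled regions.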
At the model’s input, each point is represented by a seven-dimensional feature vector (x, y, z, r, g, b, I).
Here, the coordinates x, y, z represent the point’s spatial position; the values r, g, b represent color information; and the value I represents LiDAR intensity information.
To ensure a fixed size, the point clouds used as model input were standardized to 4096 points, as described in Section 3.2.
The PointNet2 MSG Transformer architecture contains four Set Abstraction layers. The number of points is gradually reduced across these layers.
This hierarchical structure facilitates the identification of both local and global geometric features within the point cloud. Small-scale geometric details are learned at lower levels, while broader spatial contexts are represented at higher levels.
In the PointNet++ MSG architecture, each Set Abstraction layer creates neighborhood regions with different radii to perform multi-scale grouping. This approach allows geometric structures of different sizes to be represented simultaneously.
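Multi-scale grouping can be illustrated with a simple ball query repeated at several radii. The radii and neighbour limits used with this sketch are hypothetical and are not the values of the proposed model; the function names are likewise illustrative.

```python
import numpy as np


def ball_query(xyz, centroids, radius, max_neighbors):
    """For each centroid, return indices of up to max_neighbors points within radius."""
    groups = []
    for c in centroids:
        d2 = ((xyz - c) ** 2).sum(axis=1)
        groups.append(np.flatnonzero(d2 <= radius ** 2)[:max_neighbors])
    return groups


def multi_scale_groups(xyz, centroids, radii, max_neighbors):
    """Run one ball query per scale around the same centroids."""
    return [ball_query(xyz, centroids, r, k) for r, k in zip(radii, max_neighbors)]
```

Each scale then yields its own local feature for the same centroid, which is what produces the per-centroid multi-scale features that are fused later.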
This study extends the classic PointNet++ MSG architecture with the scale-aware transformer fusion mechanism (
Figure 3).
3.4.1. Scale-Aware Transformer Fusion
The present study proposes an alternative to the standard multi-scale feature fusion method used in the traditional PointNet++ MSG architecture: a fusion mechanism termed Scale-Aware Transformer Fusion. The objective of this module is to build a more adaptable representation by weighting the features obtained at different scales, with the weights determined by feature content rather than by a predetermined fusion rule. As shown in
Figure 4, multi-scale features are converted into tokens and processed by a Transformer encoder, followed by an adaptive gating mechanism.
In the PointNet++ MSG architecture, each Set Abstraction (SA) layer generates three distinct features from three neighbourhood regions, with differing radii centred on the same centroid points. The following definition is provided for these features:
In this context, each feature tensor holds the features of scale t, B denotes the batch size, S represents the number of centroids, and T is the number of scales. The complete notation used in this study is summarized in
Table 3.
Because the number of channels may differ across scales, all features are projected to a common dimension by linear projection layers.
The features from the three scales are then treated as a sequence for each centroid point. Consequently, three “tokens” are created for each centroid:
The transformation is applied independently to each centroid point. The data is thus reorganised as follows:
Under this arrangement, the Scale-Aware Transformer Fusion module models only the inter-scale relationship: attention is calculated across the three scales of each centroid, rather than across the entire point cloud.
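The projection and tokenisation steps can be sketched with array shapes alone. In the sketch below the per-scale channel counts and the common token width `D` are hypothetical, and random matrices stand in for the learned linear projections.

```python
import numpy as np

rng = np.random.default_rng(0)
B, S, T, D = 2, 5, 3, 8                 # D: hypothetical common token width
channels = [4, 6, 10]                   # hypothetical per-scale channel counts

# One (B, S, C_t) feature tensor and one linear projection per scale.
feats = [rng.normal(size=(B, S, c)) for c in channels]
proj = [rng.normal(size=(c, D)) for c in channels]

# Project to the common width and stack the scales as tokens: (B, S, T, D).
tokens = np.stack([f @ w for f, w in zip(feats, proj)], axis=2)

# Fold batch and centroid axes so every centroid is a sequence of T tokens.
seq = tokens.reshape(B * S, T, D)
print(tokens.shape, seq.shape)
```

After the reshape, each of the B·S rows is an independent sequence of three tokens, so attention never mixes information between different centroids.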
The Transformer Encoder employs a multi-head self-attention mechanism. The query, key, and value matrices are computed for the input tokens:
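The attention mechanism is defined as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$
Here, Q, K and V represent the query, key, and value matrices, respectively, and $d_k$ is the dimension of the key vectors.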
In this way, each scale is updated using information received from the other scales. The architecture employed in this study consists of two Transformer Encoder layers, each of which incorporates residual connections, LayerNorm, and GELU activation.
Figure 5 illustrates how self-attention models the relationships between tokens corresponding to different spatial scales.
The proposed Scale-Aware Transformer Fusion module differs significantly from direct attention-based point cloud models such as the Point Transformer. Point Transformer aims to model spatial relationships by computing attention across all points or broad neighborhoods, so its attention cost increases with the number of points. The proposed architecture, by contrast, computes attention only across the three scales of each centroid; that is, it focuses on modelling inter-scale relationships. Because attention operates on a fixed and small number of tokens (T = 3), the added computational cost is small, and the hierarchical structure of PointNet++ is maintained.
In this approach, the attention matrix size is only 3 × 3 for each head, keeping the model’s computational overhead limited. Unlike methods that use global attention across the entire point cloud, this design offers a more efficient architectural configuration.
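The 3 × 3 attention pattern can be reproduced with a single-head sketch of scaled dot-product self-attention. This is an illustration only: the weight matrices are random placeholders, and the multi-head split, residual connections, LayerNorm, and feedforward block of a full encoder layer are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head attention over the T scale tokens of each centroid."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)           # (N, T, T); T = 3 gives 3 x 3 maps
    return attn @ v, attn
```

The attention matrix has T × T entries per centroid regardless of how many points the tile contains, which is why the cost of this step grows only linearly with the number of centroids.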
The Transformer Encoder output is obtained as follows:
This output represents updated features that incorporate cross-scale interactions for each centroid point and is passed to the Adaptive Scale Aggregation module, where the scales will be weighted and combined in the next stage.
Consequently, Scale-Aware Transformer Fusion provides a dynamically learned fusion mechanism for each location, as opposed to a fixed aggregation of multi-scale features. This approach contributes to a more balanced representation, particularly in LiDAR scenes where different geometric structures coexist.
3.4.2. Adaptive Scale Aggregation
The Scale-Aware Transformer Fusion module generates updated token representations that incorporate cross-scale interactions for each centroid point. The output obtained at this stage is expressed as follows:
Here, each centroid point has three scale tokens (T = 3). These tokens are reduced to a single feature vector by a weighted aggregation mechanism termed Adaptive Scale Aggregation.
The purpose of this mechanism is to determine which of the three scales is more important for each centroid point and to perform the aggregation accordingly. The process is facilitated by a learnable gating structure.
Initially, a score value is calculated for each token:
This process is performed using a small network consisting of linear layers and activation functions:
The obtained scores are then normalised using the softmax function and converted into weights:
Utilising these weights, the three scale tokens are weighted and combined:
Consequently, a single feature vector is obtained for each centroid point:
The application of this process to all centroid points results in the following conversion of the output:
The hyperparameters of the fusion modules, including the number of centroids, token dimensions, and Transformer settings (e.g., number of heads and feedforward dimensions), vary across encoder levels and are summarized in
Table 4.
This structure enables the model to select the scale dynamically for each location, removing the need for fixed aggregation methods. The model can, for example, assign greater weight to small-radius (detail-focused) features in some regions and to large-radius (overall structure-focused) features in others.
The proposed Adaptive Scale Aggregation mechanism is distinct from fixed fusion methods, such as classical concatenation or averaging. While fixed methods treat all scales equally, in this approach, the contribution of each scale varies depending on the data content. This approach offers a more flexible representation, particularly in LiDAR scenes where different geometric structures coexist.
In addition, the weights generated by the gating mechanism can be used for analysis: they show which scales the model prefers in specific regions, which aids interpretability.
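The gating steps described in this subsection (score, softmax, weighted sum) can be sketched as follows. The two-layer scoring network, its ReLU activation, and all parameter names are assumptions made for illustration; the text specifies only a small network of linear layers and activation functions.

```python
import numpy as np


def adaptive_scale_aggregation(tokens, w1, b1, w2, b2):
    """Fuse (N, T, D) scale tokens into (N, D) features with learned gates."""
    hidden = np.maximum(tokens @ w1 + b1, 0.0)          # (N, T, H), ReLU assumed
    scores = (hidden @ w2 + b2)[..., 0]                 # one score per token: (N, T)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)       # softmax over the T scales
    fused = (weights[..., None] * tokens).sum(axis=1)   # weighted sum: (N, D)
    return fused, weights
```

Returning the weights alongside the fused features makes it straightforward to inspect which scale dominates at each centroid.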
The features obtained in the Set Abstraction layers and the proposed fusion modules are propagated back to higher-resolution point representations via standard Feature Propagation (FP) layers as defined in the original PointNet++ architecture. These layers enable point-level predictions by interpolating low-resolution features into denser point sets.
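The interpolation inside a Feature Propagation layer can be sketched as an inverse-distance-weighted average over the nearest sparse points. The sketch assumes k = 3 neighbours and squared-distance weights, the setting commonly associated with PointNet++, and omits the skip connections and MLP that follow in a full FP layer.

```python
import numpy as np


def interpolate_features(dense_xyz, sparse_xyz, sparse_feats, k=3, eps=1e-8):
    """Upsample (Ns, C) features to Nd dense points by inverse-distance weighting."""
    d2 = ((dense_xyz[:, None, :] - sparse_xyz[None, :, :]) ** 2).sum(axis=-1)  # (Nd, Ns)
    idx = np.argsort(d2, axis=1)[:, :k]                  # k nearest sparse points
    nn_d2 = np.take_along_axis(d2, idx, axis=1)
    w = 1.0 / (nn_d2 + eps)                              # closer points weigh more
    w /= w.sum(axis=1, keepdims=True)
    return (w[..., None] * sparse_feats[idx]).sum(axis=1)   # (Nd, C)
```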
The final stage of the model consists of a segmentation head, where class probabilities are computed for each point using multi-layer perceptron (MLP) layers.
In this study, semantic segmentation was performed using four classes: Unclassified, Vegetation, Ground, and Building. The model output is a vector representing the class probabilities for each point, obtained by applying the softmax function to the class logits:
$$p_c = \frac{\exp(z_c)}{\sum_{j=1}^{C} \exp(z_j)}$$
In this context, $z_c$ denotes the logit value of class c, while C signifies the total number of classes.
The proposed architecture aims to learn a more powerful representation of the point cloud by combining PointNet++ MSG-based multi-scale feature extraction with a transformer-based feature integration mechanism. Unlike traditional PointNet++ MSG architectures that rely on fixed feature concatenation, the proposed model introduces a transformer-based fusion mechanism that enables adaptive weighting of multi-scale features. This allows the model to dynamically focus on the most informative spatial scales depending on the input structure, which is particularly beneficial for complex classes such as vegetation and building boundaries.
3.5. Training Configuration
The proposed model and comparison models were trained using a PyTorch-based training pipeline with the Adam optimizer. The initial learning rate was set to 0.001, and weight decay (L2 regularization) was set to $1 \times 10^{-4}$ to limit overfitting of the model parameters.
The Adam optimization algorithm performs parameter updates using first- and second-moment estimates.
During the training phase, the batch size was set to 8 and the maximum training duration to 50 epochs. Early stopping was used to halt training once model performance stopped improving: the mIoU on the validation dataset was monitored, and training was stopped if no improvement was observed for 15 consecutive epochs.
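The early-stopping rule can be sketched independently of any training framework. The function below is hypothetical and takes a precomputed list of validation mIoU values in place of an actual training loop; only the 50-epoch limit and the patience of 15 come from the text.

```python
def early_stopping_epoch(val_miou_per_epoch, max_epochs=50, patience=15):
    """Return (best_epoch, best_miou, last_epoch_run) under patience-based stopping."""
    best, best_epoch, stale, epoch = float("-inf"), -1, 0, -1
    for epoch, miou in enumerate(val_miou_per_epoch[:max_epochs]):
        if miou > best:
            best, best_epoch, stale = miou, epoch, 0   # improvement resets the counter
        else:
            stale += 1
            if stale >= patience:                      # 15 epochs without improvement
                break
    return best_epoch, best, epoch
```

In practice the checkpoint from `best_epoch` would be the one kept for evaluation.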
The learning rate was updated throughout the training process using the CosineAnnealingWarmRestarts scheduler. This scheduler lowers the learning rate along a cosine curve and resets it to its initial value (a warm restart) at set epoch intervals, which helps the model avoid stagnating in local minima. The scheduler parameters were set as follows:
$T_0 = 10$;
$T_{\mathrm{mult}} = 2$;
$\eta_{\min} = 10^{-6}$.
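With these parameters, the resulting schedule can be sketched in closed form. The function below is a standalone illustration that follows the cosine warm-restart rule, assuming one scheduler step per epoch and the initial learning rate of 0.001 stated earlier; the function name is hypothetical.

```python
import math


def cosine_warm_restart_lr(epoch, base_lr=1e-3, eta_min=1e-6, t0=10, t_mult=2):
    """Learning rate at a given epoch for cosine annealing with warm restarts."""
    t_i, t_cur = t0, epoch
    while t_cur >= t_i:          # locate the current cycle and the position within it
        t_cur -= t_i
        t_i *= t_mult            # each cycle is t_mult times longer than the last
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))


for e in (0, 5, 9, 10, 29, 30, 49):
    print(e, f"{cosine_warm_restart_lr(e):.6f}")
```

Under these settings the cycles start at epochs 0, 10, and 30, so a 50-epoch run contains two restarts and ends partway through the third cycle.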
The training process was conducted using an RTX 5070 Ti GPU (16 GB VRAM). Model training ran on the GPU, with the CPU used only for data loading.
3.6. Loss Functions
In this study, several loss functions were evaluated to determine their impact on model performance. Dice Loss was used as the primary loss function. Dice Loss is particularly advantageous in semantic segmentation problems characterised by class imbalance, because it is closely related to the Intersection-over-Union (IoU) metric.
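The Dice coefficient is calculated as follows:
$$\mathrm{Dice}_c = \frac{2 \sum p_c\, y_c}{\sum p_c + \sum y_c}$$
In this context, $p_c$ symbolises the probability value for class c as predicted by the model, whilst $y_c$ denotes the true class label, and the sums run over all points. The total Dice Loss value is obtained by averaging the Dice coefficients calculated for all classes and subtracting this average from one.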
In addition to Dice Loss, a Combined Loss that pairs Dice Loss with Cross Entropy Loss was also evaluated. The total loss function is calculated as follows:
In the present study, the value of α was set at 0.5.
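The loss terms can be sketched in NumPy as follows. Several details are assumptions rather than the formulation used in the study: the smoothing constant, the definition of the Dice loss as one minus the mean per-class Dice coefficient, and the convex combination α·Dice + (1 − α)·CE for the Combined Loss.

```python
import numpy as np


def dice_loss(probs, onehot, eps=1e-6):
    """Soft Dice loss for (N, C) probabilities and one-hot labels."""
    inter = (probs * onehot).sum(axis=0)
    denom = probs.sum(axis=0) + onehot.sum(axis=0)
    dice = (2 * inter + eps) / (denom + eps)       # per-class Dice coefficient
    return float(1.0 - dice.mean())


def cross_entropy(probs, onehot, eps=1e-12):
    """Mean negative log-likelihood of the true class."""
    return float(-(onehot * np.log(probs + eps)).sum(axis=1).mean())


def combined_loss(probs, onehot, alpha=0.5):
    # Assumed form: alpha weights the Dice term, (1 - alpha) the cross-entropy term.
    return alpha * dice_loss(probs, onehot) + (1 - alpha) * cross_entropy(probs, onehot)
```

Because the Dice term is averaged over classes rather than points, a rare class such as Building contributes as much to it as Vegetation does.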
Furthermore, the FocalDice loss function, which is based on Focal Loss, was evaluated experimentally in order to place greater emphasis on learning from difficult examples. Focal Loss down-weights well-classified examples so that hard or misclassified examples contribute more to the loss.
3.7. Evaluation Metrics
In order to evaluate the performance of the model, the Intersection over Union (IoU), Overall Accuracy (OA), mean IoU (mIoU), Precision, Recall, and F1-score metrics were employed. These metrics are widely utilised within the domain of semantic segmentation.
The IoU value for a class c is calculated as follows:
$$\mathrm{IoU}_c = \frac{TP_c}{TP_c + FP_c + FN_c}$$
In this context, $TP_c$, $FP_c$, and $FN_c$ denote the numbers of true positive, false positive, and false negative points for class c, respectively.
The mean intersection over union (mIoU), which is widely used in semantic segmentation tasks, is the average IoU value calculated for all classes:
$$\mathrm{mIoU} = \frac{1}{K} \sum_{c=1}^{K} \mathrm{IoU}_c$$
where K is the total number of classes.
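Precision, Recall, and F1-score for each class are defined as:
$$\mathrm{Precision}_c = \frac{TP_c}{TP_c + FP_c}, \quad \mathrm{Recall}_c = \frac{TP_c}{TP_c + FN_c}, \quad \mathrm{F1}_c = \frac{2\,\mathrm{Precision}_c\,\mathrm{Recall}_c}{\mathrm{Precision}_c + \mathrm{Recall}_c}$$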
The classification task comprises four classes: Unclassified, Vegetation, Ground, and Building.
Furthermore, Overall Accuracy (OA) was calculated as the proportion of correctly classified points among all points, in order to evaluate the model's overall performance.
All metrics were computed on a per-point basis, and class-wise values were reported to provide a detailed evaluation of model performance across different semantic categories. In addition to overall metrics, per-class IoU, Precision, Recall, and F1-score values were also analyzed.
To ensure numerical stability, a small constant was added to denominators during metric computation.
Overall Accuracy (OA) is computed as a global metric over all points. In contrast, IoU, Precision, Recall, and F1-score are defined per class and subsequently averaged across classes (macro-average) to obtain mIoU, mean Precision, mean Recall, and mean F1-score. This distinction allows the evaluation to capture both overall performance and class-wise behavior, which is particularly important in imbalanced point cloud datasets.
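All of these metrics can be derived from a single confusion matrix, as the following sketch shows. The function name and the value of the stabilising constant `eps` are assumptions; the metric definitions follow the standard per-class forms with macro-averaging.

```python
import numpy as np


def segmentation_metrics(y_true, y_pred, n_classes=4, eps=1e-10):
    """Per-point OA plus per-class and macro-averaged IoU, precision, recall, F1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)           # rows: true class, columns: predicted
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    iou = tp / (tp + fp + fn + eps)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return {
        "OA": tp.sum() / (cm.sum() + eps),       # global, over all points
        "IoU": iou, "mIoU": iou.mean(),          # macro-average over classes
        "mPrecision": precision.mean(),
        "mRecall": recall.mean(),
        "mF1": f1.mean(),
    }
```

On an imbalanced dataset OA is dominated by the majority class, whereas the macro-averaged values weight every class equally, which is why both are reported.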
3.8. Ensemble and Test-Time Augmentation
To improve model performance and the robustness of the predictions, an ensemble method and test-time augmentation (TTA) were applied in combination. The ensemble approach combines the outputs of multiple models trained under different configurations, with the aim of generating more reliable predictions that are less affected by the errors of any single model.
In this study, the PointNet2 MSG Transformer architecture was trained under several configurations, yielding multiple model variants (
Figure 6). Based on the experimental evaluations, the three models with the best performance on the validation dataset were selected for the ensemble (
Table 5).
It is acknowledged that the training of these models is undertaken using divergent architectural structures and loss functions, thus resulting in disparate error patterns being produced by each model. The employment of an ensemble approach capitalises on these variations to generate predictions that are more balanced.
During the ensemble phase, each model generates independent predictions based on the same input point cloud. These predictions are then combined to produce the final class prediction. The class probabilities generated by each model are then combined using a weighted average method.
The ensemble prediction can be expressed as follows:
$$P_{\mathrm{ens}}(c) = \sum_{i=1}^{M} w_i \, P_i(c)$$
Here, $M$ denotes the number of models used in the ensemble and $P_i(c)$ denotes the probability generated by the i-th model for class c. The coefficients $w_i$ represent the weights assigned to each model.
The following weight values were used based on the experimental evaluations:
These weights correspond to the ‘dice_only’, ‘deeper_head’ and ‘combined’ models, respectively.
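The weighted averaging of class probabilities can be illustrated as follows. This is a sketch under stated assumptions: the function name `weighted_ensemble` is hypothetical, the weights are normalised inside the function so that they sum to one, and the example weights in the usage note are placeholders rather than the values used in the study.

```python
import numpy as np


def weighted_ensemble(probs, weights):
    """Fuse per-model class probabilities with a weighted average.

    probs   : list of M arrays of shape (N, C), one per model
    weights : M non-negative weights (normalised here, an assumption)
    Returns the fused (N, C) probabilities and the (N,) predicted labels.
    """
    stacked = np.stack(probs, axis=0)             # (M, N, C)
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()                               # make the weights sum to one
    fused = np.tensordot(w, stacked, axes=1)      # weighted sum over the model axis
    return fused, fused.argmax(axis=1)
```

For three models, a call such as `weighted_ensemble([p_dice, p_deeper, p_combined], [0.5, 0.25, 0.25])` would return the fused probabilities and the per-point labels; both the variable names and the weights in this call are illustrative only.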
In addition to the ensemble method, Test-Time Augmentation (TTA) was employed to enhance the stability of the model’s predictions. In the TTA approach, the input point cloud is fed into the model multiple times after undergoing various transformations, with the resulting predictions then being combined. In this study, five different augmentations were applied to each point cloud.
The model was rerun for each augmentation, and the final prediction was obtained by averaging the resulting prediction probabilities.
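The averaging step of TTA can be sketched as below. The choice of augmentation here (rotations about the vertical axis) is an assumption made purely for illustration, as are the names `rotate_z` and `predict_with_tta`; `predict_proba` stands for any callable that maps an (N, F) point array to (N, C) class probabilities.

```python
import numpy as np


def rotate_z(points, angle):
    """Rotate the x, y, z columns about the vertical axis; other features are kept."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    out = points.copy()
    out[:, :3] = points[:, :3] @ rot.T
    return out


def predict_with_tta(predict_proba, points, angles):
    """Average class probabilities over augmented copies of one point cloud.

    The augmentations preserve point order, so the predictions can be
    averaged point by point.
    """
    probs = [predict_proba(rotate_z(points, a)) for a in angles]
    return np.mean(probs, axis=0)
```

Five augmented passes could be obtained with, for example, `angles = np.linspace(0, 2 * np.pi, 5, endpoint=False)`; this particular set of angles is hypothetical.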
In the final stage, the ensemble and TTA outputs were combined to obtain the final predictions (Table 6). This approach improved the model’s overall performance, producing more balanced predictions, particularly in cases of class imbalance.
The results demonstrate that using ensemble models and test-time augmentation improves model performance. Notably, the IoU score for the “building” class increased by around 2.12 percentage points.
4. Experiments and Results
This section evaluates the performance of the proposed PointNet2 MSG Transformer model on the problem of LiDAR point cloud semantic segmentation. Experiments were conducted on a four-class semantic segmentation problem involving unclassified areas, vegetation, ground and buildings. Model performance was analysed using the mIoU and overall accuracy metrics. A comprehensive ablation study was also conducted to investigate the impact of different loss functions, architectural changes and inference strategies on performance.
4.1. Experimental Setup
The experiments were conducted using a PyTorch-based training infrastructure. All models were trained using the Adam optimisation algorithm with an initial learning rate of 0.001. The Cosine Annealing with Warm Restarts scheduler was used for the developed model to ensure more stable updates to the learning rate.
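For reference, the learning-rate schedule produced by cosine annealing with warm restarts can be written in closed form. The sketch below evaluates it in plain Python so that it is self-contained; in PyTorch the corresponding scheduler is `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`. Only the initial learning rate of 0.001 comes from the text: the cycle length `t_0`, the multiplier `t_mult`, and the minimum rate `eta_min` are assumed values.

```python
import math


def cosine_warm_restart_lr(epoch, base_lr=1e-3, eta_min=0.0, t_0=10, t_mult=1):
    """Learning rate at an integer epoch under cosine annealing with warm restarts.

    base_lr is the initial rate reported in the text; t_0, t_mult and eta_min
    are illustrative assumptions.
    """
    t_i, t_cur = t_0, epoch
    while t_cur >= t_i:          # skip over the cycles that are already complete
        t_cur -= t_i
        t_i *= t_mult            # each new cycle is t_mult times longer
    cosine = math.cos(math.pi * t_cur / t_i)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)
```

Within each cycle the rate decays smoothly from `base_lr` towards `eta_min`, and it jumps back to `base_lr` at every restart.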
During model training, the batch size was set to 8, and each training sample contained 4096 points. Samples with more than 4096 points were reduced by random subsampling, whereas samples with fewer points were padded with repeated points. Early stopping was applied, and the model checkpoint with the best validation performance was saved.
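The fixed-size sampling step can be sketched as follows. The function name `fix_point_count` and the use of NumPy's random generator are assumptions; the sketch only mirrors the rule described above (random subsampling for large samples, repetition of existing points for small ones).

```python
import numpy as np


def fix_point_count(points, num_points=4096, rng=None):
    """Return exactly num_points rows from an (N, F) point array.

    Larger samples are randomly subsampled without replacement; smaller
    samples keep all original points and are padded with random repeats.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = points.shape[0]
    if n >= num_points:
        idx = rng.choice(n, size=num_points, replace=False)
    else:
        extra = rng.choice(n, size=num_points - n, replace=True)
        idx = np.concatenate([np.arange(n), extra])
    return points[idx]
```

The same index array would also be applied to the label vector so that points and labels stay aligned.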
The model input features consist of a seven-dimensional feature vector: (x, y, z, r, g, b, intensity). Here, x, y, and z represent coordinate information, r, g, and b represent color values, and intensity represents the LiDAR intensity value. Model performance was evaluated using the following metrics in all experiments: Mean Intersection over Union (mIoU), Overall Accuracy, Precision, Recall and F1-score. Although Precision, Recall, and F1-score are defined in Section 3.7, the evaluation in this study primarily focuses on mIoU and overall accuracy, as these are the standard metrics used in LiDAR semantic segmentation benchmarks.
4.2. Baseline Model Comparison
The first phase of the study involved comparing the performance of different deep learning architectures on LiDAR point cloud segmentation. Three different traditional model architectures (PointNet, PointNet++ MSG, and VoxelNet Lite) were evaluated alongside recent state-of-the-art architectures such as Point Transformer, KPConv, and RandLA-Net for comparison.
The results show that, although the proposed method attains the highest overall mIoU (51.74%), the architectures differ in their performance on individual classes. For instance, Point Transformer achieves the best performance in the building class (48.28% IoU), whereas KPConv performs better on vegetation due to its deformable convolution mechanism. These results suggest that different architectural designs capture distinct geometric properties of LiDAR data (Table 7).
The proposed model achieves the highest overall performance among the compared architectures, reaching 51.74% mIoU and 61.50% accuracy. Compared to PointNet++ MSG, the model provides a clear improvement in overall segmentation performance. While Point Transformer achieves higher performance in the building class, the proposed method demonstrates a more balanced performance across all classes. This indicates that the integration of multi-scale features with transformer-based fusion contributes to more stable predictions in complex scenes.
The baseline models were trained using the same dataset and similar training parameters, and their performance was compared. Initial experiments revealed that the PointNet++ MSG model performed better than the other architectures (Figure 7 and Figure 8). In particular, the model could learn geometric features at different scales more effectively thanks to its multi-scale grouping mechanism.
Consequently, subsequent stages of the study involved developing a new model based on the PointNet++ MSG architecture.
To ensure a fair comparison, all models were trained using the same dataset splits, point sampling strategy (4096 points), and preprocessing pipeline, including normalization and feature standardization. The same maximum training budget (50 epochs) and early stopping criterion, based on the validation mIoU, were applied to all models.
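The shared stopping rule can be illustrated with a small helper that replays a sequence of validation mIoU values. The 50-epoch budget is taken from the text; the `patience` value and the function name `select_best_epoch` are assumptions, since the exact early stopping settings are not reported here.

```python
def select_best_epoch(val_miou_per_epoch, max_epochs=50, patience=10):
    """Replay early stopping on a list of per-epoch validation mIoU values.

    Returns the epoch index and mIoU of the checkpoint that would be kept.
    patience is an assumed setting, not a value reported in the study.
    """
    best_miou, best_epoch, waited = float("-inf"), -1, 0
    for epoch, miou in enumerate(val_miou_per_epoch[:max_epochs]):
        if miou > best_miou:
            # A real training loop would save the model checkpoint here.
            best_miou, best_epoch, waited = miou, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break            # no improvement for `patience` epochs
    return best_epoch, best_miou
```

Applying the same rule and budget to every architecture means that each model is represented by its best validation checkpoint rather than by its last epoch.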
However, architecture-specific training strategies were employed to ensure optimal performance for each model. While the baseline models were trained using cross-entropy loss with StepLR scheduling, the proposed PointNet2 MSG Transformer was trained using Dice loss and cosine annealing with warm restarts.
Despite the differences in optimisation strategy, all models were evaluated in the same way on a shared test dataset, thereby ensuring internal consistency and enabling a reliable comparison.
4.3. Ablation Study
A comprehensive ablation study was conducted to evaluate how different training configurations affect the performance of the proposed model. Various loss functions and architectural variations were tested as part of this study.
The results show that using Dice Loss significantly improves the performance of the model. The model trained with Dice Loss achieved an mIoU improvement of approximately 2.79 percentage points over the baseline model trained with Cross-Entropy Loss (Table 8).
However, increasing the number of segmentation head layers did not yield the expected improvement: the deeper head increased the number of model parameters without a significant gain in overall performance. Similarly, combining Focal Loss and Dice Loss caused a significant drop in model performance.
Overall, these results suggest that the Dice Loss function is more suitable for LiDAR point cloud segmentation (Figure 9 and Figure 10).
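To make the role of Dice Loss concrete, a common multi-class soft Dice formulation is sketched below in NumPy (forward computation only). It is not necessarily the exact variant used in the study: the function name `soft_dice_loss`, the smoothing constant `eps`, and the unweighted averaging over classes are assumptions.

```python
import numpy as np


def soft_dice_loss(logits, labels, num_classes=4, eps=1e-6):
    """Multi-class soft Dice loss (a common formulation, assumed here).

    logits : (N, C) raw class scores for N points
    labels : (N,) integer class ids
    """
    shifted = logits - logits.max(axis=1, keepdims=True)   # stabilise the softmax
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)

    onehot = np.eye(num_classes)[labels]                   # (N, C) one-hot targets
    intersection = (probs * onehot).sum(axis=0)            # soft overlap per class
    cardinality = probs.sum(axis=0) + onehot.sum(axis=0)   # soft sizes per class

    dice = (2.0 * intersection + eps) / (cardinality + eps)
    return 1.0 - dice.mean()                               # each class weighted equally
```

Because the Dice score is computed per class and then averaged, every class contributes equally regardless of how many points it contains, which is one plausible reason why the loss behaves well on an imbalanced dataset.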
To further improve the performance of the model, ensemble and test-time augmentation methods were employed. The ensemble method involved combining the predictions of three models that had been trained using different configurations.
The models used in the ensemble are as follows:
- A model trained using Dice Loss;
- A model with a DeeperHead architecture;
- A model using a combined loss.
The model predictions were then combined using the weighted average method.
4.4. Effect of Ensemble and Test-Time Augmentation
As shown in Table 6, the performance of the model improves when ensemble and test-time augmentation methods are used. Specifically, when ensemble and TTA are employed together, the model’s mIoU increases to 51.74%. This represents an improvement of approximately 4.03 percentage points in mIoU compared to the baseline model.
Analysis by class revealed that the greatest improvement was seen in the ‘building’ class. Using the ensemble method, the IoU value for the ‘building’ class increased by approximately 2.12 percentage points.
Figure 11, Figure 12 and Figure 13 show how different prediction strategies affect model performance. Using a single model yielded an mIoU value of 50.50%, whereas applying test-time augmentation increased this value to 51.15%. Using the ensemble method increased the mIoU value to 51.55%, and using the ensemble method in combination with test-time augmentation achieved the highest performance of 51.74%.
A similar pattern emerged in the accuracy metric. The best single model achieved an accuracy of 60.83%, whereas the combination of ensemble and test-time augmentation increased this figure to 61.50%.
Figure 14 shows the class-based IoU results. The final model showed slight improvements across all classes. Notably, the IoU value for the ‘building’ class increased by 2.12 percentage points, rising from 39.9% to 42.0%.
4.5. Qualitative Segmentation Results
In addition to the quantitative metrics, visual comparisons were created using selected point cloud segments from the test dataset, in order to analyse the spatial behaviour of the model outputs. These comparisons are presented in four panels for the same samples: the input point cloud; the ground truth labels; the prediction of the best single model (the model trained using Dice Loss); and the prediction of the final method (ensemble + test-time augmentation).
Figure 15a,b show this comparison for a test sample: the input point cloud, the ground truth labels, the best single-model prediction (a model trained using Dice Loss) and the final model prediction (an ensemble model with test-time augmentation).
Figure 16 compares, for a test sample, the input point cloud, the ground truth labels, and the predictions of PointNet++ MSG, the PointNet2 MSG Ensemble + TTA model, RandLA-Net, Point Transformer, and KPConv. In the visualisations, each class is represented by a different colour: ‘Unclassified’ is shown in grey, ‘Vegetation’ in green, ‘Ground’ in brown, and ‘Building’ in red.
Analysis of the visual results shows that the proposed method is highly effective in accurately identifying building boundaries. The final model utilises ensemble and test-time augmentation to produce more consistent class distributions and reduce misclassifications compared to a single model’s predictions. Specifically, roof surfaces are segmented more comprehensively in the building class, and errors resulting from overlap with the ground are reduced.
Additionally, despite having the highest point density in the dataset, the vegetation class is one of the most challenging to segment due to its geometric diversity. Examining the model predictions reveals that errors may occur in some boundary regions between the vegetation and ground classes. However, the final model produces more balanced results in these regions.
Figure 17 shows the training behaviour of the proposed model. The training loss decreases consistently across epochs, indicating that the network progressively acquires discriminative geometric features from the LiDAR point clouds. The steepest decline in loss occurs during the initial training stages, particularly within the first ten epochs, after which the decrease becomes more gradual.
Concurrently, the validation mIoU demonstrates an overall increasing trend, rising from approximately 35% in the initial epochs to around 50% in the later stages of training. Despite minor fluctuations observed during intermediate periods, the overall pattern suggests stable learning behaviour.
It is important to note that the validation curve remains stable without severe overfitting, suggesting that the model maintains a balanced learning process and is able to generalise effectively to unseen data. The learning curves demonstrate that the model attains a relatively stable convergence after approximately 15–20 epochs.
It is therefore concluded that the proposed PointNet2 MSG Transformer architecture and the applied ensemble strategy provide an effective approach for multi-class LiDAR point cloud segmentation.
4.6. Error Analysis
To examine the model’s class-based performance in more detail, a confusion matrix was calculated for the final model. Figure 18 shows the confusion matrix derived from the test dataset, normalised by row.
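Row normalisation divides each row of the confusion matrix by the number of ground-truth points in that class, so each row sums to one and the diagonal gives the per-class recall. A minimal sketch, assuming rows correspond to true classes and columns to predicted classes (the function name `row_normalise` is hypothetical):

```python
import numpy as np


def row_normalise(conf):
    """Normalise a K x K confusion matrix so that each non-empty row sums to one.

    Rows are assumed to be true classes and columns predicted classes.
    Rows with no points are left as zeros to avoid division by zero.
    """
    conf = np.asarray(conf, dtype=np.float64)
    row_sums = conf.sum(axis=1, keepdims=True)
    return np.divide(conf, row_sums, out=np.zeros_like(conf), where=row_sums > 0)
```

Read this way, an entry in row "Vegetation" and column "Ground" is the fraction of vegetation points that the model labelled as ground.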
A closer look at the confusion matrix shows that some classes are predicted more accurately than others. The “Unclassified” class was correctly classified in 90% of cases, and the “Ground” class in 85% of cases (Table 9).
A class-wise evaluation of the model’s performance reveals that it demonstrates strong performance in ground and unclassified classes, while relatively lower performance is observed for vegetation and building classes. This behavior is consistent with the geometric complexity and class imbalance present in the dataset.
The “Vegetation” class exhibits a lower accuracy rate: approximately 45% of its points are correctly classified, while 48% are predicted as “Ground”. This suggests that the geometric similarity between vegetation and the ground may impede the model’s ability to differentiate between them.
The correct classification rate for the “Building” class was 64%, while 33% of the building points were predicted as vegetation. This suggests that class confusion may occur, particularly in areas with trees and vegetation around buildings.
The confusion matrix results demonstrate that the model exhibits a high degree of accuracy in distinguishing between the Ground and Building classes, while the Vegetation class frequently exhibits confusion with the other classes.
5. Discussion
In this study, the Scale-Aware Transformer Fusion and Adaptive Scale Aggregation modules were integrated into the PointNet++ MSG architecture to address the semantic segmentation problem in aerial LiDAR point clouds. The results indicate that the proposed method improves overall performance compared to the baseline model (mIoU: 51.74%). Although this improvement may appear limited in absolute terms, it is valuable because the model delivers more balanced performance across classes. Given the large number of points in point cloud datasets, small increases in mean metrics (mIoU, IoU, accuracy) can correspond to substantial changes in the total number of errors.
The enhancement in performance can be ascribed to the content-dependent processing of multi-scale features as opposed to their predetermined combination. In the classical PointNet++ MSG approach, features obtained from neighbourhoods of different radii are directly combined; in this study, however, each scale is treated as a token within the Transformer, and cross-scale relationships are modelled. This architecture facilitates the adaptive weighting of information obtained from different scales for each centroid point. It is hypothesised that this approach could contribute to the representation of different geometric patterns in LiDAR scenes with heterogeneous structures.
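The idea of treating each scale as a token can be illustrated with a single-head self-attention over T = 3 scale tokens per centroid. This is only a conceptual sketch and not the Scale-Aware Transformer Fusion module itself: the function name `fuse_scale_tokens`, the projection matrices `w_q`, `w_k`, `w_v`, the single attention head, and the mean pooling used as a stand-in for the aggregation step are all assumptions.

```python
import numpy as np


def fuse_scale_tokens(tokens, w_q, w_k, w_v):
    """Single-head self-attention across scale tokens (conceptual sketch).

    tokens        : (N, T, D) features, one token per scale for each of N centroids
    w_q, w_k, w_v : (D, D) projection matrices (hypothetical parameters)
    Returns the fused (N, D) features and the (N, T, T) attention weights.
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v          # each (N, T, D)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])    # (N, T, T)
    scores -= scores.max(axis=-1, keepdims=True)                # stabilise the softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)                    # softmax over scales
    attended = attn @ v                                         # (N, T, D)
    # Mean pooling is a placeholder for the adaptive aggregation step.
    return attended.mean(axis=1), attn
```

Since the attention weights are computed from the token contents, each centroid receives its own mixture of scales, in contrast to a fixed concatenation; and because the attention matrix is only T x T per centroid, the added cost stays small.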
Class-based analysis provides a more detailed understanding of the model’s behaviour. Although the proposed method does not attain the highest IoU in every class, it produces balanced results in terms of overall performance. The improvement observed in the vegetation class in particular can be attributed to the model’s capacity to represent the irregular geometries of multi-scale structures. Conversely, certain attention-based methods yield superior performance in the building class. This can be attributed to the properties of building surfaces, which are predominantly regular and planar, so architectures that directly model spatial relationships have a competitive advantage there. The proposed approach, however, focuses on the adaptive fusion of multi-scale local features rather than establishing global attention across the entire point cloud.
The model’s behaviour in complex areas may be related to the joint consideration of multi-scale features. At class boundaries, such as those between buildings and vegetation, it is necessary to utilise both fine geometric details and broader contextual information in combination. The fusion mechanism utilised in this study has the potential to enhance the accuracy of predictions in such domains, as it models the interaction of information from different scales. Furthermore, it is hypothesised that test-time augmentation (TTA) and ensemble approaches can assist in reducing prediction variance, particularly in regions where certainty is lacking.
The present study is subject to several limitations. Firstly, the experiments were conducted using a single aerial LiDAR dataset, which limits the assessment of the model’s generalisation to different data distributions. The dataset was chosen to reflect real-world conditions rather than being fully cleaned, so that the model could be evaluated against practical challenges such as noise, class imbalance, and data heterogeneity. Nevertheless, further validation on other datasets is required. In addition, the proposed transformer-based fusion architecture involves extra computational steps compared to fixed fusion methods. However, this cost remains limited in practice because the attention calculation is performed on only a small number of scale tokens (T = 3).
In subsequent studies, the proposed method should be assessed on diverse aerial LiDAR datasets to evaluate its generalisation performance. Increasing dataset diversity and testing with real-world field data will reflect the model’s performance in practical applications more accurately. Further variations of the transformer-based fusion mechanism and more intricate architectural structures could also be explored. In particular, class-balanced learning strategies and data augmentation methods specifically designed to enhance performance in the building class merit investigation.
6. Conclusions
This study addressed the four-class semantic segmentation problem on aerial LiDAR point clouds, comparing various deep learning approaches with a focus on multi-class point segmentation. The experimental results showed that the PointNet2 MSG Transformer architecture achieved the highest performance. The proposed approach consists of a PointNet++ MSG-based encoder structure; Scale-Aware Transformer Fusion modules, which are used after each Set Abstraction layer; a decoder with four Feature Propagation layers; a Dice Loss-based training strategy; and a combination of weighted ensemble and test-time augmentation. The final system achieved an mIoU of 51.74% and an accuracy of 61.50% on the test dataset. This represents the best performance among the models compared.
The study examined not only a single-model architecture but also the effects of different training and inference strategies. The results demonstrate that Dice Loss is more effective than Cross-Entropy-based approaches on this class-imbalanced dataset. Furthermore, combining ensemble and test-time augmentation strategies led to improved performance, particularly in the ‘building’ class, so that the final model outperformed the best single model.
The findings suggest that a multi-class semantic segmentation approach is better suited to LiDAR-based building detection problems than single-class approaches. This is because the classes ‘building’, ‘vegetation’, ‘ground’ and ‘unclassified’ coexist in real-world applications, and the boundaries between these classes directly affect model performance. In this study, the vegetation class posed the greatest challenge. Although a significant improvement was achieved in this class using Dice Loss, the results suggest that further improvements are possible.
This study has some limitations. Firstly, the experiments were conducted on a single aerial LiDAR dataset, which limits the evaluation of generalization across different data distributions. Although a real-world dataset was intentionally selected to reflect practical conditions, further validation on additional airborne LiDAR datasets is required. Additionally, the class distribution in the dataset is imbalanced, with the vegetation class particularly dominant. Therefore, it is necessary to investigate how the model performs on different datasets and under different scene conditions.
Future work will expand data augmentation strategies and employ class-balanced sampling methods. More advanced transformer-based fusion architectures will be tested, and post-processing approaches will be evaluated to enhance spatial consistency after inference. Furthermore, performance could be improved by utilising additional models obtained from different ensemble strategies and training variations.