Deep Learning-Based Semantic Segmentation of Airborne LiDAR Point Clouds Using a Transformer-Enhanced PointNet++ Architecture

Sevinc, Hacer Kubra; Karas, Ismail Rakip

doi:10.3390/geomatics6030043

by Hacer Kubra Sevinc ¹,* and Ismail Rakip Karas ²

¹ Vocational School of Information Technologies, Karabük University, 78050 Karabük, Turkey
² Department of Computer Engineering, Karabük University, 78050 Karabük, Turkey
* Author to whom correspondence should be addressed.

Geomatics 2026, 6(3), 43; https://doi.org/10.3390/geomatics6030043

Submission received: 22 March 2026 / Revised: 22 April 2026 / Accepted: 27 April 2026 / Published: 29 April 2026

Highlights

What are the main findings?

The proposed transformer-enhanced PointNet++ model achieves the highest overall mIoU (51.74%) among both baseline and state-of-the-art architectures on the Oregon LiDAR dataset.
Integrating scale-aware transformer-based feature fusion with ensemble learning and test-time augmentation improves segmentation stability and class-wise balance.

What are the implications of the main findings?

In LiDAR semantic segmentation tasks, transformer-based adaptive multi-scale feature fusion provides a robust alternative to fixed feature aggregation methods.
The proposed approach is a practical and scalable solution for addressing class imbalance and geometric complexity in real-world airborne LiDAR applications such as urban mapping and smart city systems.

Abstract

Airborne LiDAR (Light Detection and Ranging) data is widely used in urban modelling and three-dimensional spatial analysis. However, the irregular structure of LiDAR point clouds, varying point densities, and class imbalance make semantic segmentation challenging. This study addresses a four-class semantic segmentation problem (unclassified, vegetation, ground, and building) on airborne LiDAR point clouds. The Oregon LiDAR Program dataset was obtained through the OpenTopography platform. The point cloud data were resampled to 4096 points to ensure a fixed input size; for each point, the X, Y, and Z coordinates, along with the RGB and intensity features, were used. The proposed method was compared with both baseline models (PointNet, PointNet++ MSG, and VoxelNet Lite) and recent state-of-the-art architectures, including Point Transformer, KPConv, and RandLA-Net. The proposed PointNet2 MSG Transformer model builds on the PointNet++ MSG architecture and adds a transformer-based feature fusion module. Different loss functions and training configurations were evaluated, and the effects of ensemble learning and test-time augmentation on model performance were analyzed. The experimental results show that the proposed approach achieved a mean Intersection over Union (mIoU) of 51.74% and an accuracy of 61.50% on the test dataset. These results demonstrate that combining multi-scale feature extraction with transformer-based feature fusion is an effective approach for multi-class semantic segmentation of LiDAR point clouds.

Keywords:

LiDAR point cloud; 3D point cloud segmentation; deep learning; machine learning; PointNet++; transformer-based architectures

1. Introduction

Airborne LiDAR is widely used for large-scale 3D mapping and urban modelling. LiDAR systems generate high-density point cloud data, which represents the geometric structure of the ground and objects on it with high accuracy. LiDAR data has been extensively utilised in urban areas across a wide range of applications, including the classification of land cover and use, the generation of digital elevation models (DEM) and digital surface models (DSM), the extraction of buildings and roads, the creation of 3D city models, forestry analyses, and disaster management [1,2,3,4,5,6].

In the context of LiDAR data analysis, semantic segmentation is the automated assignment of each point in the point cloud to a class (e.g., road, building, vehicle, pedestrian, tree). This supports both geometric and semantic understanding of the real-world 3D scene. Accurate classification is particularly important in applications such as urban modelling, infrastructure monitoring, and autonomous driving [7,8,9]. Nevertheless, the semantic segmentation of LiDAR point clouds remains challenging because of their irregular structure, varying point densities, noise, and the complex spatial relationships between objects [8,10,11].

Traditional approaches to point cloud classification rely on manually designed geometric features and classical machine learning algorithms. These methods frequently require extensive feature engineering and may struggle to capture complex spatial patterns in large-scale datasets [8,12,13].

In recent years, deep learning methods have significantly advanced point cloud analysis. Deep learning-based approaches fall into three primary groups: point-based, voxel-based, and projection-based (in which 3D points are converted into a 2D range image) [7,14,15]. PointNet is a deep learning architecture that accepts three-dimensional point clouds as input without any prior transformation and operates in a permutation-invariant manner, i.e., it is unaffected by the order of the points. The network first processes each point independently and then combines the extracted features into a single global feature using a symmetric function (max pooling) to perform tasks such as object classification and segmentation [16]. PointNet++ is a hierarchical neural network architecture that addresses the limitations of PointNet in capturing local structures: it subdivides the input into smaller local regions and extracts features based on the distances between points within each region [17]. These architectures are among the fundamental methods used in a large number of point cloud semantic segmentation studies. Voxel-based approaches (e.g., VoxNet, 3D-ShapeNet) convert the point cloud into a 3D grid and apply 3D convolutions; while these approaches effectively capture global context and volumetric structure, memory and computational costs grow cubically with grid resolution, and much of that cost is spent on empty voxels [18,19,20,21]. Projection-based methods, in contrast, project 3D points onto 2D range images and leverage pre-trained 2D CNN architectures. 3D-MiniNet and numerous other LiDAR networks achieve real-time speeds in autonomous driving through this approach. However, some spatial accuracy and depth detail may be lost during projection [14,22,23,24].

Distinguishing buildings, ground, and vegetation in airborne LiDAR point clouds is challenging because of both geometric similarities and limitations in data quality. The ground and building roofs frequently appear as wide, flat, or gently sloping surfaces, which causes building–ground confusion in methods that rely exclusively on simple geometric characteristics such as elevation or planarity [25,26]. In areas of dense vegetation, ground points become sparse or disappear entirely: LiDAR pulses are blocked by the canopy, and the canopy returns can form a quasi-planar layer that resembles a flat surface [27,28]. In particular, low shrubs, grassy areas, and other vegetation close to the ground are similar to the ground in height and roughness, which further complicates classification [28,29].

In complex urban areas, where modern development and natural vegetation coexist, the overlap of all three classes within the same area serves to further blur class boundaries. The problem is exacerbated by sparse point clouds, irregular sampling, and objects at different scales [29,30,31]. Consequently, recent studies have concentrated on the simultaneous separation of buildings, ground, and vegetation using multi-feature spaces (height, shape, texture, density), adaptive neighborhood selection, and advanced machine learning/deep learning models (Random Forest, XGBoost, and deep neural networks). Nevertheless, errors and ambiguities remain significant, particularly in areas with dense vegetation, shading, and complex roof–tree interactions [12,29,30,31,32,33].

3D LiDAR sensors provide a detailed point cloud representation of the environment across a wide range of fields, from autonomous driving to smart city infrastructure. The direct processing of this data in 3D space is challenging due to variations in density and scale, noise, irregular sampling, and large data volumes. Consequently, in recent years, 3D deep learning methods operating on point clouds—particularly semantic segmentation models—have played a key role in the automatic understanding of complex scenes. Nevertheless, the efficient processing of large-scale open-area scenes and data-related issues such as class imbalance continue to limit the performance of existing methods [34,35].

The objective of this study is to propose an approach to the multi-class semantic segmentation of airborne LiDAR point clouds, with particular attention to the building class. To this end, a four-class segmentation problem (Unclassified, Vegetation, Ground, and Building) was defined using the Oregon LiDAR dataset obtained via the OpenTopography platform.

The main contributions of this study can be summarized as follows:

Different deep learning architectures, including baseline architectures (PointNet, PointNet++ MSG, and VoxelNet Lite) as well as recent state-of-the-art methods such as Point Transformer, KPConv, and RandLA-Net, were comparatively evaluated on the same dataset for point cloud segmentation.
A model named PointNet2 MSG Transformer is proposed; it builds on the multi-scale feature extraction of the PointNet++ MSG architecture and adds transformer-based feature fusion.
The effects of different training configurations, loss functions, ensemble learning, and test-time augmentation methods on model performance were analysed.
The experimental results demonstrate that the proposed approach achieves an mIoU of 51.74% and an accuracy of 61.50% on the test dataset.

The primary challenges in point cloud segmentation stem from the irregular and sparse data structure, variations in density, noise, and geometric uncertainty. Class-specific performance is further reduced by imbalanced and mutually similar classes, small but important objects, inconsistent labelling schemes, and labour-intensive labelling processes. Addressing these challenges requires sampling techniques that maintain class balance, networks that make effective use of contextual information, and high-quality, standardised labels. The present study proposes a Transformer-enhanced PointNet++ MSG architecture that enables adaptive multi-scale feature fusion. Additionally, model ensembles and test-time augmentation (TTA) were employed to enhance prediction stability, particularly in challenging boundary regions. The proposed approach was evaluated for multi-class LiDAR segmentation with particular attention to the building class. Unlike conventional PointNet++-based approaches that rely on fixed multi-scale feature concatenation, the proposed scale-aware transformer fusion mechanism dynamically learns the importance of features at different scales. This enables improved representation of complex structures such as vegetation and building boundaries, particularly under class imbalance conditions.

The remainder of the paper is organized as follows: Section 2 reviews relevant studies in the literature on LiDAR point cloud segmentation and deep learning-based approaches. Section 3 provides a detailed discussion of the dataset used, data pre-processing, sampling strategies, and the proposed PointNet2 MSG Transformer architecture. Section 4 presents the experimental results and performance analyses, while Section 5 discusses the findings. Section 6 summarizes the general conclusions drawn from the study and outlines future work.

2. Related Work

2.1. Point Cloud-Based Semantic Segmentation

A point cloud is a set of (x, y, z) points in a three-dimensional coordinate system that shows the outline of an object or surface. In the case of airborne LiDAR systems, these coordinates may be accompanied by additional attributes, such as reflection intensity, return number, and scan angle [36]. Sometimes, RGB colour information is also included. The most fundamental characteristic that distinguishes point clouds from traditional digital images is their unstructured and unordered nature [37].

Semantic segmentation involves assigning each point in a point cloud to one of a set of predefined semantic categories (e.g., ground, building, tall vegetation, or water body) [38]. This process identifies the location of objects and provides a complete understanding of the scene by indicating what each data point represents [39].

Semantic segmentation has a wide range of applications in aerial LiDAR data. In urban planning, for example, the automatic segmentation of building roofs and facades is essential for creating 3D city models. In forestry, tree species classification, biomass estimation, and forest health analysis depend directly on the accuracy of the segmentation [40]. In infrastructure inspection processes, such as power line and pole detection, deep learning-based segmentation methods are widely used to reduce the manual workload and enhance safety. In cultural heritage preservation efforts, semantic segmentation is an indispensable tool for detailed geometric analysis of historical structures [41].

Airborne LiDAR data poses unique challenges due to its capacity to survey much larger areas than terrestrial laser scanning (TLS) can and because the sensor is mounted on a moving platform (an aircraft or UAV). The most significant challenge is the issue of variable point density. Changes in flight altitude, speed, and scanning angle result in a non-uniform distribution of ground points. This causes classical algorithms that search for fixed-radius neighborhoods to fail in low-density regions [42].

The second major challenge is occlusion (shadowing). When laser beams are emitted at vertical or oblique angles, the lower parts of buildings or objects beneath dense tree cover cannot be fully scanned. This leads to gaps or missing geometry in the dataset [43]. The third challenge is scale: aerial LiDAR scenes typically consist of millions or even billions of points, and processing data at this scale poses significant challenges in terms of memory management and computation time [44]. Finally, erroneous returns (outliers) caused by sensor noise and atmospheric conditions significantly reduce accuracy, particularly when segmenting fine structures such as power lines [45].

2.2. PointNet and Related Architectural Approaches

Before the era of deep learning, point cloud analysis relied heavily on manually created geometric features. These methods involved calculating statistical values such as curvature, normals, linearity, or planarity within the local neighborhood of each point to attempt classification [41]. However, these features were insufficient for capturing semantic variations in complex scenes, and manual parameter tuning was required for different datasets.

Another approach, known as “voxelization,” has enabled the use of 3D convolutional neural networks (CNNs) by converting point clouds into regular 3D grids. However, this method wastes memory by processing empty space, and computational costs increase cubically as grid resolution increases [46]. Additionally, fitting data to the grid causes the loss of small geometric details, which makes fine objects hard to distinguish, particularly in aerial LiDAR data. These limitations have prompted the development of architectures that accept data in its original form and learn features directly from raw coordinates.

Proposed in 2017, PointNet is the first deep learning architecture to take point clouds as direct input and extract features independently of the order of the points [47]. Its success is based on multi-layer perceptrons (MLPs) being applied independently to each point, followed by a global symmetry function (typically max-pooling) [48]. This structure ensures that the same global feature vector is produced regardless of changes to the order of the points in the input set.
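The permutation invariance described above can be checked numerically. The following is a minimal sketch, not the PointNet implementation: the layer widths, the random weights, and the function names `shared_mlp` and `global_feature` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_mlp(points, w1, w2):
    """Apply the same two-layer MLP to every point (row) independently."""
    hidden = np.maximum(points @ w1, 0.0)   # ReLU
    return np.maximum(hidden @ w2, 0.0)

def global_feature(points, w1, w2):
    """Max-pool the per-point features into one order-independent vector."""
    return shared_mlp(points, w1, w2).max(axis=0)

points = rng.normal(size=(128, 3))          # toy cloud: 128 points, xyz only
w1 = rng.normal(size=(3, 16))               # hypothetical layer widths
w2 = rng.normal(size=(16, 32))

shuffled = points[rng.permutation(len(points))]
feat_a = global_feature(points, w1, w2)
feat_b = global_feature(shuffled, w1, w2)
print(np.allclose(feat_a, feat_b))          # the two feature vectors agree
```

Because the maximum over a set does not depend on the order of its elements, shuffling the rows leaves the pooled vector unchanged up to floating-point rounding.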

PointNet uses alignment networks, known as ‘T-Net’, to ensure robustness against geometric transformations (e.g., rotation and scaling) in the data. These networks enable the model to learn independently of the viewpoint by aligning the data to a canonical space before processing [49]. Although PointNet produced groundbreaking results for its time in object classification and part segmentation tasks, its architecture, which processes each point independently, fails to capture local relationships between points and spatial context [50]. This methodological limitation results in significantly poorer performance, particularly in distinguishing the characteristic structures of buildings or the complex local textures of trees in large-scale aerial LiDAR scenes.

PointNet++ was developed to address the lack of local features in PointNet by extending the hierarchical structure of CNNs to point clouds. It consists of ‘Set Abstraction’ layers that progressively divide the data into smaller subsets, extracting local features from each one [51]. These layers comprise three main steps: sampling, grouping, and the PointNet layer.

During the sampling phase, the Farthest Point Sampling (FPS) algorithm selects the center points that best represent the scene. During the grouping phase, the neighboring points around these center points are grouped together using either the ball query (fixed radius) method or the K-nearest neighbors (KNN) method [52]. Finally, a small PointNet is applied to each local group to extract the geometric summary of that region. This hierarchical approach significantly improves segmentation performance by enabling the model to first learn small details and then broader structures.
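The sampling and grouping steps can be sketched in a few lines of NumPy. This is an illustrative brute-force version under stated assumptions: the function names, the starting index, and the neighbor cap are hypothetical, and production code would normally use a GPU kernel or a spatial index instead.

```python
import numpy as np

def farthest_point_sampling(xyz, n_samples, start=0):
    """Return indices of n_samples points chosen greedily to be far apart."""
    n = xyz.shape[0]
    chosen = np.empty(n_samples, dtype=np.int64)
    min_dist = np.full(n, np.inf)        # squared distance to nearest chosen point
    current = start
    for i in range(n_samples):
        chosen[i] = current
        d = np.sum((xyz - xyz[current]) ** 2, axis=1)
        min_dist = np.minimum(min_dist, d)
        current = int(np.argmax(min_dist))   # next center: farthest from all chosen
    return chosen

def ball_query(xyz, centers, radius, max_neighbors):
    """For each center index, list up to max_neighbors point indices within radius."""
    groups = []
    for c in centers:
        d = np.sum((xyz - xyz[c]) ** 2, axis=1)
        groups.append(np.flatnonzero(d <= radius ** 2)[:max_neighbors])
    return groups
```

A typical call would be `centers = farthest_point_sampling(xyz, 1024)` followed by `ball_query(xyz, centers, radius, max_neighbors)`, where the radius and the cap are tuning choices that the text does not specify at this point.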

One of the major challenges in aerial LiDAR data is variable point density. This has been addressed in the PointNet++ architecture using the Multi-Scale Grouping (MSG) mechanism. Using a fixed radius for grouping results in insufficient sampling in regions with low data density and an unnecessary computational load in regions with high density [53].

MSG defines multiple neighbourhood regions with different radius values around each sampling point and concatenates the features obtained from these regions [54]. For instance, a small radius captures fine details, such as a building’s chimney, while a larger radius helps to understand the object’s general form, such as the roof of a building. This method ensures that the model produces consistent features, even in regions with varying levels of density. In aerial LiDAR data, the MSG mechanism plays a critical role in maintaining segmentation accuracy, particularly in the transitions between dense areas where flight paths intersect and sparse edge regions [55].
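The multi-radius idea can be illustrated with a toy descriptor. In the sketch below, a simple max-pool of absolute offsets stands in for the learned PointNet that MSG applies at each scale; the function names and the radii used in the usage note are assumptions, not values from the study.

```python
import numpy as np

def scale_descriptor(xyz, center_idx, radius):
    """Max-pool absolute offsets of the neighbors inside one ball.

    Stand-in for the small learned PointNet applied at each scale.
    """
    offsets = xyz - xyz[center_idx]
    inside = np.sum(offsets ** 2, axis=1) <= radius ** 2   # always includes the center
    return np.abs(offsets[inside]).max(axis=0)

def msg_feature(xyz, center_idx, radii):
    """Concatenate one descriptor per radius (fixed multi-scale grouping)."""
    return np.concatenate([scale_descriptor(xyz, center_idx, r) for r in radii])
```

Calling `msg_feature(xyz, 0, radii=(0.1, 0.2, 0.4))` yields a 9-dimensional vector: three values per scale, joined by plain concatenation. That concatenation is the fixed fusion step that a learned, content-dependent weighting would replace.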

2.3. Transformer Models and Ensemble Learning Approaches

Although PointNet++ performs well with regard to local geometry in aerial LiDAR data, the requirement to handle global context and varying density in large-scale scans has led to a shift towards attention-based Transformer architectures [56].

Unlike traditional methods, which typically rely on fixed neighborhood radii or K-nearest neighbor (KNN) strategies, Transformers can dynamically weight the relationship between each point and all other points (or points within a specified region).

The greatest advantage of this approach to aerial LiDAR data is the ability to establish long-range dependencies. For instance, in a city-scale scan, the semantic relationship between a building’s roof and ground-level objects can be understood not only through local geometric features but also in the context of the entire scene. Transformer models can evaluate points that share similar semantic features but are spatially distant from one another within the same context while preserving structural integrity across large scenes [56]. This capability allows for a more accurate determination of object boundaries, particularly in complex urban areas or terrain with dense vegetation.

Recent advances in point cloud segmentation have given rise to a variety of architectures that address large-scale and complex scenes. Transformer-based models, such as Point Transformer [57], employ attention mechanisms to capture contextual relationships between points and have demonstrated improved representation capabilities in complex environments. In addition to attention-based approaches, convolution-based methods such as KPConv [58] and efficient sampling-based networks such as RandLA-Net [59] have also achieved strong performance by focusing on geometric feature extraction and scalability for large point clouds.

Class imbalance in airborne LiDAR 3D point clouds poses a considerable challenge, because rare or small objects (for example, power lines, traffic signs, and chimneys) are typically represented by very few samples. This imbalance causes deep learning-based segmentation and classification models to focus on common classes (e.g., ground, buildings, trees) and to achieve low accuracy on minority classes [35,60,61]. Among the methods developed in recent years, data augmentation (e.g., cut-and-paste, synthetic sample generation), sampling strategies (over-sampling/under-sampling), weighted loss functions (e.g., focal loss), and attention mechanisms specific to rare classes are of particular note [60,62]. These approaches have been adapted to the irregular and sparse nature of airborne LiDAR data and to the complexity of large-scale scenes [63], which improves the recognition of rare classes and the overall segmentation performance. Nevertheless, further research is required to develop methods that achieve fully balanced performance in real-world applications [64].

Despite these advances, challenges pertaining to multi-scale feature representation and varying point density persist, particularly in the context of airborne LiDAR data. In this context, the method proposed in this study focuses on the adaptive fusion of multi-scale features within the PointNet++ MSG framework, with the aim of improving robustness in heterogeneous airborne LiDAR scenes.

In summary, while existing approaches—including Transformer-based and convolution-based methods—have achieved significant progress in LiDAR point cloud segmentation, they still face challenges in handling class imbalance, capturing multi-scale geometric structures, and modeling long-range dependencies simultaneously. In particular, the use of fixed multi-scale feature fusion limits the adaptability of these methods in complex scenes. To address these limitations, this study proposes a transformer-enhanced multi-scale feature fusion approach combined with ensemble learning strategies to improve segmentation robustness and stability.

3. Materials and Methods

3.1. Data Source and Dataset Structure

The LiDAR dataset used in this study was generated as part of the Oregon Department of Geology and Mineral Industries (DOGAMI) LiDAR Program and was obtained via the OpenTopography platform [65]. The study area covers the region surrounding Oregon State University (Figure 1).

The raw LiDAR point cloud data was divided into 100 m × 100 m square tiles using the CloudCompare v2.14.alpha software program. This process ensured that the data was divided into smaller, more manageable units for training purposes. A total of 580 tiles were obtained. Of these, 480 were allocated for training and 100 for testing (Figure 2).

During model training, 20% of the training data was set aside for validation purposes (val_split = 0.2). As a result, 384 out of 480 training tiles were used for training the model, while 96 were used for validation. The 100 tiles that made up the test data were not included in the training process but were used solely for the final performance evaluation.
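The tile counts above follow directly from the split ratio. The short sketch below reproduces the arithmetic and shows one way to make such a split reproducible; the integer tile IDs and the fixed seed are hypothetical and are not taken from the study.

```python
import random

total_tiles = 580
test_tiles = 100
val_split = 0.2

train_pool = total_tiles - test_tiles        # 480 tiles available for training
n_val = round(train_pool * val_split)        # 96 validation tiles
n_train = train_pool - n_val                 # 384 training tiles

# Hypothetical tile IDs; a fixed seed keeps the split reproducible.
tile_ids = list(range(train_pool))
random.Random(0).shuffle(tile_ids)
val_ids, train_ids = tile_ids[:n_val], tile_ids[n_val:]
```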

For each point in the dataset, there are X, Y, and Z coordinates, red, green, and blue color values, intensity information, and a class label. Therefore, the data contains geometric and radiometric properties.

Four classes have been defined for semantic segmentation. The classes are as follows: 0 is ‘Unclassified’, 1 is ‘Vegetation’, 2 is ‘Ground’ and 3 is ‘Building’. Table 1 shows the distribution of points across these classes in the dataset.

Examining Table 1 shows that the Vegetation class makes up 57.21% of the dataset, while the Ground class makes up 27.89%. Meanwhile, the Building and Unclassified classes each account for around 7%. This indicates that the class distribution in the dataset is imbalanced. These proportions were taken into account during model training and performance evaluation.

The dataset presents a significant class imbalance problem: vegetation accounts for more than half of the total points, while the Building and Unclassified classes each represent less than 10%. This imbalance directly affects the learning process and motivates the use of specialized loss functions and evaluation strategies in this study.
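One common way to counter such an imbalance is to weight the loss by inverse class frequency. The sketch below illustrates that general technique only; the text does not state here which weighting the study used. The Vegetation and Ground shares come from the text, whereas the Building and Unclassified shares are assumed to split the remaining 14.90% evenly, because their exact values are not quoted.

```python
import numpy as np

class_names = ["Unclassified", "Vegetation", "Ground", "Building"]

# Vegetation (57.21%) and Ground (27.89%) are from the text. The other two
# values are ASSUMED: the remaining 14.90% is split evenly between them.
share = np.array([0.0745, 0.5721, 0.2789, 0.0745])

weights = 1.0 / share                  # rarer classes get larger weights
weights = weights / weights.mean()     # normalise so the average weight is 1

for name, w in zip(class_names, weights):
    print(f"{name:>12}: {w:.2f}")
```

Under these assumed shares the two minority classes receive the largest weights and Vegetation the smallest, which is the intended effect of any frequency-based reweighting.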

3.2. Fixed Point Sampling Strategy

PointNet and PointNet++ based architectures are designed to operate on fixed-size input tensors. Therefore, it is not possible to feed raw LiDAR point clouds, which contain varying numbers of points, directly into the model. To ensure that the model can accept inputs of the same size for each sample, all point cloud segments have been converted to contain a fixed number of points.

In LiDAR scenes, however, there is no single ‘correct’ number of points. The typical range varies depending on the task and the scale of the object or scene. For example, 1024–2048 points are commonly selected for small-scale studies with simple objects, whereas 2048–4096 points are chosen for scenes containing more complex objects [66,67]. In this study, each point cloud sample contained 4096 points during the training phase. This sampling size is commonly used in PointNet++-based semantic segmentation studies and provides sufficient geometric detail for hierarchical feature extraction [68].

Two cases were considered when fixing the number of points. If a point cloud segment contained 4096 or more points, random subsampling was applied [69,70], and 4096 points were selected at random and fed into the model as input. If there were fewer than 4096 points in a tile, padding was applied and points were resampled from the existing set to increase the input size to 4096. This method ensures that all samples in the dataset are included in the training process [71,72].
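The two cases can be written as a single helper. This is a minimal sketch assuming each tile is a NumPy array with one row per point; the function name `sample_fixed` and the choice to keep every original point before padding are illustrative assumptions.

```python
import numpy as np

def sample_fixed(points, n_points=4096, rng=None):
    """Return exactly n_points rows: subsample large tiles, pad small ones."""
    rng = np.random.default_rng() if rng is None else rng
    n = points.shape[0]
    if n >= n_points:
        # Enough points: random subsampling without replacement.
        idx = rng.choice(n, size=n_points, replace=False)
    else:
        # Too few points: keep them all, then pad by resampling existing points.
        extra = rng.choice(n, size=n_points - n, replace=True)
        idx = np.concatenate([np.arange(n), extra])
    return points[idx], idx
```

Returning the indices alongside the sampled rows makes it possible to apply the same selection to the label array.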

In the model input, each point is represented by a feature vector consisting of seven attributes. These attributes are the X, Y, and Z coordinates, the R, G, and B color values, and the LiDAR intensity data. Thus, the model can utilize both the point’s spatial location and its radiometric properties.

In order to reduce computational costs during the inference phase, the point clouds were downsampled to 2048 points. The model then generates class predictions for these points, which are subsequently interpolated onto the original point cloud. For this process, the nearest neighbor interpolation method was used. For each original point, the nearest point in the downsampled set was identified, and its prediction was assigned to the corresponding original point. Thus, class predictions were obtained for the entire point cloud [73]. This interpolation technique is effective within the PointNet++ ecosystem for transferring knowledge learned on sparse points to full-resolution point clouds, thereby reducing computational cost.
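The label transfer described above amounts to a nearest-neighbor lookup. The sketch below uses a chunked brute-force search in NumPy to stay dependency-free; the function name and chunk size are assumptions, and a KD-tree would be the usual choice for large tiles.

```python
import numpy as np

def propagate_labels(full_xyz, sparse_xyz, sparse_labels, chunk=1024):
    """Give every original point the label of its nearest downsampled point."""
    out = np.empty(full_xyz.shape[0], dtype=sparse_labels.dtype)
    for start in range(0, full_xyz.shape[0], chunk):
        block = full_xyz[start:start + chunk]
        # Squared distances between this block and all downsampled points.
        d2 = ((block[:, None, :] - sparse_xyz[None, :, :]) ** 2).sum(axis=2)
        out[start:start + chunk] = sparse_labels[d2.argmin(axis=1)]
    return out
```

Processing the full cloud in chunks bounds the size of the temporary distance matrix, which matters when a tile holds far more points than the 2048 passed to the model.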

3.3. Compared Deep Learning Models

This study applied a total of four different model architectures to evaluate the performance of semantic segmentation on LiDAR point clouds: PointNet, PointNet++ MSG, VoxelNet Lite, and the PointNet2 MSG Transformer model developed in this study (Table 2). The three baselines were selected because they represent point clouds in different ways, as described below.

PointNet is one of the first deep learning architectures designed to process point cloud data directly. The model treats the point cloud as an unordered set of points, performing feature extraction using shared multilayer perceptron layers for each point. These point features are then combined via a symmetric pooling operation to obtain a global feature vector. Although this approach is simple and computationally efficient, it can only represent local geometric relationships to a limited extent [16].

PointNet++ is an enhanced version of the PointNet architecture that uses a hierarchical structure to capture local geometric features more effectively. The model extracts features at different scales by progressively downsampling the point cloud through Set Abstraction layers. This study used the Multi-Scale Grouping (MSG) variant of the PointNet++ architecture. The MSG approach enables multi-scale feature learning by forming neighborhood groups at different radii [17].

The third architecture used for comparison is VoxelNet Lite. VoxelNet-based approaches convert the point cloud into a regular voxel grid structure, performing feature extraction using 3D convolution operations. This enables the spatial structure of the point cloud to be modeled with a more regular representation. In this study, VoxelNet Lite—a lighter version of the VoxelNet architecture—was used to reduce computational costs [21].
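Converting a point cloud into a voxel grid can be sketched as follows. This illustrates voxelization in general rather than the VoxelNet Lite pipeline; the function name and the voxel size used in the usage note are hypothetical.

```python
import numpy as np

def voxelize(xyz, voxel_size):
    """Map points to integer voxel coordinates and list the occupied voxels."""
    origin = xyz.min(axis=0)
    ijk = np.floor((xyz - origin) / voxel_size).astype(np.int64)
    occupied, inverse = np.unique(ijk, axis=0, return_inverse=True)
    grid_shape = ijk.max(axis=0) + 1
    # inverse[p] is the row of `occupied` that point p falls into.
    return occupied, inverse.reshape(-1), grid_shape
```

Comparing `len(occupied)` with `grid_shape.prod()` for a call such as `voxelize(xyz, 5.0)` shows how much of a dense grid would be empty.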

These three models represent different strategies for data representation. PointNet uses a direct point-based approach; PointNet++ uses a hierarchical point-based approach that considers local features, and VoxelNet Lite uses a voxel-based representation. Therefore, these models provide a suitable basis for evaluating the performance of the proposed architecture.

3.4. Proposed Model: PointNet2 MSG Transformer

This study presents a PointNet++ MSG-based model architecture developed to enhance the performance of semantic segmentation on LiDAR point clouds. This model combines multi-scale feature extraction with a transformer-based feature fusion mechanism and is named the ‘PointNet2 MSG Transformer’ (Figure 3).

The PointNet++ architecture is designed to perform hierarchical feature extraction from point clouds. It learns geometric features at different scales by progressively subsampling the point cloud through Set Abstraction (SA) layers. Each SA layer selects a specific number of points and extracts local features from the regions surrounding them. The Farthest Point Sampling (FPS) method is used during this process to select sample points that represent the spatial distribution of the point cloud.

At the model’s input, each point is represented by a seven-dimensional feature vector.

p_{i} = (x_{i}, y_{i}, z_{i}, r_{i}, g_{i}, b_{i}, I_{i})

Here, the coordinates x, y, z represent the point’s spatial position; the values r, g, b represent color information; and the value I represents LiDAR intensity information.

To ensure a fixed size, the point clouds used as model input have been standardized to 4096 points. This sampling size is commonly used in PointNet++-based semantic segmentation studies and provides sufficient geometric detail for hierarchical feature extraction.

The PointNet2 MSG Transformer architecture contains four Set Abstraction layers. The number of points is gradually reduced across these layers.

4096 \to 1024 \to 256 \to 64 \to 16

This hierarchical structure facilitates the identification of both local and global geometric features within the point cloud. Small-scale geometric details are learned at lower levels, while broader spatial contexts are represented at higher levels.

In the PointNet++ MSG architecture, each Set Abstraction layer creates neighborhood regions with different radii to perform multi-scale grouping. This approach allows geometric structures of different sizes to be represented simultaneously.
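A simplified sketch of multi-scale grouping is given below. The radii, the neighbour cap, and the function name `ball_query` are hypothetical values chosen for illustration; the radii used in this study are not reproduced here.

```python
import numpy as np

def ball_query(xyz, centroids, radius, max_neighbors):
    """For each centroid, return up to max_neighbors indices of points within radius."""
    groups = []
    for c in centroids:
        dist = np.linalg.norm(xyz - c, axis=1)
        groups.append(np.flatnonzero(dist <= radius)[:max_neighbors])
    return groups

rng = np.random.default_rng(0)
xyz = rng.random((512, 3))
centers = xyz[:4]                      # four centroids taken from the cloud
radii = (0.1, 0.2, 0.4)                # hypothetical radii, one per scale
multi_scale = {r: ball_query(xyz, centers, r, max_neighbors=32) for r in radii}
```

Each centroid thus receives one neighbourhood per radius. Practical implementations usually pad every group to a fixed size so that the groups can be batched; that step is omitted here.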

This study extends the classic PointNet++ MSG architecture with the scale-aware transformer fusion mechanism (Figure 3).

3.4.1. Scale-Aware Transformer Fusion

This study proposes an alternative to the standard multi-scale feature fusion used in the traditional PointNet++ MSG architecture: a mechanism termed Scale-Aware Transformer Fusion. The module aims to build a more adaptable representation by weighting the features obtained at different scales according to feature content rather than a predetermined fusion rule. As shown in Figure 4, multi-scale features are converted into tokens and processed by a Transformer encoder, followed by an adaptive gating mechanism.

In the PointNet++ MSG architecture, each Set Abstraction (SA) layer generates three distinct feature sets from three neighbourhood regions with differing radii centred on the same centroid points. These features are denoted as follows:

\mathrm{scale\_feats} = [F^{(0)}, F^{(1)}, F^{(2)}]

In this context, each

F^{(t)} \in R^{B \times S \times C_{t}}

represents the features of scale t. B denotes the batch size, S the number of centroids, C_t the number of channels at scale t, and T the number of scales. The complete notation used in this study is summarized in Table 3.

Given the potential for variation in the number of channels across scales, all features are transformed into a common dimension through the utilisation of linear projection layers.

{\tilde{F}}^{(t)} \in R^{B \times S \times C}

Subsequent to this process, the features from the three scales are treated as a sequence for each centroid point. Consequently, three “tokens” are created for each centroid:

X \in R^{B \times S \times T \times C}, T = 3

The Transformer is applied independently to each centroid point, so the data are reorganised as follows:

X \in R^{(B \cdot S) \times T \times C}

This arrangement means that the Scale-Aware Transformer Fusion module models only inter-scale relationships: attention is calculated across the three scales of each centroid rather than across the entire point cloud.
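The tokenisation and reshaping steps can be traced with a short NumPy sketch. The sizes B, S, and C below are arbitrary illustrative values, and the random arrays stand in for projected scale features.

```python
import numpy as np

B, S, T, C = 2, 16, 3, 8               # illustrative sizes only
rng = np.random.default_rng(0)

# Three projected scale features, each of shape (B, S, C).
scale_feats = [rng.standard_normal((B, S, C)) for _ in range(T)]

# Stack along a new scale axis to obtain (B, S, T, C).
X = np.stack(scale_feats, axis=2)

# Merge batch and centroid axes: every centroid becomes one sequence of T tokens.
tokens = X.reshape(B * S, T, C)
```

After the reshape, row `b * S + s` of `tokens` holds the three scale tokens of centroid `s` in batch element `b`, so a sequence model applied to `tokens` never mixes information between centroids.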

The Transformer Encoder employs a multi-head self-attention mechanism. The query, key, and value matrices are computed for the input tokens:

Q = X W^{Q}, K = X W^{K}, V = X W^{V}

The attention mechanism is defined as follows:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_{k}}}\right) V

Here, Q, K and V represent the query, key, and value matrices, respectively, and d_k is the dimension of the key vectors.

Through this process, each scale token is updated using information from the other scales. The architecture employed in this study consists of two Transformer Encoder layers, each of which incorporates residual connections, LayerNorm, and GELU activation. Figure 5 illustrates how self-attention models the relationships between tokens corresponding to different spatial scales.

The proposed Scale-Aware Transformer Fusion module differs significantly from direct attention-based point cloud models such as the Point Transformer. Point Transformer aims to model spatial relationships by computing attention across all points or broad neighborhoods. The proposed architecture, however, computes attention only across the three scales of each centroid; that is, it focuses on modelling inter-scale relationships. While the attention cost in Point Transformer increases with the number of points, here attention operates on a fixed and small number of tokens (T = 3). Consequently, the computational cost remains low, and the hierarchical structure of PointNet++ is maintained.

In this approach, the attention matrix size is only 3 × 3 for each head, keeping the model’s computational overhead limited. Unlike methods that use global attention across the entire point cloud, this design offers a more efficient architectural configuration.
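The 3 × 3 attention pattern can be made concrete with a single-head NumPy sketch of scaled dot-product attention over the scale tokens. The random projection matrices and all sizes are placeholders; the multi-head structure, residual connections, LayerNorm, and GELU of the actual encoder layers are deliberately omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scale_attention(tokens, w_q, w_k, w_v):
    """Single-head attention across the T scale tokens of every centroid.

    tokens: (N, T, C) with N = B * S; w_q, w_k, w_v: (C, d_k).
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)   # (N, T, T)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
N, T, C = 32, 3, 8                               # illustrative sizes only
tokens = rng.standard_normal((N, T, C))
w_q, w_k, w_v = (rng.standard_normal((C, C)) for _ in range(3))
out, attn = scale_attention(tokens, w_q, w_k, w_v)
```

The attention tensor has shape (N, 3, 3): its size per centroid is fixed by the number of scales and does not grow with the number of points in the cloud.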

The Transformer Encoder output is obtained as follows:

X_{enc} \in R^{(B \cdot S) \times T \times C}

This output represents updated features that incorporate cross-scale interactions for each centroid point and is passed to the Adaptive Scale Aggregation module, where the scales will be weighted and combined in the next stage.

Consequently, Scale-Aware Transformer Fusion provides a dynamically learned fusion mechanism for each location, as opposed to a fixed aggregation of multi-scale features. This approach contributes to a more balanced representation, particularly in LiDAR scenes where different geometric structures coexist.

3.4.2. Adaptive Scale Aggregation

The Scale-Aware Transformer Fusion module generates updated token representations that incorporate cross-scale interactions for each centroid point. The output obtained at this stage is expressed as follows:

X_{enc} \in R^{(B \cdot S) \times T \times C}

Here, three scale tokens are present for each centroid point (T = 3). These tokens are reduced to a single feature vector by a weighted aggregation mechanism termed Adaptive Scale Aggregation.

The purpose of this mechanism is to determine which of the three scales is more important for each centroid point and to perform the aggregation accordingly. The process is facilitated by a learnable gating structure.

Initially, a score value is calculated for each token:

l_{t} = \mathrm{TokenGate}(x_{enc}^{(t)})

This process is performed using a small network consisting of linear layers and activation functions:

\mathrm{TokenGate}: R^{C} \to R

The obtained scores are then normalised using the softmax function and converted into weights:

w_{t} = \frac{e^{l_{t}}}{\sum_{t' = 1}^{T} e^{l_{t'}}}

Using these weights, the three scale tokens are combined:

f_{fused} = \sum_{t = 1}^{T} w_{t} \cdot x_{enc}^{(t)}

Consequently, a single feature vector is obtained for each centroid point:

f_{fused} \in R^{C}

Applying this process to all centroid points yields the following output:

F_{fused} \in R^{B \times S \times C}

The hyperparameters of the fusion modules, including the number of centroids, token dimensions, and Transformer settings (e.g., number of heads and feedforward dimensions), vary across encoder levels and are summarized in Table 4.

This structure enables the model to weight the scales dynamically for each location instead of relying on a fixed aggregation rule. The model can thus assign greater weight to small-radius (detail-focused) features in some regions and to large-radius (overall structure-focused) features in others.
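The gating and weighted summation can be sketched in NumPy as follows. The gate is assumed here to be a two-layer network with a ReLU activation and a hidden width of 4; these choices, the random parameters, and the function names are illustrative assumptions, since the exact TokenGate layout is not specified in this section.

```python
import numpy as np

def token_gate(x, w1, b1, w2, b2):
    """Hypothetical gate: linear -> ReLU -> linear, mapping R^C to one score."""
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return (hidden @ w2 + b2)[..., 0]            # (N, T) scores l_t

def adaptive_scale_aggregation(x_enc, params):
    """x_enc: (N, T, C) encoder output. Returns fused (N, C) and weights (N, T)."""
    logits = token_gate(x_enc, *params)
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)        # softmax over the T scales
    fused = (weights[..., None] * x_enc).sum(axis=1)     # weighted sum of tokens
    return fused, weights

rng = np.random.default_rng(0)
B, S, T, C, H = 2, 16, 3, 8, 4                   # illustrative sizes only
x_enc = rng.standard_normal((B * S, T, C))
params = (rng.standard_normal((C, H)), np.zeros(H),
          rng.standard_normal((H, 1)), np.zeros(1))
fused, weights = adaptive_scale_aggregation(x_enc, params)
F_fused = fused.reshape(B, S, C)
```

The returned `weights` array holds one weight per scale for every centroid, which is the quantity that can be inspected to see which scale the gate favours at a given location.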

The proposed Adaptive Scale Aggregation mechanism is distinct from fixed fusion methods, such as classical concatenation or averaging. While fixed methods treat all scales equally, in this approach, the contribution of each scale varies depending on the data content. This approach offers a more flexible representation, particularly in LiDAR scenes where different geometric structures coexist.

In addition, the weights generated by the gating mechanism can be used for analysis: they show which scales the model prefers in specific regions, providing additional information that enhances interpretability.

The features obtained in the Set Abstraction layers and the proposed fusion modules are propagated back to higher-resolution point representations via standard Feature Propagation (FP) layers as defined in the original PointNet++ architecture. These layers enable point-level predictions by interpolating low-resolution features into denser point sets.
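The interpolation performed by a Feature Propagation layer can be sketched as inverse-distance-weighted interpolation over the three nearest low-resolution points, which is the scheme described in the original PointNet++ paper. The function name, the array sizes, and the synthetic data are assumptions for illustration; the skip connections and per-point MLP that follow the interpolation are omitted.

```python
import numpy as np

def interpolate_features(dense_xyz, sparse_xyz, sparse_feats, k=3, eps=1e-8):
    """Propagate features from sparse points to dense points.

    Weights are proportional to 1 / squared distance to the k nearest sparse points.
    """
    diff = dense_xyz[:, None, :] - sparse_xyz[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)                        # (N_dense, N_sparse)
    idx = np.argsort(d2, axis=1)[:, :k]                  # k nearest sparse points
    nn_d2 = np.take_along_axis(d2, idx, axis=1)
    w = 1.0 / (nn_d2 + eps)
    w /= w.sum(axis=1, keepdims=True)
    return (sparse_feats[idx] * w[..., None]).sum(axis=1)

rng = np.random.default_rng(0)
dense = rng.random((64, 3))
sparse = dense[:16]                      # the sparse set is a subset of the dense set
feats = rng.standard_normal((16, 5))
dense_feats = interpolate_features(dense, sparse, feats)
```

Dense points that coincide with a sparse point recover that point's feature almost exactly, while the remaining points receive a blend of their three nearest neighbours.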

The final stage of the model consists of a segmentation head, where class probabilities are computed for each point using multi-layer perceptron (MLP) layers.

In this study, semantic segmentation was performed using four classes: Unclassified, Vegetation, Ground, and Building. The model output is a vector representing the class probabilities for each point.

P (y_{i} = c) = \frac{e^{z_{c}}}{\sum_{j = 1}^{C} e^{z_{j}}}

In this context, z_c denotes the class logit value, while C signifies the total number of classes.

The proposed architecture aims to learn a more powerful representation of the point cloud by combining PointNet++ MSG-based multi-scale feature extraction with a transformer-based feature integration mechanism. Unlike traditional PointNet++ MSG architectures that rely on fixed feature concatenation, the proposed model introduces a transformer-based fusion mechanism that enables adaptive weighting of multi-scale features. This allows the model to dynamically focus on the most informative spatial scales depending on the input structure, which is particularly beneficial for complex classes such as vegetation and building boundaries.

3.5. Training Configuration

The proposed model and comparison models were trained using a PyTorch-based training pipeline. The Adam optimizer was utilised during the training process. The initial learning rate was set to 0.001, and weight decay was set to 1 × 10⁻⁴. L2 regularization was applied to prevent overfitting of the model parameters.

The Adam optimization algorithm performs parameter updates using first- and second-moment estimates.

During training, the batch size was set to 8 and the maximum training duration to 50 epochs. To avoid continuing training once model performance stopped improving, early stopping was implemented: the mIoU on the validation dataset was monitored, and training was stopped if no improvement was observed for 15 consecutive epochs.
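The early stopping rule can be expressed as a small standard-library sketch. The function name and the synthetic validation history are hypothetical; only the patience of 15 epochs is taken from the text.

```python
def early_stopping_epoch(val_miou_history, patience=15):
    """Return (stop_epoch, best_epoch) for a sequence of validation mIoU values."""
    best, best_epoch = float("-inf"), -1
    for epoch, miou in enumerate(val_miou_history):
        if miou > best:
            best, best_epoch = miou, epoch       # new best: checkpoint would be saved
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch             # patience exhausted
    return len(val_miou_history) - 1, best_epoch

# Synthetic history: improvement for three epochs, then a plateau.
history = [0.30, 0.40, 0.50] + [0.45] * 20
stop_epoch, best_epoch = early_stopping_epoch(history)
```

In this synthetic run the best value occurs at epoch 2 and training stops at epoch 17, after 15 consecutive epochs without improvement.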

The learning rate was updated throughout training using the CosineAnnealingWarmRestarts scheduler. This scheduler decreases the learning rate along a cosine curve and resets it to its initial value (a warm restart) at designated epoch intervals, which helps the model avoid stagnating in local minima. The scheduler parameters were set as follows:

T₀ = 10;
T_mult = 2;
η_min = 10⁻⁶.
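The resulting learning-rate curve can be reproduced with a standard-library sketch of the cosine annealing rule with warm restarts, assuming the scheduler is stepped once per epoch. The function name is hypothetical; the parameter values are those listed above together with the initial learning rate of 0.001.

```python
import math

def cosine_warm_restart_lr(epoch, base_lr=1e-3, eta_min=1e-6, t_0=10, t_mult=2):
    """Learning rate at a given epoch for cosine annealing with warm restarts."""
    t_i, t_cur = t_0, epoch
    # Walk through completed cycles; each cycle is t_mult times longer than the last.
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))

schedule = [cosine_warm_restart_lr(e) for e in range(50)]
restarts = [e for e in range(1, 50) if schedule[e] > schedule[e - 1]]
```

With T₀ = 10 and T_mult = 2, the cycle lengths are 10, 20, and 40 epochs, so within a 50-epoch budget the learning rate returns to 0.001 at epochs 10 and 30.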

Training was conducted on an RTX 5070 Ti GPU (16 GB VRAM). Model training ran on the GPU, with the CPU used exclusively for data loading.

3.6. Loss Functions

In this study, several loss functions were evaluated to determine their impact on model training performance. Dice Loss was used as the primary loss function. It is particularly advantageous for semantic segmentation problems characterised by class imbalance, because it measures region overlap and is therefore closely related to the Intersection-over-Union (IoU) metric.

The Dice coefficient is calculated as follows:

\mathrm{Dice}_{c} = \frac{2 \sum (\hat{P}_{c} y_{c}) + \epsilon}{\sum \hat{P}_{c} + \sum y_{c} + \epsilon}

In this context,

\hat{P}_{c}

denotes the probability predicted by the model for class c, y_c denotes the true class label, and ε is a small smoothing constant. The total Dice Loss is obtained from the average of the Dice coefficients calculated for all classes; by the usual convention, the loss is one minus this average.
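A minimal NumPy sketch of the per-class Dice coefficient and the resulting loss is given below. It assumes the conventional definition of the loss as one minus the class-averaged Dice coefficient and a smoothing constant of 10⁻⁶; both are assumptions, as is the function name.

```python
import numpy as np

def dice_loss(probs, labels, num_classes, eps=1e-6):
    """probs: (N, K) predicted probabilities; labels: (N,) integer class labels."""
    one_hot = np.eye(num_classes)[labels]                # (N, K) targets y_c
    intersection = (probs * one_hot).sum(axis=0)
    denominator = probs.sum(axis=0) + one_hot.sum(axis=0)
    dice = (2.0 * intersection + eps) / (denominator + eps)
    return 1.0 - dice.mean(), dice

labels = np.array([0, 1, 2, 3, 0, 1])
perfect = np.eye(4)[labels]                  # predictions identical to the labels
uniform = np.full((6, 4), 0.25)              # uninformative predictions
loss_perfect, _ = dice_loss(perfect, labels, 4)
loss_uniform, dice_uniform = dice_loss(uniform, labels, 4)
```

Because every class contributes equally to the average regardless of how many points it contains, errors on rare classes are not swamped by the dominant classes.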

In addition to Dice Loss, a Combined Loss function that pairs Dice Loss with Cross-Entropy (CE) Loss was also evaluated. The total loss is calculated as follows:

L = \alpha \cdot \mathrm{CE} + (1 - \alpha) \cdot \mathrm{DiceLoss}

In the present study, the value of α was set at 0.5.

Furthermore, the FocalDice loss function, which is based on Focal Loss, was evaluated experimentally in order to place greater emphasis on learning from difficult examples. Focal Loss down-weights well-classified examples so that misclassified, hard examples contribute relatively more to the loss.

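\mathrm{FocalLoss} = - \sum_{i = 1}^{n} (1 - p_{i})^{\gamma} \log_{b}(p_{i})

Here, p_i is the predicted probability of the true class for point i, n is the number of points, and γ is the focusing parameter.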
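The effect of the modulating factor can be checked with a short NumPy sketch of the focal term. The value γ = 2 is a commonly used default and is an assumption here, as are the natural logarithm, the function name, and the toy probabilities; how the focal term is combined with Dice Loss in the FocalDice configuration is not shown.

```python
import numpy as np

def focal_loss(probs, labels, gamma=2.0, eps=1e-12):
    """Summed focal loss. probs: (N, K) probabilities; labels: (N,) true classes."""
    p_true = probs[np.arange(len(labels)), labels]       # p_i for the true class
    return float(-(((1.0 - p_true) ** gamma) * np.log(p_true + eps)).sum())

probs = np.array([[0.9, 0.1],      # easy point: true class predicted with 0.9
                  [0.3, 0.7]])     # hard point: true class predicted with 0.3
labels = np.array([0, 0])

easy = focal_loss(probs[:1], labels[:1])
hard = focal_loss(probs[1:], labels[1:])
plain_ce = focal_loss(probs, labels, gamma=0.0)          # gamma = 0 gives cross-entropy
```

With γ = 2 the easy point is scaled by (1 − 0.9)² = 0.01 and the hard point by (1 − 0.3)² = 0.49, so the hard point dominates the total.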

3.7. Evaluation Metrics

In order to evaluate the performance of the model, the Intersection over Union (IoU), Overall Accuracy (OA), mean IoU (mIoU), Precision, Recall, and F1-score metrics were employed. These metrics are widely utilised within the domain of semantic segmentation.

The IoU value for a class is calculated as follows:

\mathrm{IoU}_{c} = \frac{TP_{c}}{TP_{c} + FP_{c} + FN_{c}}

In this context, TP_c, FP_c, and FN_c denote the numbers of true positive, false positive, and false negative points for class c, respectively.

The mean intersection over union (mIoU), which is widely used in semantic segmentation tasks, is the average IoU value calculated for all classes.

\mathrm{mIoU} = \frac{1}{K} \sum_{c = 1}^{K} \mathrm{IoU}_{c}

where K is the total number of classes.

Precision, Recall, and F1-score for each class are defined as:

\mathrm{Precision}_{c} = \frac{TP_{c}}{TP_{c} + FP_{c}}

\mathrm{Recall}_{c} = \frac{TP_{c}}{TP_{c} + FN_{c}}

\mathrm{F1}_{c} = \frac{2 \cdot \mathrm{Precision}_{c} \cdot \mathrm{Recall}_{c}}{\mathrm{Precision}_{c} + \mathrm{Recall}_{c}}

The classification task considered here comprises four classes (unclassified, vegetation, ground, and building), so K = 4.

Furthermore, the Overall Accuracy (OA) metric was calculated in order to evaluate the model’s overall accuracy performance.

\mathrm{OA} = \frac{\text{Number of correctly classified points}}{\text{Total number of points}}

All metrics were computed on a per-point basis, and class-wise values were reported to provide a detailed evaluation of model performance across different semantic categories. In addition to overall metrics, per-class IoU, Precision, Recall, and F1-score values were also analyzed.

To ensure numerical stability, a small constant was added to denominators during metric computation.

Overall Accuracy (OA) is computed as a global metric over all points. In contrast, IoU, Precision, Recall, and F1-score are defined per class and subsequently averaged across classes (macro-average) to obtain mIoU, mean Precision, mean Recall, and mean F1-score. This distinction allows the evaluation to capture both overall performance and class-wise behavior, which is particularly important in imbalanced point cloud datasets.
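The relationship between the global and macro-averaged metrics can be illustrated with a NumPy sketch that derives all of them from a confusion matrix. The function name, the stabilising constant of 10⁻¹⁰, and the eight toy labels are assumptions for illustration and are not data from this study.

```python
import numpy as np

def segmentation_metrics(y_true, y_pred, num_classes, eps=1e-10):
    """Per-point metrics from integer label arrays of equal length."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)           # rows: true class, columns: predicted
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    iou = tp / (tp + fp + fn + eps)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return {
        "iou": iou,
        "miou": iou.mean(),                      # macro-average over classes
        "precision": precision.mean(),
        "recall": recall.mean(),
        "f1": f1.mean(),
        "oa": tp.sum() / cm.sum(),               # global over all points
    }

y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])      # toy labels
y_pred = np.array([0, 1, 1, 1, 2, 2, 3, 0])
metrics = segmentation_metrics(y_true, y_pred, num_classes=4)
```

In this toy case the per-class IoU values are 1/3, 2/3, 1, and 1/2, giving an mIoU of 0.625, whereas OA is 6/8 = 0.75; the two numbers differ because OA counts points while mIoU weights every class equally.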

3.8. Ensemble and Test-Time Augmentation

In order to enhance the performance of the model and improve the robustness of the predictions, the ensemble method and test-time augmentation (TTA) were applied in combination. The ensemble approach is predicated on the principle of combining the outputs of multiple models trained under different configurations. The objective of this method is to generate more reliable predictions without being constrained by the errors of a single model.

In this study, the PointNet2 MSG Transformer architecture was trained under several training configurations, producing multiple model variants (Figure 6). Following the experimental evaluations, the three models with the best performance on the validation dataset were selected for the ensemble (Table 5).

Because these models differ in architectural configuration and loss function, each produces different error patterns. The ensemble approach exploits this diversity to generate more balanced predictions.

During the ensemble phase, each model generates independent predictions for the same input point cloud. The class probabilities generated by each model are then combined using a weighted average to produce the final class prediction.

The ensemble prediction can be expressed as follows:

P_{ensemble}(c) = \sum_{i = 1}^{M} w_{i} P_{i}(c)

Here, M denotes the number of models used in the ensemble and P_i(c) denotes the probability generated by the i-th model for class c. The w_i coefficients represent the weights assigned to each model.

The following weight values were used based on the experimental evaluations:

w = (0.5, 0.3, 0.2)

These weights correspond to the ‘dice_only’, ‘deeper_head’ and ‘combined’ models, respectively.
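The weighted averaging can be sketched in NumPy as follows. The weights are those reported above; the function name and the per-model probabilities for a single point are invented purely to make the arithmetic visible.

```python
import numpy as np

def weighted_ensemble(prob_list, weights):
    """Weighted average of per-model class probabilities, each of shape (N, K)."""
    weights = np.asarray(weights, dtype=float)
    stacked = np.stack(prob_list, axis=0)                # (M, N, K)
    return np.tensordot(weights, stacked, axes=1)        # (N, K)

# Made-up probabilities for one point and four classes.
p_dice_only = np.array([[0.7, 0.1, 0.1, 0.1]])
p_deeper_head = np.array([[0.2, 0.6, 0.1, 0.1]])
p_combined = np.array([[0.1, 0.7, 0.1, 0.1]])

ensemble = weighted_ensemble([p_dice_only, p_deeper_head, p_combined],
                             weights=(0.5, 0.3, 0.2))
predicted_class = ensemble.argmax(axis=1)
```

For this point the ensemble probability of class 0 is 0.5 · 0.7 + 0.3 · 0.2 + 0.2 · 0.1 = 0.43 and that of class 1 is 0.37, so the most heavily weighted model decides the outcome even though the other two models prefer class 1. Since the weights sum to one, the result remains a valid probability distribution.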

In addition to the ensemble method, Test-Time Augmentation (TTA) was employed to enhance the stability of the model’s predictions. In the TTA approach, the input point cloud is fed into the model multiple times after undergoing various transformations, with the resulting predictions then being combined. In this study, five different augmentations were applied to each point cloud.

The model was rerun for each augmentation, and the final prediction was obtained by averaging the resulting prediction probabilities.
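The averaging over augmented inputs can be sketched in NumPy. The five augmentations are assumed here to be rotations about the vertical axis, which is only one plausible choice and is not stated in this section; `dummy_model` is a stand-in for a trained network, and all names are hypothetical.

```python
import numpy as np

def rotate_z(points, angle):
    """Rotate the xyz columns about the vertical axis; other features are kept."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    out = points.copy()
    out[:, :3] = points[:, :3] @ rot.T
    return out

def predict_with_tta(model_fn, points, angles):
    """Average class probabilities over all augmented copies of the input."""
    probs = [model_fn(rotate_z(points, a)) for a in angles]
    return np.mean(probs, axis=0)

def dummy_model(points):
    """Stand-in model: class scores depend on height (z) only."""
    logits = np.stack([points[:, 2] * k for k in range(4)], axis=1)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
points = rng.random((4096, 7))                           # x, y, z, r, g, b, intensity
angles = np.linspace(0.0, 2.0 * np.pi, 5, endpoint=False)   # five assumed rotations
tta_probs = predict_with_tta(dummy_model, points, angles)
```

The stand-in model is rotation-invariant, so its averaged output equals a single prediction; for a real network the augmented predictions differ slightly, and averaging them reduces that variance.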

In the final stage, the ensemble and TTA outputs were combined to obtain the final predictions (Table 6). This approach improved the model’s overall performance, producing more balanced predictions, particularly in cases of class imbalance.

The results demonstrate that using ensemble models and test-time augmentation improves model performance. Notably, the IoU score for the “building” class increased by around 2.12 percentage points.

4. Experiments and Results

This section evaluates the performance of the proposed PointNet2 MSG Transformer model on the problem of LiDAR point cloud semantic segmentation. Experiments were conducted on a four-class semantic segmentation problem involving unclassified areas, vegetation, ground and buildings. Model performance was analysed using the mIoU and overall accuracy metrics. A comprehensive ablation study was also conducted to investigate the impact of different loss functions, architectural changes and inference strategies on performance.

4.1. Experimental Setup

The experiments were conducted using a PyTorch-based training infrastructure. All models were trained using the Adam optimisation algorithm with an initial learning rate of 0.001. The Cosine Annealing with Warm Restarts scheduler was used for the developed model to ensure more stable updates to the learning rate.

During model training, the batch size was set to 8, and 4096 points were used per training sample. During training, point cloud samples were reduced to a fixed 4096 points using either random subsampling if the number of points exceeded 4096, or padding with repeated samples if the number of points was fewer than 4096. Early stopping was applied during training, with the model checkpoint yielding the best validation performance being saved.
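The fixed-size sampling rule can be sketched in NumPy. The function name and the synthetic inputs are assumptions; the sketch only mirrors the two cases described above (random subsampling and padding with repeated points).

```python
import numpy as np

def resample_points(points, n_target=4096, rng=None):
    """Return exactly n_target rows of points (shape (N, F))."""
    rng = np.random.default_rng() if rng is None else rng
    n = points.shape[0]
    if n >= n_target:
        # Too many points: random subsampling without replacement.
        idx = rng.choice(n, n_target, replace=False)
    else:
        # Too few points: keep all of them and pad with repeated samples.
        extra = rng.choice(n, n_target - n, replace=True)
        idx = np.concatenate([np.arange(n), extra])
    return points[idx]

rng = np.random.default_rng(0)
large = rng.random((10000, 7))           # synthetic tile with too many points
small = rng.random((1000, 7))            # synthetic tile with too few points
large_fixed = resample_points(large, rng=rng)
small_fixed = resample_points(small, rng=rng)
```

Both outputs have shape (4096, 7), so samples of different original sizes can be stacked into one batch.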

The model input features consist of a seven-dimensional feature vector: (x, y, z, r, g, b, intensity). Here, x, y, and z represent coordinate information, r, g, and b represent color values, and intensity represents the LiDAR intensity value. Model performance was evaluated using the following metrics in all experiments: Mean Intersection over Union (mIoU), Overall Accuracy, Precision, Recall and F1-score. Although Precision, Recall, and F1-score are defined in Section 3.7, the evaluation in this study primarily focuses on mIoU and overall accuracy, as these are the standard metrics used in LiDAR semantic segmentation benchmarks.

4.2. Baseline Model Comparison

The first phase of the study involved comparing the performance of different deep learning architectures on LiDAR point cloud segmentation. Three different traditional model architectures (PointNet, PointNet++ MSG, and VoxelNet Lite) were evaluated alongside recent state-of-the-art architectures such as Point Transformer, KPConv, and RandLA-Net for comparison.

The results show that while the proposed method attains the highest overall mIoU (51.74%), the architectures differ in their performance on specific classes. For instance, Point Transformer achieves the best performance in the building class (48.28% IoU), whereas KPConv performs better on vegetation due to its deformable convolution mechanism. These findings suggest that different architectural designs capture distinct geometric properties of LiDAR data (Table 7).

The proposed model achieves the highest overall performance among the compared architectures, reaching 51.74% mIoU and 61.50% accuracy. Compared to PointNet++ MSG, the model provides a clear improvement in overall segmentation performance. While Point Transformer achieves higher performance in the building class, the proposed method demonstrates a more balanced performance across all classes. This indicates that the integration of multi-scale features with transformer-based fusion contributes to more stable predictions in complex scenes.

These models were trained using the same dataset and similar training parameters, and their performance was compared. Initial experiments revealed that the PointNet++ MSG model performed better than the other architectures (Figure 7 and Figure 8). In particular, it was observed that the model could learn geometric features at different scales more effectively thanks to its multi-scale grouping mechanism.

Consequently, subsequent stages of the study involved developing a new model based on the PointNet++ MSG architecture.

To ensure a fair comparison, all models were trained using the same dataset splits, point sampling strategy (4096 points), and preprocessing pipeline, including normalization and feature standardization. The same maximum training budget (50 epochs) and early stopping criterion, based on the validation mIoU, were applied to all models.

However, architecture-specific training strategies were employed to ensure optimal performance for each model. While the baseline models were trained using cross-entropy loss with StepLR scheduling, the proposed PointNet2 MSG Transformer was trained using Dice loss and cosine annealing with warm restarts.

Although the optimisation strategies differed, all models were evaluated in the same way on a shared test dataset, which ensures internal consistency and supports a reliable comparison.

4.3. Ablation Study

A comprehensive ablation study was conducted to evaluate how different training configurations affect the performance of the proposed model. Various loss functions and architectural variations were tested as part of this study.

The results show that using Dice Loss clearly improves the performance of the model. The model using Dice Loss achieved an mIoU approximately 2.79 percentage points higher than the baseline model using Cross-Entropy Loss (Table 8).

However, increasing the number of segmentation head layers did not result in the expected improvement in performance. Although a deeper head structure increased the number of model parameters, this did not result in a significant improvement in overall performance. Similarly, combining Focal Loss and Dice Loss caused a significant drop in model performance.

Overall, these results suggest that the Dice Loss function is more suitable for LiDAR point cloud segmentation (Figure 9 and Figure 10).

To further improve the performance of the model, ensemble and test-time augmentation methods were employed. The ensemble method involved combining the predictions of three models that had been trained using different configurations.

The models used in the ensemble are as follows:

- A model trained using Dice Loss;
- A model with a DeeperHead architecture;
- A model using a combined loss.

The model predictions were then combined using the weighted average method.

4.4. Effect of Ensemble and Test-Time Augmentation

As shown in Table 6, the performance of the model improves when ensemble and test-time augmentation methods are used. Specifically, when ensemble and TTA are employed together, the model’s mIoU increases to 51.74%, an improvement of approximately 4.03 percentage points over the baseline model.

Analysis by class revealed that the greatest improvement was seen in the ‘building’ class. Using the ensemble method, the IoU value for the ‘building’ class increased by approximately 2.12 percentage points.

Figure 11, Figure 12 and Figure 13 show how different prediction strategies affect model performance. Using a single model yielded an mIoU value of 50.50%, whereas applying test-time augmentation increased this value to 51.15%. Using the ensemble method increased the mIoU value to 51.55%, and using the ensemble method in combination with test-time augmentation achieved the highest performance of 51.74%.

A similar pattern emerged in the accuracy metric. The best single model achieved an accuracy of 60.83%, whereas the combination of ensemble and test-time augmentation increased this figure to 61.50%.

Figure 14 shows the class-based IoU results. The final model showed slight improvements across all classes. Notably, the IoU value for the ‘building’ class increased by 2.12 percentage points, rising from 39.9% to 42.0%.

4.5. Qualitative Segmentation Results

In addition to the quantitative metrics, visual comparisons were created using selected point cloud segments from the test dataset, in order to analyse the spatial behaviour of the model outputs. These comparisons are presented in four panels for the same samples: the input point cloud; the ground truth labels; the prediction of the best single model (the model trained using Dice Loss); and the prediction of the final method (ensemble + test-time augmentation).

Figure 15a,b show a comparative display of the input point cloud for a test sample, the ground truth labels, the best single-model prediction (a model trained using Dice Loss) and the final model prediction (an ensemble model with test-time augmentation). Figure 16 shows a comparative display of the input point cloud for a test sample, the ground truth labels, PointNet++ MSG, PointNet2 MSG Ensemble + TTA prediction, RandLA-Net, Point Transformer, KPConv. In the visualisations, each class is represented by a different colour: ‘Unclassified’ is shown in grey, ‘Vegetation’ in green, ‘Ground’ in brown, and ‘Building’ in red.

Analysis of the visual results shows that the proposed method is highly effective in accurately identifying building boundaries. The final model utilises ensemble and test-time augmentation to produce more consistent class distributions and reduce misclassifications compared to a single model’s predictions. Specifically, roof surfaces are segmented more comprehensively in the building class, and errors resulting from overlap with the ground are reduced.

Additionally, despite having the highest point density in the dataset, the vegetation class is one of the most challenging to segment due to its geometric diversity. Examining the model predictions reveals that errors may occur in some boundary regions between the vegetation and ground classes. However, the final model produces more balanced results in these regions.

Figure 17 illustrates the training behaviour of the proposed model. The training loss decreases consistently across epochs, indicating that the network progressively learns discriminative geometric features from the LiDAR point clouds. The steepest decline in loss occurs during the initial training stages, particularly within the first ten epochs, after which the decrease becomes more gradual.

Concurrently, the validation mIoU demonstrates an overall increasing trend, rising from approximately 35% in the initial epochs to around 50% in the later stages of training. Despite minor fluctuations observed during intermediate periods, the overall pattern suggests stable learning behaviour.

It is important to note that the validation curve remains stable without severe overfitting, suggesting that the model maintains a balanced learning process and is able to generalise effectively to unseen data. The learning curves demonstrate that the model attains a relatively stable convergence after approximately 15–20 epochs.

It is therefore concluded that the proposed PointNet2 MSG Transformer architecture and the applied ensemble strategy provide an effective approach for multi-class LiDAR point cloud segmentation.

4.6. Error Analysis

To examine the model’s class-based performance in more detail, a confusion matrix was calculated for the final model. The confusion matrix in Figure 18, derived from the test dataset, is normalized by row.
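Row normalization can be sketched in NumPy as follows. The counts are made-up numbers used only to show the operation and are unrelated to the values in Figure 18; the function name is hypothetical.

```python
import numpy as np

def row_normalize(cm):
    """Divide each row by its sum so that entry (i, j) is the share of
    true-class-i points predicted as class j. Empty rows stay at zero."""
    cm = np.asarray(cm, dtype=float)
    row_sums = cm.sum(axis=1, keepdims=True)
    return np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums > 0)

# Made-up counts: rows are true classes, columns are predicted classes.
counts = np.array([[8, 1, 1, 0],
                   [0, 5, 5, 0],
                   [1, 0, 9, 0],
                   [0, 0, 0, 0]])
normalized = row_normalize(counts)
```

After normalization, the diagonal entries correspond to per-class recall, which is why each row can be read directly as "where do the points of this true class end up".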

The confusion matrix shows that some classes are predicted more accurately than others. The “Unclassified” class is correctly classified in 90% of cases, and the “Ground” class in 85% of cases (Table 9).

A class-wise evaluation shows strong performance for the ground and unclassified classes and relatively lower performance for the vegetation and building classes. This behavior is consistent with the geometric complexity and class imbalance present in the dataset.

The “Vegetation” class has a considerably lower rate: approximately 45% of its points are correctly classified, while 48% are predicted as “Ground”. This finding suggests that the geometric similarity between vegetation and the ground may impede the model’s ability to differentiate between them.

The correct classification rate for the “Building” class is 64%, while 33% of the building points are predicted as vegetation. This finding suggests that class confusion may occur, particularly in areas with trees and vegetation around buildings.

The confusion matrix results demonstrate that the model exhibits a high degree of accuracy in distinguishing between the Ground and Building classes, while the Vegetation class frequently exhibits confusion with the other classes.

5. Discussion

In this study, the Scale-Aware Transformer Fusion and Adaptive Scale Aggregation modules were integrated into the PointNet++ MSG architecture to address the semantic segmentation problem in aerial LiDAR point clouds. The findings indicate that the proposed method improves overall performance compared with the baseline model (mIoU: 51.74%). While this improvement may appear limited in absolute terms, it is nonetheless valuable because the model delivers more balanced performance across classes. Given the large number of points in point cloud datasets, even small increases in mean metrics (mIoU, IoU, accuracy) can correspond to substantial changes in the total number of errors.

The improvement can be attributed to content-dependent processing of multi-scale features rather than their fixed combination. In the classical PointNet++ MSG approach, features obtained from neighbourhoods of different radii are combined directly; in this study, each scale is instead treated as a token within the Transformer, and cross-scale relationships are modelled explicitly. This design allows the information from different scales to be weighted adaptively for each centroid point, which may help represent the varied geometric patterns found in heterogeneous LiDAR scenes.
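The idea of treating each scale as a token and then weighting the scales adaptively can be sketched for a single centroid as follows. This is a simplified illustration rather than the authors' implementation: it uses one attention head with identity projections, a toy feature dimension of 4, and a hypothetical `gate_vector` standing in for the learned gating layer.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(tokens):
    """Single-head scaled dot-product attention with identity projections (a simplification)."""
    scale = math.sqrt(len(tokens[0]))
    out = []
    for q in tokens:
        weights = softmax([dot(q, k) / scale for k in tokens])
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(len(q))])
    return out

def fuse_scales(tokens, gate_vector):
    """Attend across the scale tokens, then combine them with softmax gate weights w_t."""
    attended = self_attention(tokens)
    gate_weights = softmax([dot(tok, gate_vector) for tok in attended])
    dim = len(tokens[0])
    fused = [sum(w * tok[d] for w, tok in zip(gate_weights, attended)) for d in range(dim)]
    return fused, gate_weights

# One centroid with T = 3 scale tokens (small, medium, large radius); toy values.
tokens = [
    [0.2, 0.1, 0.0, 0.5],
    [0.4, 0.3, 0.1, 0.2],
    [0.9, 0.8, 0.7, 0.1],
]
gate_vector = [1.0, 0.5, -0.5, 0.0]  # hypothetical stand-in for a learned gate layer

fused, w = fuse_scales(tokens, gate_vector)
print("gate weights:", [round(x, 3) for x in w])
print("fused feature:", [round(x, 3) for x in fused])
```

Because the gate weights depend on the token contents, two centroids with different local geometry receive different mixtures of the three scales, which is the contrast with fixed combination drawn above.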

Class-wise analysis gives a fuller picture of the model’s behaviour. Although the proposed method does not attain the highest IoU in every class, it produces balanced results overall. The gain is most visible in the vegetation class (40.40% IoU versus 25.08% for PointNet++ MSG; Table 7), which can be attributed to the model’s capacity to represent irregular, multi-scale geometry. Conversely, some attention-based methods perform better in the building class; Point Transformer, for example, reaches 48.28% IoU. A plausible explanation is that building surfaces are predominantly regular and planar, so architectures that model spatial relationships directly have an advantage there. The proposed approach, by contrast, focuses on the adaptive fusion of multi-scale local features rather than on global attention across the entire point cloud.

The model’s behaviour in complex areas may be related to its joint use of multi-scale features. At class boundaries, such as those between buildings and vegetation, fine geometric detail and broader contextual information must be used together. Because the fusion mechanism models the interaction between scales, it has the potential to improve predictions in such regions. In addition, test-time augmentation (TTA) and ensembling are expected to reduce prediction variance, particularly in regions where the model is uncertain.
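The variance-reduction argument for TTA and ensembling can be illustrated with a small sketch in which per-point class probabilities are averaged over augmented copies of a tile and then over several models. Everything specific here is an assumption made for illustration: the rotations about the vertical axis, the ensemble weights, and the two toy "models" are hypothetical and do not reproduce the paper's configuration.

```python
import math

def rotate_z(points, angle):
    """Rotate (x, y, z) points about the vertical axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]

def average(prob_sets, weights):
    """Weighted mean of several per-point probability tables."""
    total = sum(weights)
    n_points, n_classes = len(prob_sets[0]), len(prob_sets[0][0])
    return [[sum(w * ps[p][k] for w, ps in zip(weights, prob_sets)) / total
             for k in range(n_classes)] for p in range(n_points)]

def tta_predict(model, points, angles):
    """Average one model's predictions over rotated copies of the same tile."""
    runs = [model(rotate_z(points, a)) for a in angles]
    return average(runs, [1.0] * len(runs))

def ensemble_tta(models, weights, points, angles):
    """Apply TTA per model, then take a weighted average across models."""
    per_model = [tta_predict(m, points, angles) for m in models]
    probs = average(per_model, weights)
    labels = [max(range(len(row)), key=row.__getitem__) for row in probs]
    return probs, labels

def make_toy_model(threshold, tilt):
    """Hypothetical 2-class scorer: class 1 becomes likelier with height z."""
    def model(points):
        out = []
        for x, _, z in points:
            p = 1.0 / (1.0 + math.exp(-(z - threshold + tilt * x)))
            out.append([1.0 - p, p])
        return out
    return model

points = [(0.0, 0.0, 0.5), (1.0, 0.0, 2.1), (0.0, 1.0, 5.0)]
angles = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]   # assumed augmentations
models = [make_toy_model(2.0, 0.1), make_toy_model(3.0, -0.1)]
probs, labels = ensemble_tta(models, [0.6, 0.4], points, angles)  # assumed weights
print(labels)
```

Since point order is preserved under rotation, the per-point predictions of each augmented copy can be averaged directly without an inverse transform.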

This study has several limitations. First, the experiments used a single airborne LiDAR dataset, which limits the conclusions that can be drawn about generalisation to other data distributions. The dataset was deliberately chosen to reflect real-world conditions rather than being fully cleaned, so that the model could be evaluated against practical challenges such as noise, class imbalance, and data heterogeneity. Nevertheless, further validation on other datasets is required. Second, the proposed transformer-based fusion architecture adds computation compared with fixed fusion methods. In practice this cost remains small, because attention is computed over only a few scale tokens (T = 3).
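To make the cost argument concrete, the short calculation below counts attention-score entries for one encoder layer in each fusion block, using the centroid and head counts listed in Table 4 and T = 3. It ignores projection and feed-forward costs, and the comparison with full point-to-point attention over a 4096-point tile is a hypothetical reference point rather than a measurement from the paper.

```python
T = 3      # scale tokens per centroid
N = 4096   # points per input tile

# (block, number of centroids S, number of attention heads), as listed in Table 4
fusion_blocks = [("fuse1", 1024, 4), ("fuse2", 256, 4), ("fuse3", 64, 8), ("fuse4", 16, 8)]

def scale_attention_entries(num_centroids, num_heads, num_tokens=T):
    """Score entries for one layer: one T x T attention map per centroid and head."""
    return num_centroids * num_heads * num_tokens * num_tokens

per_block = {name: scale_attention_entries(s, h) for name, s, h in fusion_blocks}
total_entries = sum(per_block.values())
global_entries = N * N   # a single head of full point-to-point attention, for comparison

for name, entries in per_block.items():
    print(f"{name}: {entries} score entries per layer")
print("all fusion blocks:", total_entries)
print("one global attention head over N points:", global_entries)
```

Under these assumptions the four fusion blocks together need 51,840 score entries per layer, several hundred times fewer than the 16,777,216 entries of a single global attention map over all points.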

Future studies should assess the proposed method on diverse airborne LiDAR datasets to evaluate its generalisation performance. Increasing dataset diversity and testing with real-world field data would give a more accurate picture of performance in practical applications. Variants of the transformer-based fusion mechanism and more elaborate architectures could also be explored. In particular, class-balanced learning strategies and data augmentation methods designed to improve performance in the building class merit investigation.

6. Conclusions

This study addressed four-class semantic segmentation of airborne LiDAR point clouds and compared several deep learning approaches for multi-class point segmentation. The experimental results showed that the PointNet2 MSG Transformer architecture achieved the highest overall performance. The proposed approach consists of a PointNet++ MSG-based encoder; Scale-Aware Transformer Fusion modules, which are applied after each Set Abstraction layer; a decoder with four Feature Propagation layers; a Dice Loss-based training strategy; and a combination of weighted ensemble and test-time augmentation. The final system achieved an mIoU of 51.74% and an accuracy of 61.50% on the test dataset, the best overall result among the models compared.

The study examined not only the model architecture but also the effects of different training and inference strategies. The results indicate that Dice Loss is more effective than Cross-Entropy-based approaches on this class-imbalanced dataset (Table 8). Furthermore, combining ensemble learning with test-time augmentation improved performance, particularly in the ‘building’ class, so that the final system outperformed the best single model.
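One way to see why a Dice-type loss is less sensitive to class imbalance is that it averages an overlap score per class, so a rare class contributes as much as a dominant one. The sketch below uses a common soft Dice formulation; it is an assumption for illustration and may differ in detail (smoothing term, class weighting) from the loss used in the paper.

```python
def soft_dice_loss(probs, labels, num_classes, eps=1e-6):
    """Mean over classes of (1 - Dice), computed from per-point class probabilities."""
    losses = []
    for c in range(num_classes):
        intersection = sum(p[c] for p, y in zip(probs, labels) if y == c)
        pred_mass = sum(p[c] for p in probs)          # total probability assigned to class c
        true_mass = sum(1 for y in labels if y == c)  # number of points labelled c
        dice = (2.0 * intersection + eps) / (pred_mass + true_mass + eps)
        losses.append(1.0 - dice)
    return sum(losses) / num_classes

# Toy two-class tile: nine points of class 0 and one point of class 1.
labels = [0] * 9 + [1]
always_majority = [[1.0, 0.0] for _ in labels]                    # ignores the rare class
perfect = [[1.0, 0.0] if y == 0 else [0.0, 1.0] for y in labels]

loss_majority = soft_dice_loss(always_majority, labels, num_classes=2)
loss_perfect = soft_dice_loss(perfect, labels, num_classes=2)
print(f"always-majority: {loss_majority:.3f}, perfect: {loss_perfect:.3f}")
```

The always-majority prediction is 90% accurate on this toy tile, yet its Dice loss stays above 0.5 because the rare class is missed entirely; a per-point averaged loss would penalise the same mistake far less.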

The findings suggest that a multi-class semantic segmentation approach is better suited to LiDAR-based building detection problems than single-class approaches. This is because the classes ‘building’, ‘vegetation’, ‘ground’ and ‘unclassified’ coexist in real-world applications, and the boundaries between these classes directly affect model performance. In this study, the vegetation class posed the greatest challenge. Although a significant improvement was achieved in this class using Dice Loss, the results suggest that further improvements are possible.

This study has some limitations. Firstly, the experiments were conducted on a single aerial LiDAR dataset, which limits the evaluation of generalization across different data distributions. Although a real-world dataset was intentionally selected to reflect practical conditions, further validation on additional airborne LiDAR datasets is required. Additionally, the class distribution in the dataset is imbalanced, with the vegetation class particularly dominant. Therefore, it is necessary to investigate how the model performs on different datasets and under different scene conditions.

Future work will expand data augmentation strategies and employ class-balanced sampling methods. More advanced transformer-based fusion architectures will be tested, and post-processing approaches will be evaluated to enhance spatial consistency after inference. Furthermore, performance could be improved by utilising additional models obtained from different ensemble strategies and training variations.

Author Contributions

Conceptualization, H.K.S. and I.R.K.; methodology, H.K.S.; writing – original draft preparation, H.K.S. and I.R.K.; writing – review and editing, H.K.S. and I.R.K.; visualization, H.K.S.; supervision, I.R.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original data presented in the study are openly available at https://doi.org/10.5069/G9QC01D1.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Fekete, A.; Cserep, M. Tree segmentation and change detection of large urban areas based on airborne LiDAR. Comput. Geosci. 2021, 156, 104900.
2. Li, J. Research on processing and application of lidar point cloud data. In Proceedings of the International Conference on Remote Sensing, Surveying, and Mapping (RSSM 2025); Wu, J., Wang, J., Eds.; SPIE: Xi’an, China, 2025; p. 84.
3. Sharifisoraki, Z.; Dey, A.; Selzler, R.; Amini, M.; Green, J.R.; Rajan, S.; Kwamena, F.A. Monitoring Critical Infrastructure Using 3D LiDAR Point Clouds. IEEE Access 2023, 11, 314–336.
4. Uciechowska-Grakowicz, A.; Herrera-Granados, O.; Biernat, S.; Bac-Bronowicz, J. Usage of Airborne LiDAR Data and High-Resolution Remote Sensing Images in Implementing the Smart City Concept. Remote Sens. 2023, 15, 5776.
5. Yan, W.Y.; Shaker, A.; El-Ashmawy, N. Urban land cover classification using airborne LiDAR data: A review. Remote Sens. Environ. 2015, 158, 295–310.
6. Li, X.; Liu, C.; Wang, Z.; Xie, X.; Li, D.; Xu, L. Airborne LiDAR: State-of-the-art of system design, technology and application. Meas. Sci. Technol. 2021, 32, 032002.
7. Mosco, S.; Fusaro, D.; Li, W.; Menegatti, E.; Pretto, A. Point-Plane Projections for Accurate LiDAR Semantic Segmentation in Small Data Scenarios. arXiv 2025.
8. Ren, X.; Yu, B.; Wang, Y. Semantic Segmentation Method for Road Intersection Point Clouds Based on Lightweight LiDAR. Appl. Sci. 2024, 14, 4816.
9. Mei, J.; Gao, B.; Xu, D.; Yao, W.; Zhao, X.; Zhao, H. Semantic Segmentation of 3D LiDAR Data in Dynamic Scene Using Semi-Supervised Learning. IEEE Trans. Intell. Transp. Syst. 2020, 21, 2496–2509.
10. Wen, S.; Wang, T.; Tao, S. Hybrid CNN-LSTM Architecture for LiDAR Point Clouds Semantic Segmentation. IEEE Robot. Autom. Lett. 2022, 7, 5811–5818.
11. Gao, B.; Pan, Y.; Li, C.; Geng, S.; Zhao, H. Are We Hungry for 3D LiDAR Data for Semantic Segmentation? A Survey of Datasets and Methods. IEEE Trans. Intell. Transp. Syst. 2022, 23, 6063–6081.
12. Chakraborty, D.; Dey, E. Segmentation of LiDAR point cloud data in urban areas using adaptive neighborhood selection technique. PLoS ONE 2024, 19, e0307138.
13. Zou, Y.; Weinacker, H.; Koch, B. Towards Urban Scene Semantic Segmentation with Deep Learning from LiDAR Point Clouds: A Case Study in Baden-Württemberg, Germany. Remote Sens. 2021, 13, 3220.
14. Li, Q.; Du, Q.; Tian, L.; Liao, W.; Lu, G. Enhanced Semantic Segmentation of LiDAR Point Clouds Using Projection-Based Deep Learning Networks. IEEE Trans. Geosci. Remote Sens. 2025, 63, 3627917.
15. Triess, L.T.; Peter, D.; Rist, C.B.; Zollner, J.M. Scan-based Semantic Segmentation of LiDAR Point Clouds: An Experimental Study. In Proceedings of the 2020 IEEE Intelligent Vehicles Symposium (IV), Las Vegas, NV, USA; IEEE: New York, NY, USA, 2020; pp. 1116–1121.
16. Qi, C.; Su, H.; Mo, K.; Guibas, L. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2017; pp. 77–85.
17. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Available online: https://proceedings.neurips.cc/paper_files/paper/2017/file/d8bf84be3800d12f74d8b05e9b89836f-Paper.pdf (accessed on 30 January 2026).
18. Atik, M.; Duran, Z. An Efficient Ensemble Deep Learning Approach for Semantic Point Cloud Segmentation Based on 3D Geometric Features and Range Images. Sensors 2022, 22, 6210.
19. Li, Y.; Ma, L.; Zhong, Z.; Liu, F.; Chapman, M.A.; Cao, D.; Li, J. Deep Learning for LiDAR Point Clouds in Autonomous Driving: A Review. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 3412–3432.
20. Zhang, J.; Zhao, X.; Chen, Z.; Lu, Z. A Review of Deep Learning-Based Semantic Segmentation for Point Cloud. IEEE Access 2019, 7, 179118–179133.
21. Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. arXiv 2017.
22. Jhaldiyal, A.; Chaudhary, N. Semantic segmentation of 3D LiDAR data using deep learning: A review of projection-based methods. Appl. Intell. 2022, 53, 6844–6855.
23. Alonso, I.; Riazuelo, L.; Montesano, L.; Murillo, A. 3D-MiniNet: Learning a 2D Representation From Point Clouds for Fast and Efficient 3D LIDAR Semantic Segmentation. IEEE Robot. Autom. Lett. 2020, 5, 5432–5439.
24. Czajka, M.; Krupka, M.; Kubacka, D.; Janiszewski, M.R.; Belter, D. A Comparison of Segmentation Methods for Semantic OctoMap Generation. Appl. Sci. 2025, 15, 7285.
25. Chen, D.; Zhang, L.; Li, J.; Liu, R. Urban building roof segmentation from airborne lidar point clouds. Int. J. Remote Sens. 2012, 33, 6497–6515.
26. Liu, M.; Shao, Y.; Li, R.; Wang, Y.; Sun, X.; Wang, J.; You, Y. Method for extraction of airborne LiDAR point cloud buildings based on segmentation. PLoS ONE 2020, 15, e0232778.
27. Yamashita, T.; Wester, D.; Tewes, M.; Young, J.; Lombardi, J. Distinguishing Buildings from Vegetation in an Urban-Chaparral Mosaic Landscape with LiDAR-Informed Discriminant Analysis. Remote Sens. 2023, 15, 1703.
28. Liu, C.; Wang, H.; Feng, B.; Wang, C.; Lei, X.; Chang, J. Integrating Elevation Frequency Histogram and Multi-Feature Gaussian Mixture Model for Ground Filtering of UAV LiDAR Point Clouds in Densely Vegetated Areas. Remote Sens. 2025, 17, 3261.
29. Kurdi, F.T.; Amakhchan, W.; Gharineiat, Z.; Boulaassal, H.; Kharki, O.E. Contribution of Geometric Feature Analysis for Deep Learning Classification Algorithms of Urban LiDAR Data. Sensors 2023, 23, 7360.
30. Aljumaily, H.; Laefer, D.; Cuadra, D.; Velasco, M. Point cloud voxel classification of aerial urban LiDAR using voxel attributes and random forest approach. Int. J. Appl. Earth Obs. Geoinf. 2023, 118, 103208.
31. Vijaywargiya, J.; Ramiya, A. Semantic segmentation of urban airborne LiDAR data of varying landcover diversity using XGBoost. IET Comput. Vis. 2024, 19, 12334.
32. Kuprowski, M.; Drozda, P. Feature Selection for Airbone LiDAR Point Cloud Classification. Remote Sens. 2023, 15, 561.
33. Gilani, S.A.N.; Awrangjeb, M.; Lu, G. An Automatic Building Extraction and Regularisation Technique Using LiDAR Point Cloud Data and Orthoimage. Remote Sens. 2016, 8, 258.
34. Chen, D.; Wang, Y.; Zhang, L.; Kang, Z. Enhanced Local Feature Learning With Simple Offset Attention for Semantic Segmentation of Large-Scale Point Clouds. IEEE Trans. Geosci. Remote Sens. 2024, 62, 3453966.
35. Han, J.; Liu, K.; Li, W.; Zhang, F.; Xia, X. Generating Inverse Feature Space for Class Imbalance in Point Cloud Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 5778–5793.
36. Rauch, L.; Braml, T. Semantic Point Cloud Segmentation with Deep-Learning-Based Approaches for the Construction Industry: A Survey. Appl. Sci. 2023, 13, 9146.
37. Kharroubi, A. Semantic Segmentation for 3D Change Detection in Urban and Railway Context Using LiDAR Point Clouds. Ph.D. Thesis, ULiège – Université de Liège, Liège, Belgium, 2025.
38. Yang, S.; Hou, M.; Li, S. Three-Dimensional Point Cloud Semantic Segmentation for Cultural Heritage: A Comprehensive Review. Remote Sens. 2023, 15, 548.
39. Zhang, R.; Wu, Y.; Jin, W.; Meng, X. Deep-Learning-Based Point Cloud Semantic Segmentation: A Survey. Electronics 2023, 12, 3642.
40. Deng, S.; Xu, Q.; Yue, Y.; Jing, S.; Wang, Y. Individual tree detection and segmentation from unmanned aerial vehicle-LiDAR data based on a trunk point distribution indicator. Comput. Electron. Agric. 2024, 218, 108717.
41. Zhang, Q.; Peng, Y.; Zhang, Z.; Li, T. Semantic Segmentation of Spectral LiDAR Point Clouds Based on Neural Architecture Search. IEEE Trans. Geosci. Remote Sens. 2023, 61, 3284995.
42. Gu, Y.; Xiao, Z.; Li, X. A Spatial Alignment Method for UAV LiDAR Strip Adjustment in Nonurban Scenes. IEEE Trans. Geosci. Remote Sens. 2023, 61, 3281692.
43. Liu, X.; Li, J.; Nazeer, M.; Wong, M.S. Advanced point cloud completion for urban trees: A novel approach using enhanced SnowflakeNet. Urban For. Urban Green. 2025, 113, 129107.
44. Singh, R. Innovative Methods for 3D Point Cloud Processing of Large Data Sets and Its Practical Implementations. Ph.D. Thesis, University of Gloucestershire, Cheltenham, UK, 2022.
45. Santo, A.; Heredia, E.; Viegas, C.; Valiente, D.; Gil, A. Ground Segmentation for LiDAR Point Clouds in Structured and Unstructured Environments Using a Hybrid Neural–Geometric Approach. Technologies 2025, 13, 162.
46. Zhang, A.; Li, S.; Wu, J.; Li, S.; Zhang, B. Exploring Semantic Information Extraction From Different Data Forms in 3D Point Cloud Semantic Segmentation. IEEE Access 2023, 11, 61929–61949.
47. Zhang, P.; Kong, C.; Xu, Y.; Zhang, C.; Jin, J.; Li, T.; Jiang, X.; Tang, D. An Improved PointNet++ Based Method for 3D Point Cloud Geometric Features Segmentation in Mechanical Parts. Procedia CIRP 2024, 129, 25–30.
48. Kimura, M.; Shimizu, R.; Hirakawa, Y.; Goto, R.; Saito, Y. On permutation-invariant neural networks. arXiv 2024.
49. Hoanh, L. Geometric Invariance of Pointnet; Tampere University: Tampere, Finland, 2021; Available online: https://trepo.tuni.fi/bitstream/handle/10024/132838/LeHoanh.pdf?sequence=3 (accessed on 25 April 2026).
50. Haznedar, B.; Bayraktar, R.; Ozturk, A.E.; Arayici, Y. Implementing PointNet for point cloud segmentation in the heritage context. Herit. Sci. 2023, 11, 2.
51. Luo, C.; Cheng, N.; Ma, S.; Xiang, J.; Li, X.; Lei, S.; Li, P. mini-PointNetPlus: A Local Feature Descriptor in Deep Learning Model for Real-time 3D Environment Perception. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates; IEEE: New York, NY, USA, 2024; pp. 9362–9366.
52. Lu, H.; Rezapour, M.; Baha, H.; Niazi, M.K.K.; Narayanan, A.; Gurcan, M.N. Gene pointNet for tumor classification. Neural Comput. Appl. 2024, 36, 21107–21121.
53. Ma, K.; Yan, F.; Li, S.; Huang, G.; Jia, X.; Wang, F.; Chen, L. Low-Overlap Registration of Multi-Source LiDAR Point Clouds in Urban Scenes Through Dual-Stage Feature Pruning and Progressive Hierarchical Methods. Remote Sens. 2025, 17, 2938.
54. Li, X.; Li, R.; Chen, G.; Fu, C.-W.; Cohen-Or, D.; Heng, P.-A. A Rotation-Invariant Framework for Deep Point Cloud Analysis. IEEE Trans. Vis. Comput. Graph. 2022, 28, 4503–4514.
55. Wang, J.; Liu, Y.; Tan, H.; Zhang, M. A survey on weakly supervised 3D point cloud semantic segmentation. IET Comput. Vis. 2024, 18, 329–342.
56. Lu, D.; Xie, Q.; Wei, M.; Gao, K.; Xu, L.; Li, J. Transformers in 3D Point Clouds: A Survey. arXiv 2022, arXiv:2205.07417.
57. Zhao, H.; Jiang, L.; Jia, J.; Torr, P.; Koltun, V. Point Transformer. arXiv 2020.
58. Thomas, H.; Qi, C.R.; Deschaud, J.-E.; Marcotegui, B.; Goulette, F.; Guibas, L.J. KPConv: Flexible and Deformable Convolution for Point Clouds. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2019.
59. Hu, Q.; Yang, B.; Xie, S.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds. arXiv 2019.
60. Zhang, X.; Lin, D.; Soergel, U. Target-aware attentional network for rare class segmentation in large-scale LiDAR point clouds. ISPRS J. Photogramm. Remote Sens. 2024, 220, 32–50.
61. Aissou, B.; Aissa, A.; Dairi, A.; Harrou, F.; Wichmann, A.; Kada, M. Building Roof Superstructures Classification from Imbalanced and Low Density Airborne LiDAR Point Cloud. IEEE Sens. J. 2021, 21, 14960–14976.
62. Liu, B.; Qi, X. Class-Balanced PolarMix for Data Augmentation of 3D LIDAR Point Clouds Semantic Segmentation. J. Internet Technol. 2025, 26, 65–75.
63. Nong, X.; Bai, W.; Liu, G. Airborne LiDAR point cloud classification using PointNet++ network with full neighborhood features. PLoS ONE 2023, 18, e0280346.
64. Chauhan, P.L.; Vijaywargiya, J.; Ramiya, A.M. Addressing class imbalance challenge in Semantic Segmentation of ALS data through performance analysis of RandLA-NET and PointNET ++. In Proceedings of the 2023 IEEE India Geoscience and Remote Sensing Symposium (InGARSS), Bangalore, India; IEEE: New York, NY, USA, 2023; pp. 1–4.
65. OpenTopography. Oregon Department of Geology and Mineral Industries Lidar Program Data; OpenTopography: La Jolla, CA, USA, 2011.
66. Fan, Z.; Wei, J.; Zhang, R.; Zhang, W. Tree Species Classification Based on PointNet++ and Airborne Laser Survey Point Cloud Data Enhancement. Forests 2023, 14, 1246.
67. Kouhi, R.M.; Daniel, S.; Giguère, P. Data Preparation Impact on Semantic Segmentation of 3D Mobile LiDAR Point Clouds Using Deep Neural Networks. Remote Sens. 2023, 15, 982.
68. Kim, D.-H.; Ko, C.-U.; Kim, D.-G.; Kang, J.-T.; Park, J.-M.; Cho, H.-J. Automated Segmentation of Individual Tree Structures Using Deep Learning over LiDAR Point Cloud Data. Forests 2023, 14, 1159.
69. Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. Learning Semantic Segmentation of Large-Scale Point Clouds with Random Sampling. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3083288.
70. Lin, C.-L.; Yang, H.-W.; Chuang, C.-H. Powerful Sample Reduction Techniques for Constructing Effective Point Cloud Object Classification Models. Electronics 2025, 14, 2439.
71. Li, D.; Wei, Y.; Zhu, R. A comparative study on point cloud down-sampling strategies for deep learning-based crop organ segmentation. Plant Methods 2023, 19, 124.
72. Seo, H.; Joo, S. Influence of Preprocessing and Augmentation on 3D Point Cloud Classification Based on a Deep Neural Network: PointNet. In Proceedings of the 2020 20th International Conference on Control, Automation and Systems (ICCAS), Busan, Republic of Korea; IEEE: New York, NY, USA, 2020; pp. 895–899.
73. Zhao, Y.; Bai, L.; Huang, X. FIDNet: LiDAR Point Cloud Semantic Segmentation with Fully Interpolation Decoding. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic; IEEE: New York, NY, USA, 2021; pp. 4453–4458.

Figure 1. Study area.

Figure 2. Training and test tiles.

Figure 3. PointNet2 MSG Transformer architecture.

Figure 4. Scale-Aware Transformer Fusion and Adaptive Scale Aggregation modules. Multi-scale MSG features are converted into tokens (T = 3), processed by a Transformer encoder to capture inter-scale dependencies, and adaptively aggregated using a softmax-based gating mechanism to produce a fused feature representation. The symbol “*” denotes element-wise multiplication, consistent with the formulation of f_{fused}.

Figure 5. Conceptual illustration of self-attention over multi-scale tokens for a single centroid. Each token corresponds to features extracted at a different neighborhood scale (T = 3). The Transformer encoder models inter-scale relationships through multi-head self-attention, enabling interaction between small-, medium-, and large-scale features before adaptive aggregation.

Figure 6. Ensemble and test-time augmentation inference pipeline used for final prediction. The symbol “*” denotes multiplication in the weighted averaging step.

Figure 7. Baseline model comparison—Overall Accuracy.

Figure 8. Baseline model comparison—mIoU.

Figure 9. Ablation Study—Test Overall Accuracy.

Figure 10. Ablation Study—Test mIoU.

Figure 11. Ensemble & TTA effect—Test Accuracy.

Figure 12. Ensemble & TTA effect—Test mIoU.

Figure 13. Ensemble & TTA Effect Accuracy and mIoU.

Figure 14. Comparison of Per-Class IoU (Best Single vs. Final).

Figure 15. Qualitative segmentation results on representative airborne LiDAR tiles (Sample 12 and Sample 47). For each sample: (a) input point cloud, (b) ground truth, (c) prediction of the best single model trained with Dice loss, and (d) prediction of the proposed method (Ensemble + TTA). The proposed method produces more consistent predictions and reduces misclassification noise, particularly in complex regions.

Figure 16. Qualitative comparison of different models on the same test sample (Sample 12). The figure shows the input point cloud, ground truth, and predictions from PointNet++ MSG, the proposed method (Ensemble + TTA), RandLA-Net, Point Transformer, and KPConv. The proposed method demonstrates more stable predictions and improved consistency across classes compared to baseline models.

Figure 17. Learning curves of the proposed model during training. The figure illustrates the progression of training loss and validation mIoU over the course of 25 epochs. As the training progresses, there is a gradual decrease in the loss function, accompanied by an increase and subsequent stabilisation of the validation mIoU. This outcome indicates the attainment of stable convergence and consistent learning behaviour.

Figure 18. Normalized confusion matrix for the final model (Ensemble + TTA).

Table 1. Point distribution by class.

ID	Class	Number of Points	Percentage
0	Unclassified	3,465,115	7.21%
1	Vegetation	27,513,533	57.21%
2	Ground	13,414,662	27.89%
3	Building	3,697,440	7.69%
Total		48,090,750	100%
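The percentages in Table 1 follow directly from the point counts, and the same counts give a simple measure of how imbalanced the dataset is. The short check below recomputes both; the ratio of the largest to the smallest class is an illustrative summary added here, not a figure reported by the authors.

```python
class_counts = {
    "Unclassified": 3_465_115,
    "Vegetation": 27_513_533,
    "Ground": 13_414_662,
    "Building": 3_697_440,
}

total = sum(class_counts.values())
shares = {name: 100.0 * n / total for name, n in class_counts.items()}

# Ratio between the most and least frequent classes (illustrative imbalance measure).
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

print("total points:", total)
for name, pct in shares.items():
    print(f"{name:12s} {pct:6.2f}%")
print(f"largest/smallest class ratio: {imbalance_ratio:.2f}")
```

Vegetation points outnumber Unclassified points by a factor of roughly eight, which is the imbalance that the loss-function comparison in the paper is meant to address.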

Table 2. Overall Performance Comparison.

Rank	Model	Test mIoU	Test Accuracy
1	PointNet2 MSG Transformer (Ensemble + TTA)	51.74%	61.50%
2	PointNet2 MSG Transformer (best only, Dice)	50.50%	60.83%
3	PointNet++ MSG	49.90%	57.42%
4	VoxelNet Lite	40.33%	49.35%
5	PointNet	37.86%	46.87%

Table 3. Notation used in the proposed Scale-Aware Transformer Fusion and Adaptive Scale Aggregation modules.

Symbol	Denotes
B	Batch size
N	Number of input points (e.g., 4096)
S	Number of centroids in the corresponding SA layer
T	Number of MSG scales (fixed at 3)
C	Token dimension, token_dim (128/256/512/1024)
w_t	Softmax gate weight for scale t

Table 4. Four Levels Across the Encoder: Hyperparameters.

Block	Number of Centroids (S)	Token_Dim (C)	Num_Heads	Num_Layers	Dim_Feedforward	Dropout
fuse1	1024	128	4	2	512	0.1
fuse2	256	256	4	2	1024	0.1
fuse3	64	512	8	2	2048	0.1
fuse4	16	1024	8	2	4096	0.1

Table 5. Model variants used in the ensemble.

Model	Loss Functions	Segmentation Head	Validation mIoU
dice_only	Dice Loss	2-layer head	51.52%
deeper_head	Dice Loss	4-layer head	50.35%
combined	Cross Entropy + Dice	2-layer head	48.22%

Table 6. The impact of ensemble and TTA methods on performance.

Methods	mIoU	Accuracy
Best dice_only	50.50%	60.83%
Dice + TTA	51.15%	61.35%
Ensemble	51.55%	60.97%
Ensemble + TTA	51.74%	61.50%

Table 7. Performance comparison between the proposed method and state-of-the-art deep learning architectures on the Oregon LiDAR dataset. The proposed method achieves the highest overall mIoU among the compared approaches.

Architecture	Test mIoU	Test Accuracy	Unclassified IoU	Vegetation IoU	Ground IoU	Building IoU
PointNet2 MSG Transformer (Ensemble + TTA)	51.74%	61.50%	79.89%	40.40%	44.65%	42.03%
PointNet2 MSG Transformer (best only, Dice)	50.50%	60.83%	78.09%	39.71%	44.29%	39.91%
PointNet++ MSG [17]	49.90%	57.42%	81.74%	25.08%	47.32%	45.44%
Point Transformer [57]	50.10%	61.39%	67.03%	40.60%	44.47%	48.28%
KPConv [58]	41.81%	57.99%	59.50%	43.00%	37.74%	26.99%
VoxelNet Lite [21]	40.33%	49.35%	73.26%	13.16%	45.75%	29.14%
PointNet [16]	37.86%	46.87%	70.68%	9.26%	45.12%	26.38%
RandLA-Net [59]	22.92%	45.70%	13.05%	36.86%	29.70%	12.07%
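The mIoU column in Table 7 is the unweighted mean of the four per-class IoU values, so each class counts equally regardless of how many points it contains. The short check below recomputes it for three rows of the table; the dictionary layout and function name are illustrative choices.

```python
# Per-class IoU in percent (Unclassified, Vegetation, Ground, Building), copied from Table 7.
per_class_iou = {
    "PointNet2 MSG Transformer (Ensemble + TTA)": [79.89, 40.40, 44.65, 42.03],
    "PointNet++ MSG": [81.74, 25.08, 47.32, 45.44],
    "Point Transformer": [67.03, 40.60, 44.47, 48.28],
}

def mean_iou(values):
    """Unweighted mean over classes."""
    return sum(values) / len(values)

miou = {name: mean_iou(values) for name, values in per_class_iou.items()}
for name, value in miou.items():
    print(f"{name}: mIoU = {value:.2f}%")
```

The recomputed means agree with the reported 51.74%, 49.90%, and 50.10% to within rounding, which confirms that the per-class columns are IoU values.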

Table 8. A comparison of the performance of different training configurations.

Model	Loss Function	Head	Test mIoU	Accuracy
Baseline	Cross Entropy	2-layer	47.71%	56.22%
Combined	CE + Dice	2-layer	47.67%	56.31%
Dice Only	Dice Loss	2-layer	50.50%	60.83%
Deeper Head	Dice Loss	4-layer	50.41%	60.69%
FocalDice	Focal + Dice	2-layer	45.57%	55.16%

Table 9. Class-wise accuracy of the proposed model for LiDAR-based semantic segmentation. The findings demonstrate that performance varies across classes, with higher accuracy observed for ground and unclassified points and comparatively lower performance for vegetation and building classes. This is indicative of the complexity of geometric and structural variations present in these categories.

Class	Accuracy
Unclassified	0.90
Vegetation	0.45
Ground	0.85
Building	0.64

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Sevinc, H.K.; Karas, I.R. Deep Learning-Based Semantic Segmentation of Airborne LiDAR Point Clouds Using a Transformer-Enhanced PointNet++ Architecture. Geomatics 2026, 6, 43. https://doi.org/10.3390/geomatics6030043
