Article

Transfer Learning Based on Multi-Branch Architecture Feature Extractor for Airborne LiDAR Point Cloud Semantic Segmentation with Few Samples

1 School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430072, China
2 Department of Oceanography, Dalhousie University, Halifax, NS B3H 4R2, Canada
3 Faculty of Resources and Environmental Science, Hubei University, Wuhan 430062, China
4 Tianjin Key Laboratory of Rail Transit Navigation Positioning and Spatio-Temporal Big Data Technology, Tianjin 300251, China
5 School of Electrical and Electronic Engineering, Wuhan Polytechnic University, Wuhan 430023, China
6 School of Resources Environment Science and Technology, Hubei University of Science and Technology, Xianning 437100, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(15), 2618; https://doi.org/10.3390/rs17152618
Submission received: 30 May 2025 / Revised: 20 July 2025 / Accepted: 24 July 2025 / Published: 28 July 2025

Abstract

The existing deep learning-based Airborne Laser Scanning (ALS) point cloud semantic segmentation methods require a large amount of labeled data for training, which is not always feasible in practice. Insufficient training data may lead to over-fitting. To address this issue, we propose a novel Multi-branch Feature Extractor (MFE) and a three-stage transfer learning strategy that conducts pre-training on multi-source ALS data and transfers the model to another dataset with few samples, thereby improving the model's generalization ability and reducing the need for manual annotation. The proposed MFE is based on a novel multi-branch architecture integrating the Neighborhood Embedding Block (NEB) and the Point Transformer Block (PTB); it aims to extract heterogeneous features (e.g., geometric features, reflectance features, and internal structural features) by leveraging the parameters contained in ALS point clouds. To address model transfer, a three-stage strategy was developed: (1) A pre-training subtask was employed to pre-train the proposed MFE when the source domain consisted of multi-source ALS data, overcoming parameter differences. (2) A domain adaptation subtask was employed to align cross-domain feature distributions between the source and target domains. (3) An incremental learning subtask was proposed for continuous learning of novel categories in the target domain, avoiding catastrophic forgetting. Experiments were conducted with a source domain consisting of the DALES and Dublin datasets and a target domain consisting of the ISPRS benchmark dataset. The experimental results show that the proposed method achieved the highest OA of 85.5% and an average F1 score of 74.0% using only 10% of the training samples, indicating that the proposed framework can reduce manual annotation by 90% while maintaining competitive classification accuracy.

1. Introduction

Airborne Laser Scanning (ALS) point clouds acquired by an airborne Light Detection and Ranging (LiDAR) system have demonstrated widespread applications in high-accuracy Digital Elevation Model (DEM) production [1], three-dimensional (3D) city modeling [2], vegetation detection and forest parameter retrieval [3], power line detection [4], and many others. One of the prerequisites for applying point clouds to the aforementioned fields is the extraction of thematic information from the original laser scanning data. In recent years, deep learning-based methods have achieved superior results in mobile laser scanning (MLS) and terrestrial laser scanning (TLS) point cloud semantic segmentation [5]. These methods can be summarized into three main categories [6], which are shown in Table 1. A recent comprehensive survey on deep learning techniques for 3D point cloud analysis was provided by Ref. [7]. Though they originated from the semantic segmentation of MLS and TLS point clouds, these methods have been extended to ALS point cloud analysis with minor modifications [8,9,10,11].
While deep learning techniques outperform traditional classifiers in various applications, they rely heavily on large volumes of manually annotated data for training. However, annotation remains a labor-intensive task, particularly for ALS point clouds. ALS point cloud data is not only costly to collect but also requires specialized expertise for annotation, incurring substantial time and financial expenses. Furthermore, like other machine learning methods, deep learning demands consistency in data distribution and feature space between the training dataset and the target application dataset. Failure to meet this requirement leads to degraded model performance, making it challenging to extend pre-trained models to datasets with inconsistent distributions. This issue is prevalent in airborne LiDAR data acquisition projects, where different LiDAR systems are deployed across diverse topographies. Data from various sensors may share common parameters such as 3D coordinates (X, Y, Z) while differing in sensor-specific attributes like intensity and multi-echo information. Additionally, category definitions vary across projects, resulting in datasets with distinct distributions. Traditional machine learning classifiers address such distribution discrepancies by retraining models from scratch using new data that matches the testing dataset’s distribution [22]. In contrast, a more efficient approach is to transfer knowledge learned from one task to other similar tasks with limited training data—a concept at the core of transfer learning.
The origins of transfer learning can be traced back to Bozinovski’s work in the 1970s [23]. Since then, numerous studies have applied this framework across diverse domains [24], including natural language processing, audio processing, image processing, and multi-source data fusion. In these fields, transfer learning has consistently demonstrated advantages in both efficiency and performance [25].
Transfer learning has also been investigated for ALS point cloud semantic segmentation and classification, aiming at reducing the requirements of annotation. Ma H et al. [26] first introduced model transfer for ALS point cloud ground filtering, which is used to segment ground points and non-ground points. In this study, the trained model was transferred to new data without additional training. The effectiveness of traditional classifiers such as random forest was verified. However, the performance of deep learning methods is often superior to that of traditional classifiers and does not require manual feature design and selection [27]. Some works extend transfer learning to deep learning methods. Peng et al. [28] and Xie et al. [29] introduced Unsupervised Domain Adaptation (UDA), where models pre-trained in the source domain were transferred to a target domain through correlation alignment and domain adversarial learning. However, these methods assume that the source and target data share a consistent label and feature space, which is often not the reality. Zhao et al. [30] converted point clouds into 2D representations and utilize pre-trained 2D networks for classification. Due to the loss of complex 3D geometric information, the performance for large scenes was limited [31].
In order to address the issue of inconsistent labels between source and target domains while directly acting on 3D points to avoid information loss caused by projection, Dai et al. [32] proposed a cross-domain incremental learning method with few samples. In this study, a photogrammetric point cloud and ALS point clouds were selected as the source domain and target domain, respectively. A conditional domain adversarial network (CDAN) [33] was introduced for domain adaptation of base categories consisting of ground, vegetation, and building. Incremental learning was used to acquire knowledge of novel categories in the target domain. Although the authors provided a novel approach of transfer learning for ALS point cloud learning using few samples, the performance of the transferred model was not as perfect as expected, which may have been due to the significant discrepancies between the source and target domains. These discrepancies can be summarized as follows:
(1)
Different scenes from where a LiDAR system collects data. This affects the number of categories and the feature distribution of ground objects. More complex scenes contain more object categories, leading to a harder semantic segmentation task.
(2)
Inconsistent parameters recorded in the point clouds. In addition to the common XYZ values, photogrammetric point clouds generally include the colors RGB, while ALS data may contain intensity and echo-related parameters. When the parameters of the target domain and the source domain are different, the transferability of the features learned by the pre-trained model is poor, and directly applying the pre-trained model to the target domain may result in unsatisfactory performance.
(3)
Inconsistencies in point density. Due to the variety of flight altitudes and LiDAR systems used in practice, data collected can have significant differences in point density, ranging from a few points to several hundred points per square meter. The same ground object may exhibit various features under inconsistent point densities; ideally, learned features should be density-invariant.
To address these challenges and to utilize multi-source ALS data as the source domain, a novel Multi-branch Feature Extractor and a novel transfer learning strategy were proposed, aimed at improving the performance of the model and reducing the amount of annotated data. The proposed method addresses the following issues: (1) It fully utilizes the parameters contained in ALS point clouds to learn salient features for semantic segmentation. (2) It overcomes difficulties in model transfer caused by differences in types of parameters between source domain and target domain. (3) It deals with inconsistent categories between source domain and target domain during domain adaptation. The main contributions include:
(1)
A novel multi-branch architecture was proposed to learn heterogeneous features (such as geometric features, reflection features, and internal structural features). Based on this architecture, a feature extraction network named Multi-branch Feature Extractor (MFE) was proposed, where each branch consists of the proposed Neighborhood Embedding Block (NEB) and the existing Point Transformer Block (PTB). A pre-training strategy is expounded to address the difference in parameters between source and target domains: A specific branch was pre-trained and transferred through domain adaptation. The number of branches is adjustable to fit the scenarios where the source domain consists of multiple point cloud datasets with inconsistent parameters.
(2)
A Multi-class Domain Adversarial Network (MDAN) was proposed to transfer the pre-trained feature extractor from source domain to target domain. By setting up independent discriminators for each base category, it promotes positive transfer of the base categories between source domain and target domain, avoiding negative transfer caused by unique categories.
(3)
An incremental learning strategy consisting of a Knowledge Transfer Module (KTM) and a Feature Separation Module (FSM) was designed to enable the model to segment novel categories while avoiding catastrophic base category data loss. For base categories, a Knowledge Transfer Module (KTM) consisting of Kullback–Leibler (KL) divergence and feature similarity was used to transfer knowledge. For novel categories, an FSM that utilized orthogonal loss functions as constraints was proposed to learn unique features for distinguishing base categories.
The remainder of this paper is organized as follows: Section 2 describes the proposed Multi-branch Feature Extractor and transfer learning strategy for ALS point cloud semantic segmentation with few samples. Section 3 describes the experimental results, including datasets, evaluation metrics, segmentation results, and method comparison. Section 4 presents the ablation experiments and discussions. Section 5 provides the conclusions and future directions for our work.

2. Method

In this section, the proposed method is described from five aspects: problem definition, Multi-branch Feature Extractor (MFE), pre-training subtask, domain adaption subtask, and incremental learning subtask.
The workflow of the proposed method is illustrated in Figure 1. To provide a clearer description, the same categories in the source and target domains are defined as base categories, and the unique categories in the target domain are defined as novel categories. The detailed definition is presented in Section 2.1. Existing studies have shown that asymmetric mapping can better simulate the differences in low-level features than symmetric mapping [34]; therefore, the pre-training subtask was separated from the domain adaptation subtask. Another advantage is that when facing a new dataset, there is no need for additional pre-training. At the same time, due to the presence of novel categories in the target domain, the model should continue to learn and acquire the ability to segment novel categories. However, introducing novel categories into the model may lead to a sudden loss of knowledge regarding base categories—a phenomenon known as catastrophic forgetting in transfer learning [35]. To mitigate this issue, an incremental learning subtask is proposed to enable the model to segment novel categories while avoiding catastrophic forgetting.
Hence, the proposed method consists of three modules corresponding to three subtasks: pre-training subtask, domain adaptation subtask, and incremental learning subtask. In the pre-training subtask, multi-source ALS data was used to pre-train the Multi-branch Feature Extractor F s , and semantic segmentation loss was used to constrain the learning process. In the domain adaptation subtask, target subdomain D t _ b with base categories were used to train F t _ b , and weights of F t _ b were initialized by F s . Multi-class Domain Adversarial Network (MDAN) was applied to learn domain invariant features. Semantic segmentation loss was used to learn the target domain’s unique features and train the classifier G t _ b of the target domain. In the incremental learning subtask, through KTM and FSM, target domain feature extractor F t and classifier G t were able to segment target domain data D t _ n with novel categories while retaining its ability to segment D t _ b . The finally obtained F t and G t had the abilities of semantic segmentation for the whole target domain D t .

2.1. Problem Definition

The benefits of transfer learning are generally believed to come from reusing the pre-trained feature hierarchy [36]. In the present study, features will be learned from a source domain that consists of inconsistent training datasets, and the learned features will be transferred to the downstream segmentation task, where there are novel categories. Therefore, the objective is to identify a feature representation that maximizes task relevance between the source and target domain while minimizing domain discrepancies. Considering the source domain D s = X s ,   C s and target domain D t = X t ,   C t , both comprise the parameters X and labels C. The purpose of transfer learning is to transfer the pre-trained feature extractor   F s to the target domain as F t . The same categories in both source and target domains are regarded as base categories C b = C s C t , and the unique categories of the target domain are regarded as novel categories C n = C t C b . Therefore, the target domain can be decomposed into subdomains D t _ b = X t _ b ,   C b and D t _ n = X t _ n ,   C n .
Armed with the above definition, the three subtasks shown in Figure 1 can be summarized and formulated as:
(1)
Pre-training subtask: Feature extractor F s was pre-trained in source domain D s .
(2)
Domain adaptation subtask: For the base categories C b , domain adaptation was employed to transfer the pre-trained F s adapted to the target domain as F t _ b in the subdomain D t _ b : F s D t _ b F t _ b , and target classifier G t _ b was trained in the subdomain D t _ b .
(3)
Incremental learning subtask: The feature extractor F t and classifier G t should be able to segment C n while retaining the ability to segment for C b in D t : F t _ b D t F t , G t _ b D t G t .

2.2. Multi-Branch Feature Extractor (MFE)

Given that the source domain may comprise multi-source datasets with inconsistent categories, point densities, and other attributes, a Multi-branch Feature Extractor (MFE) based on the Point Transformer was designed, as illustrated in Figure 2. The input data was divided into different branches, such as a coordinate branch, an intensity branch, and an echo-related branch, according to the types of parameters they contained. Feature aggregation within each branch was conducted by extended Point Transformer with proposed Neighborhood Embedding Block (NEB). Finally, the features learned from multiple branches were concatenated and fed into subsequent processing, such as classifiers.

2.2.1. Neighborhood Embedding Block (NEB)

In the classical Point Transformer, embedded features are encoded via a shared-weight MLP before being fed into the Point Transformer Block, which is effective for the TLS and MLS point clouds. However, this encoding method is hard to extend to ALS point clouds. For instance, ground points may be distributed arbitrarily across a scene, meaning two points that are far apart in coordinate space could belong to the same category. In contrast, directly encoding coordinate parameters using a shared-weight MLP would lead to significantly different feature representations for points of the same category.
In traditional classifiers for airborne LiDAR point clouds, it is necessary to handcraft features from raw data and input them into a classifier [37]. Neighborhood-based features, such as eigenvalues, play a crucial role in point cloud classification. Drawing on this insight, we hypothesize that incorporating point neighborhood information into the embedding module can effectively enhance the overall performance of the network. As shown in Figure 3, the NEB includes three steps: Firstly, parameters of the current point were embedded using MLP1. Secondly, parameters of the K-nearest-neighboring points were embedded by MLP2. Finally, the two encoded features were aggregated to obtain the finally embedded features through MLP3.
The pseudo code for the NEB is shown in Algorithm 1:
Algorithm 1. Neighborhood Embedding Block
Input: ALS point cloud data $X \in \mathbb{R}^{N \times D_{in}}$, where $N$ is the number of points and $D_{in}$ is the dimension of the input. For instance, if the input is X, Y, and Z, then $D_{in} = 3$.
Output: Embedded features $F \in \mathbb{R}^{N \times D_{out}}$, where $D_{out}$ is the dimension of the embedded features.
Initialization: $P$ is the point set, $K$ is the number of neighboring points.
1: For $p_i$ $(i = 1, 2, 3, \ldots, N)$ in $P$:
2:  Search for the $K$ nearest neighboring points of the current point $p_i$; the points within the neighborhood are denoted by $X_{p_i} \in \mathbb{R}^{K \times D_{in}}$.
3:  Obtain the original parameters of the current point $p_i$ from the point cloud data $X \in \mathbb{R}^{N \times D_{in}}$ as $X_{p_i} \in \mathbb{R}^{D_{in}}$.
4:  Encode the current point using MLP1: $X_{p_i} \in \mathbb{R}^{D_{in}} \xrightarrow{\mathrm{MLP1}} F_o^{p_i} \in \mathbb{R}^{D_{out}}$.
5:  Encode the neighborhood of the current point using MLP2: $X_{p_i} \in \mathbb{R}^{K \times D_{in}} \xrightarrow{\mathrm{MLP2}} F_n^{p_i} \in \mathbb{R}^{D_{out}}$.
6:  Combine the encoded features to obtain the final features through MLP3: $[F_o^{p_i}, F_n^{p_i}] \xrightarrow{\mathrm{MLP3}} F^{p_i} \in \mathbb{R}^{D_{out}}$.
7: End for
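To make Algorithm 1 concrete, a minimal PyTorch sketch of a neighborhood embedding layer is given below. The module name, the layer widths, the brute-force k-NN search, and the max pooling over neighbor embeddings are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn

class NeighborhoodEmbeddingBlock(nn.Module):
    # Sketch of the NEB: embed the current point (MLP1) and its K nearest neighbors (MLP2),
    # then fuse the two embeddings with a third MLP (MLP3), following Algorithm 1.
    def __init__(self, d_in, d_out, k=16):
        super().__init__()
        self.k = k
        self.mlp1 = nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU())       # current point
        self.mlp2 = nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU())       # neighboring points
        self.mlp3 = nn.Sequential(nn.Linear(2 * d_out, d_out), nn.ReLU())  # fusion

    def forward(self, xyz, feats):
        # xyz: (N, 3) coordinates used for the k-NN search; feats: (N, d_in) branch-specific inputs.
        dists = torch.cdist(xyz, xyz)                         # (N, N) pairwise distances (brute force)
        knn_idx = dists.topk(self.k, largest=False).indices   # (N, K) neighbor indices
        f_o = self.mlp1(feats)                                # (N, d_out) current-point embedding
        f_n = self.mlp2(feats[knn_idx]).max(dim=1).values     # (N, d_out) pooled neighborhood embedding
        return self.mlp3(torch.cat([f_o, f_n], dim=-1))       # (N, d_out) fused embedding

# Usage: embed the intensity of a block of 4096 points with K = 16.
points = torch.rand(4096, 3)
intensity = torch.rand(4096, 1)
neb = NeighborhoodEmbeddingBlock(d_in=1, d_out=64, k=16)
embedded = neb(points, intensity)   # shape (4096, 64)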

2.2.2. Concept of Multi-Branch Architecture

In the existing deep learning methods for point cloud semantic segmentation, the most commonly selected parameters include coordinate values, intensity, echo-related parameters, etc. These parameters are fed to the network simultaneously for feature learning. Though simple in network design, this practice might lead to inconspicuous features outweighing the salient ones, because different parameters reflect different characteristics of ground objects, thereby decreasing the final segmentation accuracy. Numerous studies have explored the relationship between input parameters and segmentation accuracy [38,39]. However, these studies also adopt the simultaneous input of parameters, leading to severe coupling of the influences exerted by different parameters. This coupling hinders the ability to discern the relationship between learned features and ground objects.
Thus, to our knowledge, there is currently limited research on the impact of different parameters on ALS point cloud semantic segmentation. In order to demonstrate this impact more intuitively, t-Distributed Stochastic Neighbor Embedding (t-SNE) was employed to evaluate the effectiveness of the learned features. The experiments were conducted on the ISPRS benchmark dataset with the Point Transformer. The preprocessing of the input data was the same as that of existing deep learning methods. The network was trained using the training set and was applied to the testing set. The features input into the classifier were taken out for visualization, as shown in Figure 4. Four combinations were tested separately:
(a)
Mixed parameters (including XYZ values, intensity, and echo-related parameters), which are also commonly input into current deep learning methods for ALS point cloud analysis;
(b)
Echo-related parameters, which reflect the internal structure of ground objects;
(c)
Intensity, which characterizes the backscattering coefficient of ground objects;
(d)
XYZ values, which record the three-dimensional coordinate information of ground objects and are also the most commonly used parameters in traditional machine learning methods.
From Figure 4, it can be seen that when mixed parameters are input, the learned features are somewhat separable among low_veg, imp_surf, tree, and roof. However, the boundaries between these categories remain indistinct, and the intra-class distances are large. For the other five categories, the learned features are not clearly distinguishable. When the echo-related parameter is used, low_veg and imp_surf can be distinguished from tree and roof, which may be due to differences in internal structure. When intensity is used, low_veg and imp_surf can be distinguished, mainly due to their different backscattering coefficients. When XYZ values are used, tree and roof can be distinguished, which may be due to the differences in their geometric structures. Additionally, advantageous features of cars and fac can be learned when intensity and XYZ values are input separately.
As revealed by the t-SNE, features learned from different parameters are capable of distinguishing between various ground objects. Furthermore, since source domain data originates from diverse datasets with inconsistent parameter configurations, the number of inputs for feature learning remains variable. To address this, a multi-branch architecture is introduced. A key characteristic of this structure is that each specific branch is assigned distinct inputs and tasked with learning corresponding specific features, with the final output features being utilized for subsequent processing.
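As an illustration, the per-parameter feature inspection described above could be scripted as follows. The file names, the choice of t-SNE hyperparameters, and the plotting details are assumptions for demonstration, not the exact pipeline behind Figure 4.

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# features: (N, D) array dumped from the layer feeding the classifier of a Point Transformer
# trained with one parameter combination (e.g., XYZ only); labels: (N,) ground-truth classes.
features = np.load("features_xyz.npy")   # hypothetical feature dump
labels = np.load("labels.npy")           # hypothetical label dump

embedded = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)

plt.figure(figsize=(6, 6))
scatter = plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=1, cmap="tab10")
plt.legend(*scatter.legend_elements(), title="class", loc="best", fontsize=7)
plt.title("t-SNE of features learned from XYZ values")
plt.savefig("tsne_xyz.png", dpi=300)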

2.2.3. Construction of MFE Based on Multi-Branch Architecture

With the aforementioned facts in mind, MFE was proposed based on extended Point Transformer and multi-branch architecture. MFE consists of three branches corresponding to three types of parameters. Each branch consists of NEB and PTB. The three branches are:
(a)
Coordinate branch: The input includes the X, Y, and Z values. In the NEB, besides the X, Y, and Z values, the relative Z values of the K-nearest-neighboring points of the current point were input into the embedding features. The relative Z values were included because they are significant features for distinguishing ground objects [40].
(b)
Intensity branch: The input is the intensity. In the NEB, the intensity of the current point and that of the K-nearest-neighboring points were input into the embedding features.
(c)
Echo-related branch: The input is the preprocessed echo-related parameters. In the NEB, echo-related parameters of the current point and those of the K-nearest-neighboring points were input into embedding features.
Preprocessing is necessary for the coordinate values, intensity and echo-related parameters, as was described in detail in Section 3.2. The value K was determined by trial and error.
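A minimal sketch of how the three branches could be assembled is given below. It assumes each branch is a NEB followed by Point Transformer layers (abstracted here as a generic per-branch encoder) and that the branch outputs are simply concatenated before the classifier; the names, placeholder encoders, and feature dimensions are illustrative.

import torch
import torch.nn as nn

class MultiBranchFeatureExtractor(nn.Module):
    # Sketch of the MFE: one encoder per parameter group, with outputs concatenated.
    def __init__(self, branch_encoders):
        # branch_encoders: dict mapping branch name -> nn.Module (e.g., a NEB + PTB stack)
        super().__init__()
        self.branches = nn.ModuleDict(branch_encoders)

    def forward(self, inputs):
        # inputs: dict mapping branch name -> per-point tensor for that branch
        feats = [self.branches[name](x) for name, x in inputs.items()]
        return torch.cat(feats, dim=-1)   # (N, sum of branch feature dims)

# Hypothetical usage with three branches; each placeholder encoder maps its input to 64-dim features.
encoders = {
    "coord": nn.Sequential(nn.Linear(4, 64), nn.ReLU()),       # XYZ + relative Z (stand-in for NEB + PTB)
    "intensity": nn.Sequential(nn.Linear(1, 64), nn.ReLU()),
    "echo": nn.Sequential(nn.Linear(1, 64), nn.ReLU()),
}
mfe = MultiBranchFeatureExtractor(encoders)
block = {"coord": torch.rand(4096, 4), "intensity": torch.rand(4096, 1), "echo": torch.rand(4096, 1)}
features = mfe(block)   # shape (4096, 192)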

2.3. Pre-Training Subtask

The purpose of the pre-training subtask is to pre-train the feature extractor with multi-source ALS data. The larger the amount of pre-training data, the better the generalization performance of the pre-trained model [41]. Thus, we aimed to utilize as much data as possible for pre-training. However, multi-source ALS data involves diverse types of parameters. To address this issue, we constructed a three-branch architecture for the feature extractor, specifically comprising a coordinate branch, an intensity branch, and an echo-related branch. Each branch consists of an MFE and an MLP-based classifier.
The overall framework of the pre-training subtask is illustrated in Figure 5. First, the source domain data is divided into square blocks. Given the potential impact of point density variations on model transferability, voxelization is performed prior to feature learning. During voxelization, one point is randomly selected from each voxel, and these selected points are fed into the MFE. After voxelization, the point cloud density primarily depends on the voxel size—an approach that effectively mitigates differences in point cloud density across multi-source ALS data. To achieve this, the voxel size is determined based on the lowest point density among the datasets in both the source and target domains.
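The voxel-based density normalization described above can be sketched as follows: points are binned into cubic voxels and one point per voxel is randomly retained. The NumPy-based implementation and the default 0.5 m voxel size are assumptions for illustration.

import numpy as np

def voxel_downsample(points, voxel_size=0.5, rng=None):
    # Keep one randomly chosen point per occupied voxel.
    # points: (N, D) array whose first three columns are X, Y, Z; returns indices of retained points.
    rng = rng or np.random.default_rng(0)
    voxel_ids = np.floor(points[:, :3] / voxel_size).astype(np.int64)
    _, inverse = np.unique(voxel_ids, axis=0, return_inverse=True)   # voxel index of every point
    order = rng.permutation(len(points))                             # random tie-breaking within voxels
    keep = np.zeros(len(points), dtype=bool)
    seen = set()
    for idx in order:                       # keep the first shuffled point seen in each voxel
        v = inverse[idx]
        if v not in seen:
            seen.add(v)
            keep[idx] = True
    return np.flatnonzero(keep)

# Usage: thin a 50 m x 50 m block before feeding it to the MFE.
block = np.random.rand(100000, 3) * 50.0
sampled_block = block[voxel_downsample(block, voxel_size=0.5)]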
The feature extractor was optimized through semantic segmentation loss function in each branch:
$\arg\min_{G_s, F_s} L_{seg\_s} = -\sum_{i=1}^{C_b} y_i \log\left(G_s\left(F_s\left(x_i\right)\right)\right)$,
where $y_i$ is the true label, $x_i$ is the input of each branch, $F_s$ is the feature extractor, and $G_s$ is the MLP-based classifier.
The semantic segmentation loss in the pre-training subtask can be defined as:
$L_{seg\_s} = L_{seg\_C} + L_{seg\_I} + L_{seg\_E}$,
where $L_{seg\_C}$, $L_{seg\_I}$, and $L_{seg\_E}$ are the segmentation losses of the coordinate, intensity, and echo-related branches, respectively.
Furthermore, appropriate branches are selected based on the shared parameter types between the target domain and source domain to learn transferable features. For instance, if the source domain lacks intensity data, only the coordinate branch and echo-related branch undergo pre-training in the subtask before being transferred to the target domain. As for the intensity branch, supervised training is conducted using the target domain training set.

2.4. Domain Adaptation Subtask

In order to transfer the pre-trained $F_s$ to the target domain to obtain $F_{t\_b}$, two modules are proposed: (1) the Domain Adaptation Module, which aims to learn domain-invariant features, and (2) the Semantic Segmentation Module, which is designed to learn target domain-specific features and to train the target classifier $G_{t\_b}$. A detailed description of these two modules is provided below.
(1) Domain Adaptation Module
Domain adaptation is recognized as a special case of transfer learning [42]. In recent years, deep domain adaptation has been proposed to integrate the properties of deep networks with domain adaptation techniques. Domain-Adversarial Training of Neural Networks (DANN) [43] was the first to introduce adversarial learning into deep domain adaptation; it maps source and target domain data symmetrically into a new feature space to learn domain-invariant features. CDAN [33] introduced category information to allow the network to learn domain-invariant features corresponding to categories. Dai [32] applied CDAN to ALS data to achieve domain adaptation. However, CDAN performs optimally when the source and target domains share identical categories. Thus, directly using CDAN may cause the following issues: (1) Classifying the source's unique categories and the target's novel categories into one category will weaken the feature extractor. (2) The source's unique categories and the target's novel categories may lead to negative transfer. (3) It cannot be guaranteed that all samples used for domain adaptation obtain the correct category, and misclassified samples may also lead to negative transfer.
To address the above issues, inspired by Multi-Adversarial Domain Adaptation (MADA) [44], a separate adversarial network named the Multi-class Domain Adversarial Network (MDAN) was proposed, as shown in Figure 6 and Figure 7. As shown in Figure 6, domain-adversarial training was applied to each branch of the pre-trained Multi-branch Feature Extractor, aligning the feature distributions through MDAN. As shown in Figure 7, in MDAN, one discriminator corresponds to one base category; hence, the number of discriminators equals the number of base categories. The advantage of setting multiple discriminators is to alleviate negative transfer by preventing false alignment of modes in different distributions across domains. In the implementation of MDAN, the output features $f_s$ of the source feature extractor and the output features $f_t$ of the target feature extractor were selected by their labels separately to obtain $f_{s\_b}$ and $f_{t\_b}$ as the input of MDAN. In MDAN, multiple domain discriminators were designed, with the same number as the base categories $C_b$. $f_{s\_b}$ and $f_{t\_b}$ were input into the MLP-based domain discriminators to obtain the domain outputs, and the adversarial loss was calculated and backpropagated through a Gradient Reverse Layer (GRL). Compared to MADA, our method has the following advantages: (1) freezing the pre-trained feature extractor helps to preserve the feature extraction abilities learned from the source domain and improves the ability to extract domain-invariant features, and (2) using labels instead of category probabilities to determine which discriminator a sample should be input into helps with positive transfer, avoiding negative transfer caused by misclassification.
Based on the above analysis, the loss function within each domain discriminator is calculated as follows:
$\arg\min_{D} \arg\max_{F_{t\_b}} L_{MDAN}^{i} = \sum_{x_j \in D_s,\ x_k \in D_{t\_b}} \left[ L_D\left(D\left(F_s\left(x_j\right)\right), d_j\right) + L_D\left(D\left(F_{t\_b}\left(x_k\right)\right), d_k\right) \right]$,
where $i$ represents the $i$-th domain discriminator, $L_D$ is the loss function, $D$ is the MLP-based domain discriminator, and $d_j$ and $d_k$ are the domain labels.
Then, the total adversarial loss is:
$\arg\min_{D} \arg\max_{F_{t\_b}} L_{adv\_MDAN} = \sum_{j=1}^{B} \sum_{i \in C_b} L_{MDAN}^{ij}$,
where $B$ is the number of branches.
To reduce computational complexity and avoid domain label imbalance caused by significant differences in data volume between the source and target domains, an equal number of blocks are randomly selected from both domains in each epoch for domain adversarial learning.
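A minimal sketch of this adversarial component is given below, assuming a standard gradient reversal layer and one small MLP discriminator per base category; the per-category feature routing by label, the binary cross-entropy domain loss, and all layer sizes are illustrative assumptions rather than the exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    # Gradient Reverse Layer: identity in the forward pass, negated (scaled) gradient in the backward pass.
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

class MDAN(nn.Module):
    # One binary domain discriminator per base category.
    def __init__(self, feat_dim, num_base_classes, lamb=1.0):
        super().__init__()
        self.lamb = lamb
        self.discriminators = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))
            for _ in range(num_base_classes)
        ])

    def forward(self, feats, labels, domain_label):
        # feats: (N, feat_dim) features of one domain; labels: (N,) base-category indices;
        # domain_label: 0.0 for source points, 1.0 for target points.
        loss = feats.new_zeros(())
        reversed_feats = GradReverse.apply(feats, self.lamb)
        for c, disc in enumerate(self.discriminators):
            mask = labels == c          # route each point to the discriminator of its own category
            if mask.any():
                logits = disc(reversed_feats[mask]).squeeze(-1)
                target = torch.full_like(logits, domain_label)
                loss = loss + F.binary_cross_entropy_with_logits(logits, target)
        return loss

# Usage inside one branch of the domain adaptation subtask (f_s_b from the frozen source extractor,
# f_t_b from the target extractor being trained):
# adv_loss = mdan(f_t_b, labels_t, 1.0) + mdan(f_s_b.detach(), labels_s, 0.0)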
(2) Semantic Segmentation Module
After domain adaptation, $F_{t\_b}$ obtains the ability to extract domain-invariant features. However, the target domain and the source domain are not exactly the same. In order to learn target domain-specific features and train the target classifier $G_{t\_b}$, the semantic segmentation loss is used, defined as follows:
$\arg\min_{G_{t\_b}, F_{t\_b}} L_{seg\_base} = -\sum_{x_i \in D_{t\_b}} y_i \log\left(G_{t\_b}\left(F_{t\_b}\left(x_i\right)\right)\right)$,
Therefore, the overall loss function of the domain adaptation subtask can be defined as:
$L_{DA} = L_{seg\_base} + \lambda_1 L_{adv\_MDAN}$,
where $\lambda_1$ is a learnable weight parameter.

2.5. Incremental Learning Subtask

After the domain adaptation subtask, $F_{t\_b}$ obtains the ability to extract features for $C_b$. Due to the presence of novel categories, the feature extractor should continue to learn for $C_n$. If only $C_n$ is used for learning, the feature extractor would inevitably fall into catastrophic forgetting of $C_b$. If $C_n$ and $C_b$ are used to train $F_t$ and $G_t$ together, the model would lose its generalization ability for $C_b$ because the source domain is no longer visible. To address this issue, an incremental learning strategy was proposed that integrates knowledge transfer and feature separation. This strategy allows the model to learn to segment $C_n$ while avoiding catastrophic forgetting of $C_b$ and retaining the generalization ability learned from the source domain. This subtask consists of two modules: (1) the Knowledge Transfer Module for $C_b$ and (2) the Feature Separation Module for $C_n$. Next, we will provide a detailed description of these two modules.
(1) Knowledge Transfer Module (KTM)
The purpose of knowledge transfer is to enable the student model (SM) to retain the feature extraction capability of the teacher model (TM). Dai [32] introduced KL divergence as $L_{KL}$ into incremental learning to avoid catastrophic forgetting. However, the dimension of the classifier's output layer was expanded, which could lead to two identical points having different category probability distributions and reduce the performance of knowledge transfer. Therefore, in addition to $L_{KL}$, feature similarity was introduced as a constraint to overcome these differences. According to [45], the Manhattan distance provides better similarity measurement in prototype networks than the Euclidean, cosine, and Chebyshev distances, and it is also more stable. Therefore, the Manhattan distance was chosen for the feature similarity measurement. The loss function of feature similarity can then be defined as:
$\arg\min_{F_t, G_t} L_{feat} = \frac{1}{N} \sum_{x_i \in D_{t\_b}} \left\| F_t\left(x_i\right) - F_{t\_b}\left(x_i\right) \right\|_1$,
Then, the loss function in the KTM consists of the KL divergence $L_{KL}$ and the feature similarity $L_{feat}$:
$L_{KT} = L_{KL} + \lambda_2 L_{feat}$,
where $\lambda_2$ is a learnable weight parameter.
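The two terms of the knowledge-transfer loss could be implemented as sketched below, with teacher outputs from the frozen domain-adapted model and student outputs from the model being trained; the exact KL formulation (no temperature) and the mean L1 (Manhattan) feature distance are illustrative assumptions.

import torch
import torch.nn.functional as F

def knowledge_transfer_loss(student_logits, teacher_logits, student_feats, teacher_feats, lambda2=1.0):
    # L_KT = KL(teacher || student) + lambda2 * mean Manhattan distance between features (base categories only).
    l_kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")
    l_feat = (student_feats - teacher_feats).abs().sum(dim=-1).mean()   # L1 (Manhattan) distance
    return l_kl + lambda2 * l_feat

# Usage: teacher tensors are detached so that only the student model F_t / G_t is updated.
# l_kt = knowledge_transfer_loss(g_student, g_teacher.detach(), f_student, f_teacher.detach())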
(2) Feature Separation Module (FSM)
Due to the presence of the novel categories $C_n$, Dai [32] introduced an adversarial learning module, hoping the model would learn features that distinguish $C_b$ from $C_n$. However, adversarial learning is not stable, and additional domain discriminators increase the computational complexity. To solve this problem, the idea of feature separation was introduced, using an orthogonal loss function as a constraint so that the network learns unique and separable features for $C_n$. Assuming that the features of $C_b$ are orthogonal to those of $C_n$, the classifier can independently activate both types of features through a linear transformation. Orthogonal constraints are established between the output features of $F_t$ and $F_{t\_b}$.
$\arg\min_{F_t} L_{Orth} = \left\| F_{t\_b}\left(x_i\right)^{T} F_t\left(x_i\right) \right\|_{Fro}^{2}, \quad x_i \in D_{t\_n}$,
In order to make the learned features more effective for $C_n$, the semantic segmentation loss was used as an auxiliary:
$\arg\min_{G_t, F_t} L_{seg\_novel} = -\sum_{x_i \in D_{t\_n}} y_i \log\left(G_t\left(F_t\left(x_i\right)\right)\right)$
Then, the loss function in the FSM consists of $L_{Orth}$ and $L_{seg\_novel}$:
$L_{FS} = L_{Orth} + \lambda_3 L_{seg\_novel}$,
where $\lambda_3$ is a learnable weight parameter.
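A possible realization of the feature-separation loss on novel-category points is sketched below; the squared Frobenius norm of the cross-correlation between frozen and new features follows the formulation above, while the normalization by the number of points and the use of standard cross-entropy are assumptions.

import torch
import torch.nn.functional as F

def feature_separation_loss(new_feats, frozen_feats, new_logits, novel_labels, lambda3=1.0):
    # L_FS = ||F_frozen^T F_new||_Fro^2 + lambda3 * cross-entropy on the novel categories.
    cross = frozen_feats.transpose(0, 1) @ new_feats        # (D, D) feature cross-correlation matrix
    l_orth = cross.pow(2).sum() / new_feats.shape[0]        # squared Frobenius norm, scaled by point count
    l_seg = F.cross_entropy(new_logits, novel_labels)       # auxiliary segmentation loss for C_n
    return l_orth + lambda3 * l_seg

# Usage: frozen features come from the domain-adapted extractor and are detached.
# l_fs = feature_separation_loss(f_new, f_frozen.detach(), logits_new, labels_novel)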
In summary, the total loss function of the incremental learning subtask can be defined as:
$L_{IL} = L_{KT} + \lambda_4 L_{FS}$,
where $\lambda_4$ is a learnable weight parameter.
Pseudocodes of the three-stage transfer learning strategy are provided as shown in Algorithm 2:
Algorithm 2. Three-stage transfer learning strategy
Pre-training subtask
Input: Source domain $X^{N_s}$, where $N_s$ is the total number of blocks after gridding the source domain.
Output: Pre-trained MFE$_P$, consisting of the Multi-branch Feature Extractor $F_s$ and an MLP-based classification layer within each branch. The number of branches is set to 3, namely the coordinate branch MFE-C, the intensity branch MFE-I, and the echo-related branch MFE-E.
1: For each block $X_i$ in $X^{N_s}$:
2:  Train MFE$_P$-C, MFE$_P$-I, and MFE$_P$-E using $X_i$ and calculate $L_{seg\_s}$.
3:  Backpropagate $L_{seg\_s}$ and update MFE$_P$.
4: End for
Domain adaptation subtask
Input: Source domain $X^{N_s}$ and target domain with labeled base categories $X^{N_{t\_b}}$, where $N_s$ and $N_{t\_b}$ are the numbers of blocks after gridding the source and target domain, respectively. Frozen pre-trained MFE$_P$. MDAN consists of $k$ MLP-based domain discriminators, where $k$ is the number of base categories.
Output: Domain-adapted MFE$_{DA}$, consisting of the Multi-branch Feature Extractor $F_{t\_b}$ and the MLP-based classification layer $G_{t\_b}$.
1: For $i$ in $[1, 2, 3, \ldots, N_{t\_b}]$:
2:  Given a block $X_s^i$ from the source domain and a block $X_t^i$ from the target domain.
3:  For each branch $j$ in {MFE-C, MFE-I, MFE-E}:
4:   Features $f_t$, $f_s$ are obtained by: $X_t^i \xrightarrow{F_{t\_b}^j} f_t$, $X_s^i \xrightarrow{F_s^j} f_s$.
5:   For each base category $k$:
6:    Domain labels $d_k$ are obtained by MDAN$_k$: $[f_{t\_k}, f_{s\_k}] \xrightarrow{\mathrm{MDAN}_k} d_k$.
7:    Loss $L_{MDAN}^{kj}$ is calculated.
8:   End for
9:   Loss $L_{adv\_MDAN}$ is calculated by: $L_{adv\_MDAN} = \sum_j \sum_k L_{MDAN}^{kj}$.
10: End for
11: Train MFE$_{DA}$ using $X_t^i$ and calculate the segmentation loss $L_{seg\_base}$.
12: $L_{DA} = L_{seg\_base} + \lambda_1 L_{adv\_MDAN}$.
13: Backpropagate $L_{DA}$, updating MFE$_{DA}$ and MDAN.
14: End for
Incremental learning subtask
Input: Target domain $X^{N_t}$, where $N_t$ is the total number of blocks after gridding the target domain. Frozen domain-adapted MFE$_{DA}$.
Output: MFE$_T$, consisting of the Multi-branch Feature Extractor $F_t$ and the MLP-based classification layer $G_t$.
1: For each block $X_i$ in $X^{N_t}$:
2:  For points with labeled base categories:
3:   Features $f_{t\_b}$ and logits $g_{t\_b}$ are obtained by MFE$_{DA}$: $X_i \xrightarrow{\mathrm{MFE}_{DA}} f_{t\_b}, g_{t\_b}$.
4:   Features $f'_{t\_b}$ and logits $g'_{t\_b}$ are obtained by MFE$_T$: $X_i \xrightarrow{\mathrm{MFE}_T} f'_{t\_b}, g'_{t\_b}$.
5:   $L_{KL}$ is calculated from $g_{t\_b}$ and $g'_{t\_b}$; $L_{feat}$ is calculated from $f_{t\_b}$ and $f'_{t\_b}$.
6:   $L_{KT} = L_{KL} + \lambda_2 L_{feat}$.
7:  End for
8:  For points with labeled novel categories:
9:   Features $f_{t\_n}$ are obtained by MFE$_{DA}$: $X_i \xrightarrow{F_{t\_b}} f_{t\_n}$.
10:   Features $f'_{t\_n}$ and logits $g'_{t\_n}$ are obtained by MFE$_T$: $X_i \xrightarrow{\mathrm{MFE}_T} f'_{t\_n}, g'_{t\_n}$.
11:   $L_{Orth}$ is calculated from $f_{t\_n}$ and $f'_{t\_n}$; the segmentation loss $L_{seg\_novel}$ is calculated.
12:   $L_{FS} = L_{Orth} + \lambda_3 L_{seg\_novel}$.
13: End for
14: $L_{IL} = L_{KT} + \lambda_4 L_{FS}$.
15: Backpropagate $L_{IL}$ and update MFE$_T$.
16: End for

3. Experiments

In this section, we introduce the datasets, the evaluation metrics, and the experimental results.

3.1. Datasets and Evaluation Metrics

Three datasets were used to evaluate the proposed model: the International Society for Photogrammetry and Remote Sensing (ISPRS) benchmark dataset [46,47], the Dublin City Annotated LiDAR Point Cloud dataset [48], and the Dayton Annotated LiDAR Earth Scan (DALES) dataset [49]. All these datasets were acquired via airborne LiDAR systems. Specifically, the DALES and Dublin datasets served as the source domains; the training set of the ISPRS benchmark dataset was used for domain adaptation and incremental learning, while its testing set was employed to evaluate the proposed method. Detailed information on these three datasets is presented in Table 2.
Four evaluation metrics were adopted to assess the segmentation results: F1 score (F1), average F1 (averF1), intersection over union (IoU), and overall accuracy (OA). Specifically, F1, averF1, and OA are used for comparison with other fully supervised deep learning methods, while IoU is employed for comparison with other few-sample learning methods.
A paired t-test was used to examine significant differences between experimental results. In the ablation experiments, multiple trials were conducted, and the p-values of two-tailed tests were reported. If p < 0.05, it indicates that there is a statistically significant difference between the two compared models.

3.2. Experimental Results

3.2.1. Processing Details

1.
Data preprocessing
All three datasets were divided into blocks of 50 × 50 m, and the coordinate values were centralized within each block. Because the centralized coordinate values, intensity, and echo-related parameters have large value ranges, they were preprocessed by the following methods before being input into the network so that the network could learn properly:
  • Coordinate values: Normalized after centralization, after which the value range is 0~1.
  • Intensity: Normalized, after which the value range is 0~1.
  • Echo-related parameters: Existing deep learning methods normalize the return number (RN) and the number of returns (NR) separately [50]. Other studies have pointed out that NR is positively correlated with the complexity of the internal structure [38]. Therefore, by combining RN and NR, a novel normalization method for echo-related parameters was proposed as follows:
$F_e = \begin{cases} 0, & \text{when } NR = 1 \\ \dfrac{RN}{NR}, & \text{otherwise,} \end{cases}$
The range of $F_e$ is [0, 1], where 0 indicates that the ground object cannot be penetrated. The closer the value is to 1, the more complex the internal structure of the object from which the laser beam was reflected.
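A minimal sketch of this echo normalization is given below, assuming RN and NR are stored as per-point integer attributes; the function name is illustrative.

import numpy as np

def echo_feature(return_number, number_of_returns):
    # F_e = 0 when NR == 1 (object not penetrated), otherwise RN / NR, giving values in [0, 1].
    rn = np.asarray(return_number, dtype=np.float64)
    nr = np.asarray(number_of_returns, dtype=np.float64)
    return np.where(nr == 1, 0.0, rn / np.maximum(nr, 1))

# Example: a single-return point vs. the second of three returns inside a tree canopy.
print(echo_feature([1, 2], [1, 3]))   # -> [0.0, 0.666...]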
2.
Settings of pre-training subtask
Both the training set and testing set of the DALES dataset and Dublin dataset were used to pre-train MFE. A voxel size of 0.5 m was selected. The pre-training situation for each branch of MFE was as follows:
(1)
Coordinate branch: Pre-trained using the DALES dataset and Dublin dataset.
(2)
Echo-related branch: Pre-trained using the DALES dataset and Dublin dataset.
(3)
Intensity branch: Pre-trained using the Dublin dataset.
3.
Settings of domain adaptation subtask and incremental learning subtask
The training set of the ISPRS benchmark dataset was used in the domain adaptation subtask and incremental learning subtask. In order to evaluate the impact of different sampling proportions on the model, 10%, 1%, and 0.1% samples of the training set were selected through random sampling in each category, and the number of samples in each category was not less than 1. The number of samples is shown in Table 3.
For the domain adaptation subtask, samples with base categories were used. For the incremental learning subtask, samples with base categories were used in the Knowledge Transfer Module, and samples with novel categories were used in the Feature Separation Module.
After the three subtasks, feature extractor F t and classifier G t were evaluated on the testing set of the ISPRS benchmark dataset.
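The per-category random sampling described above could be implemented as sketched below, assuming integer category labels and the stated requirement of at least one sample per category; the function name and seed are illustrative.

import numpy as np

def sample_per_category(labels, proportion, rng=None):
    # Randomly keep `proportion` of the labeled points of each category, with at least one per category.
    rng = rng or np.random.default_rng(0)
    labels = np.asarray(labels)
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        n = max(1, int(round(proportion * len(idx))))
        keep.append(rng.choice(idx, size=n, replace=False))
    return np.concatenate(keep)

# Example: retain 10% of the labeled points of every category of the ISPRS training set.
# train_idx = sample_per_category(train_labels, 0.10)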
4.
Other settings
In all three subtasks, the Adam optimizer with an initial learning rate of 0.0005 was used, the momentum value was 0.9, and the batch size was 4. The learning rate was iteratively reduced based on the current epoch by a factor of 0.7. In the pre-training subtask, the training epoch was 30. In the domain adaptation subtask and incremental learning subtask, the training epoch was 300.
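The training settings above could be configured as follows; since the text does not state the exact scheduler, an exponential decay with a factor of 0.7 per epoch is used here as an assumption, and the placeholder model stands in for the MFE plus classifier of the current subtask.

import torch

model = torch.nn.Linear(8, 9)   # placeholder standing in for the MFE and classifier of one subtask

# Adam with initial learning rate 0.0005 and beta1 = 0.9 (the momentum term of Adam).
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.7)

for epoch in range(30):   # 30 epochs for pre-training; 300 for domain adaptation and incremental learning
    # ... iterate over blocks with batch size 4, compute the subtask loss, backpropagate, and step the optimizer ...
    scheduler.step()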

3.2.2. Comparison of Experimental Results

The proposed method was evaluated from the following two aspects: (1) comparison with other fully supervised deep learning methods and (2) comparison with other few sample learning methods.
1.
Comparison with other fully supervised deep learning methods
Table 4 shows the semantic segmentation results of the proposed method in comparison with other fully supervised deep learning methods published in recent literature. The thematic map of the segmentation result using 10% training samples and the ground truth are shown in Figure 8, while Figure 9 illustrates the difference between them. Figure 10 displays the zoomed-in map, which shows that some facades were misclassified as roofs, and some confusion also occurred between low_veg and fences, as well as between trees and roofs. These errors can be attributed to the close similarity in geometric and internal structural characteristics between these classes, making it difficult to learn distinctive features for their differentiation.
From Table 4, it can be seen that the proposed method achieved the highest OA and averF1, and achieved the highest F1 score in five out of nine categories. Among the six base categories, the segmentation performance was highly competitive, except for fence. This may have been due to the low point density of the ISPRS benchmark dataset, which resulted in incomplete morphology of the fence, and the proposed MFE relied on local features, leading to an inferior result for this category. Notably, the proposed method achieved the best performance in three novel categories (low_veg, fac, and shrub). This indicates that the method can effectively transfer knowledge learned from the source domain to the target domain, thereby enhancing the model’s generalization ability.
Meanwhile, the impact of various proportions of samples on semantic segmentation was compared. When using 1% of the training samples, the model still achieved competitive results compared to other fully supervised deep learning methods. This was mainly due to the strong generalization ability acquired from the source domain, which allowed the model to avoid overfitting and thereby maintain its performance with fewer target training samples. This greatly reduced the cost of manual annotation. When using 0.1% of the training samples, each category had from only a few up to fewer than 200 training samples, and the results deteriorated. Nevertheless, the method still achieved favorable performance for major ground objects such as imp_surf, tree, and roof, demonstrating its practical value in large-scale applications. By comparing the results of different sampling proportions, it can be seen that, compared to using 100% of the training samples, using 10% of the training samples maintained competitive classification accuracy while reducing manual annotation by 90%.
2.
Comparison with other few sample learning methods
To evaluate the proposed method for few sample learning, it was compared with Thr-MPRNet and Dai’s method [32]. In order to obtain more objective comparisons, the same dataset, SensatUrban [58], and sampling method as Dai [32] were adopted. A brief overview of the sampling method in Ref. [32] is as follows: (1) The training set is divided into grids with overlapping regions. (2) The class of each “representative block” is defined based on the proportion of points within each grid. If the number of points belonging to the selected category within a block exceeds a certain threshold, then the block is considered the “representative block” of this category. (3) In each “representative block,” N points are sampled and their annotation information is retained. A more detailed description of the sampling method can be found in the original paper.
Ground, vegetation, and building were taken as base categories. As the SensatUrban dataset contains coordinate values XYZ and color information RGB, in the pre-training subtask, only the coordinate branch was set up and pre-trained using coordinate values XYZ, while RGB information was abandoned. The sampled training set of the ISPRS benchmark dataset was used to train the intensity branch and echo-related branch in the domain adaptation subtask. The detailed settings are shown in Table 5.
As shown in Table 6, Thr-MPRNet and Dai’s methods achieved inferior results, which may have been due to (1) difficulty in transferring the pre-trained model to the target domain due to the difference in parameter type between the source and target domains. Simply embedding the parameters through the MLP-based embedding layer can lead to poor transferability of the learned features. (2) There was a significant difference in point density between the source and target domains, which may have resulted in significant differences in local features. This is not conducive to the transfer of features. The proposed Multi-branch Feature Extractor and transfer learning strategy seem likely to solve these two problems effectively.

4. Ablation Experiments and Discussions

The experimental results show the effectiveness of the proposed method. Ablation experiments were conducted to further study the impact of each component of the proposed method. We will separately discuss the proposed NEB, multi-branch architecture, and transfer learning strategy.

4.1. Discussions of Neighborhood Embedding Block

Experiments were performed to assess the effectiveness of NEB and compare the impact of different K-nearest-neighbor sizes on model performance. To eliminate interference from the multi-branch architecture, only the coordinate branch was adopted. The experimental results are presented in Table 7, where “Base” denotes the original Point Transformer utilizing an MLP-based embedding block. For each K-nearest-neighbor setting, the experiment was repeated five times, with the averages and ranges of F1 scores, OA, and averF1 reported, along with p-values for OA and averF1 in comparison to the “Base”. As shown in the table, incorporating neighborhood information into NEB can effectively enhance model performance with increasing K values. However, when K exceeds 16, the model performance starts to decline. This phenomenon may be attributed to the fact that a larger K might encode features from other categories, leading to inconsistent embedded features among points of the same category and thus hindering subsequent feature learning and overall model performance. When K < 64, both p-values for OA and averF1 are less than 0.05, indicating statistically significant improvements. In contrast, when K = 64, the p-value for averF1 exceeds 0.05, suggesting that the improvement is not statistically significant.

4.2. Discussions of Multi-Branch Architecture

As mentioned before, the two purposes of multi-branch architecture in the feature extractor are (a) to learn more representative features and (b) to deal with variable parameters contained in a point cloud dataset. In this subsection, an ablation experiment was conducted to show the effectiveness of the first purpose. In the ablation experiment, the descriptions of the two models used are shown in Table 8. To avoid the impact of NEB, an MLP-based embedding layer was used for feature embedding. The flowchart of the model is shown in Figure 11.
The experimental results are presented in Table 9. Each experiment was conducted five times, with the averages and ranges of the F1 scores, OA, and averF1 reported, along with p-values for OA and averF1 comparing M1 and M2. Both p-values (for OA and averF1) were less than 0.05, indicating a statistically significant difference between M1 and M2. As observed, the multi-branch architecture significantly improved performance across all nine categories. Among them, the segmentation accuracy of powerlines and fences showed the most substantial increase, which can be attributed to the separate echo-related branch effectively extracting their internal structural features. There was also a noticeable improvement in low_veg segmentation, possibly due to the intensity branch's capability to extract distinguishable features. For buildings, the enhanced segmentation accuracy may stem from the unique geometric features extracted by the coordinate branch. Combined with the t-SNE analysis in Section 2.2.2 and the experimental results, the multi-branch architecture is confirmed to be superior to the single-branch counterpart. This superiority may be attributed to the fact that different parameters correspond to distinct physical characteristics, enabling the multi-branch architecture to more effectively learn features that distinguish various ground objects.
To evaluate the time complexity of the proposed architecture, M1 and M2 were executed on the testing set of the ISPRS benchmark dataset five times, and the running times were recorded separately, as shown in Table 10. Although the multi-branch architecture nearly doubles the average running time compared to the single-branch architecture, it remains within an acceptable range. More importantly, it effectively resolves the issue of parameter differences between the source and target domains and thereby greatly improves the performance of model transfer.

4.3. Discussion of Domain Adaptation Subtask and Incremental Learning Subtask

To demonstrate the effectiveness of the domain adaptation subtask and the incremental learning subtask, a comparative experiment and an ablation experiment were conducted. Ten percent of the samples of the target training set were used for domain adaptation and incremental learning. The trained model was evaluated on the target testing set in terms of segmentation performance.
(1) Discussion of Domain Adaptation Subtask
To verify the effectiveness of the proposed Multi-class Domain Adversarial Network (MDAN), a comparative experiment was conducted in which the fine-tuning (FT) method was used as the baseline and MADA as the comparison. In fine-tuning, the pre-trained Multi-branch Feature Extractor was frozen and only the MLP-based classifier was trained in the target subdomain $D_{t\_b}$. For MADA, since the category probability is needed to determine which domain discriminator a sample is input into, the classifier $G_s$ pre-trained in the pre-training subtask was connected to $F_{t\_b}$ as $G'_s$ to obtain the category probabilities, and $G'_s$ was updated by the semantic segmentation loss.
The experimental results are shown in Table 11. Since the domain adaptation subtask was applied to $C_b$, only power, imp_surf (ground), car, fence, roof (building), and tree (vegetation) were considered. The experiments were conducted five times, and the averages and ranges of the F1 scores and OA were reported, along with p-values (OA) compared to MDAN. The p-values (OA) of FT and MADA were both less than 0.05, indicating significant differences. From the table, it can be seen that MDAN transferred the learned knowledge more effectively than MADA, which is likely attributable to the MDAN strategy of solely using points of the same category for domain adaptation, avoiding negative transfer. Specifically, for the categories with fewer samples, namely power, fence, and car, the improvement over FT was more pronounced. This may be because the proposed method can avoid overfitting and thus improve the segmentation performance.
(2) Discussion of Incremental Learning Subtask
As for the incremental learning subtask, ablation experiments on the KTM and FSM were carried out separately. To evaluate the KTM, an MLP-based classifier was fine-tuned using the labeled samples from $C_n$ to obtain the ability to segment novel categories. The experimental results are shown in Table 12. "Base" represents the segmentation results of $C_b$ after the domain adaptation subtask. The experiments were conducted five times, and the averages and ranges of the F1 scores and OA were reported, along with the p-values (OA) compared to the proposed KTM + FSM. The p-values (OA) of the compared settings were both less than 0.05, indicating significant differences. It can be seen that when only the FSM was used, the model performed better in segmenting the novel categories $C_n$ than fine-tuning, but it fell into catastrophic forgetting, resulting in poor performance for the base categories $C_b$. The combination of the proposed KTM and FSM improved the segmentation performance of the novel categories while avoiding catastrophic forgetting of the knowledge learned from the base categories.
In order to better evaluate the proposed FSM, experiments were conducted five times to compare it with the Semantic Adversarial Learning Module (SALM) proposed by Dai [32]. The averages and ranges of the F1 scores and OA were reported, along with the p-value (OA) comparing SALM and FSM. The p-value (OA) was less than 0.05, indicating a significant difference. The experimental results in Table 13 show that the proposed FSM achieved better results for the novel categories and was more stable.

4.4. Analysis of the Impact of Random Sampling

As mentioned in Section 3.2, few labeled samples were selected via random sampling. Experiments were conducted to evaluate the impact of this process. For proportions of 10%, 1%, and 0.1%, five random samplings were performed for each, with OA and averF1 recorded. The experimental results are presented in Table 14. In general, both OA and averF1 exhibited stability, despite minor fluctuations in averF1. This indicates that random sampling had little impact on segmentation performance, thereby confirming the feasibility of this sampling method.

5. Conclusions

In this study, a novel Multi-branch Feature Extractor and transfer learning strategy were proposed, aiming to improve the performance of the model and reduce the amount of annotated data. MFE is based on a multi-branch architecture, which is able to extract more representative features and deal with inconsistent parameter types between the source domain and target domain. A three-stage transfer learning strategy was proposed to cope with feature transfer and catastrophic forgetting: (1) The pre-training subtask effectively utilizes multi-source ALS data to obtain a pre-trained feature extractor. The source domain consists of datasets collected by different LiDAR systems; therefore, there are data discrepancies in terms of the parameters recorded, point density, topography, and landforms. MFE plays a core role in this stage to learn features. (2) In the domain adaptation subtask, MDAN aligns the distributions of the source and target domains for each category separately through multiple domain discriminators, avoiding negative transfer caused by the unique categories of the source domain and the novel categories of the target domain. (3) The incremental learning subtask, consisting of the KTM and FSM, was designed to learn the knowledge of novel categories while avoiding catastrophic forgetting of the base categories. The experiments conducted on the ISPRS benchmark dataset validated the effectiveness of the proposed method. With the proposed method, using only 10% of the labeled training samples can achieve segmentation results comparable to those of other fully supervised methods. Meanwhile, the proposed method is also superior to other few-sample learning methods. By fully utilizing existing annotated ALS data, the need for manual annotation is greatly reduced when facing a new dataset.
Although the proposed method reduces the need for manual annotation, it does so at the cost of increased computational complexity, especially in the domain adaptation subtask. Existing few sample learning methods require retraining for every new dataset; the proposed method avoids this by separating the pre-training subtask from the domain adaptation subtask. However, the domain adaptation subtask still needs to extract features from the source domain in order to align feature distributions, which increases the computational load, and reducing this cost remains a problem for further research. In addition, airborne LiDAR data are difficult to collect compared with spectral images in Earth observation, so exploring cross-modal transfer from spectral images to LiDAR point clouds is a meaningful future research direction for the authors. Meanwhile, MFE has demonstrated its value in semantic segmentation, and extending it to other 3D data processing tasks (e.g., object detection) is another potential research direction.

Author Contributions

Conceptualization, J.Y. and H.M.; data curation, J.Y. and J.D.; funding acquisition, L.Z., K.L. and J.D.; investigation, J.D. and W.L.; methodology, J.Y.; project administration, H.M.; software, L.Z., W.L., K.L. and Z.C.; supervision, H.M.; validation, J.Y., K.L. and L.Z.; visualization, J.Y., K.L. and L.Z.; writing—original draft, J.Y.; writing—review and editing, H.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Tianjin Key Laboratory of Rail Transit Navigation Positioning and Spatio-temporal Big Data Technology, grant number TKL2023B13, and Natural Science Foundation of Hubei Province, grant number 2025AFB833.

Data Availability Statement

The data presented in this study is available in the DALES dataset at “https://udayton.edu/engineering/research/centers/vision_lab/research/was_data_analysis_and_processing/dale.php (accessed on 30 May 2025)”, the Dublin dataset at “https://v-sense.scss.tcd.ie/DublinCity (accessed on 30 May 2025)”, and the ISPRS benchmark dataset at “https://isprs.org/resources/datasets/benchmarks/UrbanSemLab/3d-semantic-labeling.aspx (accessed on 30 May 2025)”.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liu, X.; Zhang, Z.; Peterson, J.; Chandra, S. Large area DEM generation using airborne LiDAR data and quality control. In Accuracy in Geomatics: Proceedings of the 8th International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences, Shanghai, China, 25–27 June 2008; World Academic Union: Liverpool, UK, 2008; Volume 2. [Google Scholar]
  2. Murakami, H.; Nakagawa, K.; Hasegawa, H.; Shibata, T.; Iwanami, E. Change detection of building using an airborne laser scanner. ISPRS J. Photogramm. Remote Sens. 1999, 54, 148–152. [Google Scholar] [CrossRef]
  3. Ivanovs, J.; Lazdins, A.; Lang, M. The influence of forest tree species composition on the forest height predicted from airborne laser scanning data—A case study in Latvia. Balt. For. 2023, 29, id663. [Google Scholar] [CrossRef]
  4. Matikainen, L.; Lehtomäki, M.; Ahokas, E.; Hyyppä, J.; Karjalainen, M.; Jaakkola, A.; Kukko, A.; Heinonen, T. Remote sensing methods for power line corridor surveys. ISPRS J. Photogramm. Remote Sens. 2016, 119, 10–31. [Google Scholar] [CrossRef]
  5. Guo, Y.; Wang, H.; Hu, Q.; Liu, H.; Liu, L.; Bennamoun, M. Deep learning for 3d point clouds: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 4338–4364. [Google Scholar] [CrossRef] [PubMed]
  6. An, Z.; Sun, G.; Liu, Y.; Liu, F.; Wu, Z.; Wang, D.; Van Gool, L.; Belongie, S. Rethinking few-shot 3d point cloud semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 3996–4006. [Google Scholar]
  7. Sarker, S.; Sarker, P.; Stone, G.; Gorman, R.; Tavakkoli, A.; Bebis, G.; Sattarvand, J. A comprehensive overview of deep learning techniques for 3D point cloud classification and semantic segmentation. Mach. Vis. Appl. 2024, 35, 67. [Google Scholar] [CrossRef]
  8. Huang, R.; Xu, Y.; Stilla, U. GraNet: Global relation-aware attentional network for semantic segmentation of ALS point cloud. ISPRS J. Photogramm. Remote Sens. 2021, 177, 1–20. [Google Scholar] [CrossRef]
  9. Liang, Z.; Lai, X. Multilevel geometric feature embedding in transformer network for ALS point cloud semantic segmentation. Remote Sens. 2024, 16, 3386. [Google Scholar] [CrossRef]
  10. Liu, T.; Wei, B.; Hao, J.; Li, Z.; Ye, F.; Wang, L. A multi-point focus transformer approach for large-scale ALS point cloud ground filtering. Int. J. Remote Sens. 2025, 46, 979–999. [Google Scholar] [CrossRef]
  11. Zhao, J.; Zhou, H. Cross-layer Features Fusion Network with Attention MLP for ALS Point Cloud Segmentation. In Proceedings of the 5th International Conference on Computer Vision, Image and Deep Learning (CVIDL), Zhuhai, China, 19–21 April 2024; IEEE: New York, NY, USA, 2024; pp. 1110–1114. [Google Scholar]
  12. Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. Randla-net: Efficient semantic segmentation of large-scale point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11108–11117. [Google Scholar]
  13. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  14. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process Syst. 2017, 30, 5105–5114. [Google Scholar]
  15. Phan, A.V.; Le Nguyen, M.; Nguyen, Y.L.H.; Bui, L.T. Dgcnn: A convolutional neural network over large-scale labeled graphs. Neural Netw. 2018, 108, 533–543. [Google Scholar] [CrossRef]
  16. Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499. [Google Scholar]
  17. Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S.E.; Bronstein, M.M.; Solomon, J.M. Dynamic graph cnn for learning on point cloud. ACM Trans. Graph. (TOG) 2019, 38, 1–12. [Google Scholar] [CrossRef]
  18. Zhao, H.; Jiang, L.; Jia, J.; Torr, P.H.; Koltun, V. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 16259–16268. [Google Scholar]
  19. Park, C.; Jeong, Y.; Cho, M.; Park, J. Fast point transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16949–16958. [Google Scholar]
  20. Wang, P.S. Octformer: Octree-based transformers for 3d point clouds. ACM Trans. Graph. (TOG) 2023, 42, 1–11. [Google Scholar] [CrossRef]
  21. Guo, M.H.; Cai, J.X.; Liu, Z.N.; Mu, T.J.; Martin, R.R.; Hu, S.M. Pct: Point cloud transformer. Comput. Vis. Media 2021, 7, 187–199. [Google Scholar] [CrossRef]
  22. Kheddar, H.; Himeur, Y.; Al-Maadeed, S.; Amira, A.; Bensaali, F. Deep transfer learning for automatic speech recognition: Towards better generalization. Knowl.-Based Syst. 2023, 277, 110851. [Google Scholar] [CrossRef]
  23. Bozinovski, S. Reminder of the first paper on transfer learning in neural networks, 1976. Informatica 2020, 44, 291. [Google Scholar] [CrossRef]
  24. Iman, M.; Arabnia, H.R.; Rasheed, K. A review of deep transfer learning and recent advancements. Technologies 2023, 11, 40. [Google Scholar] [CrossRef]
  25. Niu, S.; Liu, Y.; Wang, J.; Song, H. A decade survey of transfer learning (2010–2020). IEEE Trans. Artif. Intell. 2021, 1, 151–166. [Google Scholar] [CrossRef]
  26. Ma, H.; Cai, Z.; Zhang, L. Comparison of the filtering models for airborne LiDAR data by three classifiers with exploration on model transfer. J. Appl. Remote Sens. 2018, 12, 016021. [Google Scholar] [CrossRef]
  27. Ahmed, S.F.; Alam, M.S.B.; Hassan, M.; Rozbu, M.R.; Ishtiak, T.; Rafa, N.; Mofijur, M.; Shawkat Ali, A.; Gandomi, A. Deep learning modelling techniques: Current progress, applications, advantages, and challenges. Artif. Intell. Rev. 2023, 56, 13521–13617. [Google Scholar] [CrossRef]
  28. Peng, S.; Xi, X.; Wang, C.; Xie, R.; Wang, P.; Tan, H. Point-based multilevel domain adaptation for point cloud segmentation. IEEE Geosci. Remote Sens. Lett. 2020, 19, 1–5. [Google Scholar] [CrossRef]
  29. Xie, Y.; Schindler, K.; Tian, J.; Zhu, X.X. Exploring cross-city semantic segmentation of ALS point clouds. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2021, 43, 247–254. [Google Scholar] [CrossRef]
  30. Zhao, C.; Yu, D.; Xu, J.; Zhang, B.; Li, D. Airborne LiDAR point cloud classification based on transfer learning. In Proceedings of the Eleventh International Conference on Digital Image Processing (ICDIP 2019), Guangzhou, China, 10–13 May 2019; SPIE: Bellingham, WA, USA, 2019; Volume 11179, pp. 550–556. [Google Scholar]
  31. Dai, M.; Xing, S.; Xu, Q.; Li, P.; Pan, J.; Zhang, G.; Wang, H. Multiprototype Relational Network for Few-Shot ALS Point Cloud Semantic Segmentation by Transferring Knowledge From Photogrammetric Point Clouds. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–17. [Google Scholar] [CrossRef]
  32. Dai, M.; Xing, S.; Xu, Q.; Li, P.; Pan, J.; Wang, H. Cross-Domain Incremental Feature Learning for ALS Point Cloud Semantic Segmentation with Few Samples. IEEE Trans. Geosci. Remote Sens. 2024, 63, 1–14. [Google Scholar] [CrossRef]
  33. Long, M.; Cao, Z.; Wang, J.; Jordan, M.I. Conditional Adversarial Domain Adaptation. In Proceedings of the Computer Vision—ECCV 2018—15th European Conference, Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018. [Google Scholar] [CrossRef]
  34. Tzeng, E.; Hoffman, J.; Saenko, K.; Darrell, T. Adversarial Discriminative Domain Adaptation. arXiv 2017. [Google Scholar] [CrossRef]
  35. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526. [Google Scholar] [CrossRef]
  36. Neyshabur, B.; Sedghi, H.; Zhang, C. What is being transferred in transfer learning? Adv. Neural Inf. Process. Syst. 2020, 33, 512–523. [Google Scholar]
  37. Ni, H.; Lin, X.; Zhang, J. Classification of ALS point cloud with improved point cloud segmentation and random forests. Remote Sens. 2017, 9, 288. [Google Scholar] [CrossRef]
  38. Reymann, C.; Lacroix, S. Improving LiDAR point cloud classification using intensities and multiple echoes. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 28 September–3 October 2015; IEEE: New York, NY, USA, 2015; pp. 5122–5128. [Google Scholar]
  39. Stanislas, L.; Nubert, J.; Dugas, D.; Nitsch, J.; Sünderhauf, N.; Siegwart, R.; Cadena, C.; Peynot, T. Airborne particle classification in lidar point clouds using deep learning. In Field and Service Robotics: Results of the 12th International Conference; Springer: Singapore, 2021; pp. 395–410. [Google Scholar]
  40. Zhang, K.; Ye, L.; Xiao, W.; Sheng, Y.; Zhang, S.; Tao, X.; Zhou, Y. A dual attention neural network for airborne LiDAR point cloud semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–17. [Google Scholar] [CrossRef]
  41. Mahajan, D.; Girshick, R.; Ramanathan, V.; He, K.; Paluri, M.; Li, Y.; Bharambe, A.; Van Der Maaten, L. Exploring the limits of weakly supervised pretraining. In Proceedings of the European conference on computer vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 181–196. [Google Scholar]
  42. Panigrahi, S.; Nanda, A.; Swarnkar, T. A survey on transfer learning. In Intelligent and Cloud Computing: Proceedings of ICICC 2019, Bhubaneswar, India, 20 December 2019; Springer: Singapore, 2020; Volume 1, pp. 781–789. [Google Scholar]
  43. Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; March, M.; Lempitsky, V. Domain-Adversarial Training of Neural Networks. J. Mach. Learn. Res. 2016, 17, 1–35. [Google Scholar] [CrossRef]
  44. Pei, Z.; Cao, Z.; Long, M.; Wang, J. Multi-Adversarial Domain Adaptation. In Proceedings of the AAAI conference on artificial intelligence, New Orleans, LA, USA, 2–7 February 2018; AAAI Press: Palo Alto, CA, USA, 2018. [Google Scholar] [CrossRef]
  45. Yu, Z.; Wang, K.; Xie, S.; Zhong, Y.; Lv, Z. Prototypical Network Based on Manhattan Distance. CMES-Comput. Model. Eng. Sci. 2022, 131, 655. [Google Scholar] [CrossRef]
  46. Cramer, M. The dgpf-test on digital airborne camera evaluation–overview and test design. Photogramm.-Fernerkund.-Geoinf. 2010, 73–82. [Google Scholar] [CrossRef]
  47. Rottensteiner, F.; Sohn, G.; Jung, J.; Gerke, M.; Baillard, C.; Benitez, S.; Breitkopf, U. The isprs benchmark on urban object classification and 3D building reconstruction. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2012, 1–3, 293–298. [Google Scholar] [CrossRef]
  48. Iman Zolanvari, S.M.; Ruano, S.; Rana, A.; Cummins, A.; da Silva, R.E.; Rahbar, M.; Smolic, A. Dublin City: Annotated LiDAR Point Cloud and its Applications. arXiv 2019. [Google Scholar] [CrossRef]
  49. Varney, N.; Asari, V.K.; Graehling, Q. DALES: A Large-scale Aerial LiDAR Data Set for Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; IEEE: New York, NY, USA, 2020. [Google Scholar] [CrossRef]
  50. Wang, J.; Li, H.; Xu, Z.; Xie, X. Semantic segmentation of urban airborne LiDAR point clouds based on fusion attention mechanism and multi-scale features. Remote Sens. 2023, 15, 5248. [Google Scholar] [CrossRef]
  51. Thomas, H.; Qi, C.R.; Deschaud, J.E.; Marcotegui, B.; Goulette, F.; Guibas, L.J. KPConv: Flexible and Deformable Convolution for Point Clouds. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: New York, NY, USA, 2020. [Google Scholar] [CrossRef]
  52. Li, X.; Wang, L.; Wang, M.; Wen, C.; Fang, Y. Dance-net: Density-aware convolution networks with context encoding for airborne lidar point cloud classification. ISPRS J. Photogramm. Remote Sens. 2020, 166, 128–139. [Google Scholar] [CrossRef]
  53. Li, W.; Wang, F.-D.; Xia, G.-S. A geometry-attentional network for ALS point cloud classification. ISPRS J. Photogramm. Remote Sens. 2020, 164, 26–40. [Google Scholar] [CrossRef]
  54. Mao, Y.; Chen, K.; Diao, W.; Sun, X.; Lu, X.; Fu, K.; Weinmann, M. Beyond single receptive field: A receptive field fusion-and-stratification network for airborne laser scanning point cloud classification. ISPRS J. Photogramm. Remote Sens. 2022, 188, 45–61. [Google Scholar] [CrossRef]
  55. Chen, Y.; Xing, Y.; Li, X.; Gao, W. DGCN-ED: Dynamic graph convolutional networks with encoder–decoder structure and its application for airborne LiDAR point classification. Int. J. Remote Sens. 2023, 44, 3489–3506. [Google Scholar] [CrossRef]
  56. Mao, Y.Q.; Bi, H.; Li, X.; Chen, K.; Wang, Z.; Sun, X.; Fu, K. Twin deformable point convolutions for airborne laser scanning point cloud classification. ISPRS J. Photogramm. Remote Sens. 2025, 221, 78–91. [Google Scholar] [CrossRef]
  57. Wang, Z.; Chen, H.; Liu, J.; Qin, J.; Sheng, Y.; Yang, L. Multilevel intuitive attention neural network for airborne LiDAR point cloud semantic segmentation. Int. J. Appl. Earth Obs. Geoinf. 2024, 132, 104020. [Google Scholar] [CrossRef]
  58. Hu, Q.; Yang, B.; Khalid, S.; Xiao, W.; Trigoni, N.; Markham, A. Towards Semantic Segmentation of Urban-Scale 3D Point Clouds: A Dataset, Benchmarks and Challenges (CVPR’2021). In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
Figure 1. The flowchart of the proposed model. D represents the domain, F represents the proposed MFE, G represents the multilayer perceptron (MLP)-based classifier, C_b represents the base categories, and C_n represents the novel categories. Subscripts s, t, and t_b represent the source domain, the target domain, and the base categories from the target domain, respectively.
Figure 2. Flowchart of the Multi-branch Feature Extractor, where N is the number of points, B is the number of branches, K is the number of neighborhoods, and D_in is the dimension of the input parameter.
Figure 3. Illustration of NEB, where N is the number of points, K is the number of neighborhoods, D_in is the dimension of the input, and D_out is the dimension of the embedded features.
Figure 4. Visualization of features learned from different parameter combinations using t-SNE.
Figure 5. Pre-training subtask based on the multi-branch architecture. N is the number of points.
Figure 6. The flowchart of domain adaptation through the Multi-class Domain Adversarial Network (MDAN).
Figure 7. Description of the proposed MDAN.
Figure 8. Segmentation results on the ISPRS benchmark dataset. (a) Results of our method. (b) Ground truth.
Figure 9. Comparison between the results of our method and the ground truth.
Figure 10. Zoomed-in thematic map of the ISPRS benchmark dataset. (Left) Results of our method. (Right) Ground truth.
Figure 11. Flowchart of the models used in the ablation experiment of the multi-branch architecture.
Table 1. Summarization of point cloud semantic segmentation methods.
Methods | Brief Descriptions | Advantages and Limitations
MLP-based methods [12,13,14] | A shared multi-layer perceptron (MLP) is adopted as the core building block, and features are aggregated by symmetric functions. | Simple and efficient, but underperforms when the point cloud contains complex local structures.
Convolution-based methods [15,16,17] | Inspired by the success of convolution kernels in image segmentation, convolutions are adopted for point clouds to extract local geometries. | Mature deep learning models can be adopted, but the point cloud must first be converted into a 2D/3D regular structure by gridding/voxelization, and information loss is unavoidable during this process. Moreover, the input order of the point cloud may cause differences in the segmentation results.
Attention-based methods [18,19,20,21] | Long-range dependencies are modeled by attention mechanisms, which are suitable for handling point cloud irregularities. | Attention-based methods have achieved superior results, but they have a high demand for computing resources, and a large amount of high-quality annotated data is required to ensure high performance.
Table 2. Detailed information of the three datasets.
Domain | Target Domain | Source Domain | Source Domain
Dataset | ISPRS benchmark dataset | DALES dataset | Dublin dataset
Density | Above 4 pts/m² | Above 50 pts/m² | >150 pts/m²
Parameter type | XYZ values, intensity, echo-related parameters | XYZ values, echo-related parameters | XYZ values, intensity, echo-related parameters
Categories | Powerlines, low_veg, imp_sur (ground), cars, fence/hedge (fences), roof (building), fac, shrub, tree (vegetation) | Ground, vegetation, cars, trucks, powerlines, poles, fences, building | Building, ground, and vegetation
Base categories | Ground, vegetation, car, building, fence, powerlines
Novel categories | low_veg, fac, shrub | - | -
Table 3. Different sampling proportions in the training set of the ISPRS benchmark dataset.
Proportion | Power | Imp_Surf | Car | Fence | Roof | Tree | Low_Veg | Fac | Shrub
100% | 546 | 193,723 | 4614 | 12,070 | 152,045 | 135,173 | 180,580 | 27,250 | 47,605
10% | 55 | 19,383 | 462 | 1207 | 15,205 | 13,518 | 18,058 | 2725 | 4761
1% | 6 | 1939 | 47 | 121 | 1521 | 1352 | 1806 | 273 | 477
0.1% | 1 | 194 | 5 | 13 | 153 | 136 | 181 | 28 | 48
(Power, Imp_Surf, Car, Fence, Roof, and Tree are base categories; Low_Veg, Fac, and Shrub are novel categories.)
Table 4. Comparison of segmentation results using the ISPRS benchmark dataset with other fully supervised deep learning methods. The segmentation accuracy of each category was evaluated by F1 and is expressed in percentage.
Method | Power | Low_Veg | Imp_Surf | Car | Fence | Roof | Fac | Shrub | Tree | OA | averF1
PointNet++ (2017) [14] | 57.9 | 79.6 | 90.6 | 66.1 | 31.5 | 91.6 | 54.3 | 41.6 | 77.0 | 81.2 | 65.6
KPConv (2019) [51] | 63.1 | 82.3 | 91.4 | 72.5 | 25.2 | 94.4 | 60.3 | 44.9 | 81.2 | 83.7 | 68.4
DANCE-NET (2020) [52] | 68.4 | 81.6 | 92.8 | 77.2 | 38.6 | 93.9 | 60.2 | 47.2 | 81.4 | 83.9 | 71.2
GANet (2020) [53] | 75.4 | 82.0 | 91.6 | 77.8 | 44.2 | 94.4 | 61.5 | 49.6 | 82.6 | 84.5 | 73.2
GraNet (2021) [18] | 67.7 | 82.7 | 91.7 | 80.9 | 51.1 | 94.5 | 62.0 | 49.9 | 82.0 | 84.5 | 73.6
RFFS-NET (2022) [54] | 75.5 | 80.0 | 90.5 | 78.5 | 45.5 | 92.7 | 57.9 | 48.3 | 75.7 | 82.1 | 71.6
DGCN-ED (2023) [55] | 72.6 | 82.9 | 92.3 | 75.9 | 41.8 | 92.1 | 60.9 | 43.4 | 79.5 | 83.5 | 71.3
TDConvs (2024) [56] | 67.0 | 82.4 | 91.6 | 84.7 | 48.7 | 94.2 | 63.3 | 46.9 | 81.7 | 84.5 | 73.4
MIA-Net (2024) [57] | 65.8 | 79.5 | 89.7 | 71.1 | 26.2 | 94.0 | 63.5 | 48.1 | 82.8 | 83.3 | 71.8
Ours (10%) | 72.9 | 83.1 | 91.9 | 83.4 | 42.1 | 95.4 | 63.9 | 51.7 | 82.8 | 85.5 | 74.0
Ours (1%) | 65.2 | 81.6 | 90.3 | 66.1 | 31.6 | 94.1 | 60.1 | 43.5 | 82.6 | 83.7 | 68.4
Ours (0.1%) | 59.6 | 80.2 | 89.7 | 62.9 | 26.1 | 92.5 | 58.3 | 36.1 | 80.7 | 81.8 | 65.1
Ours (100%) | 76.3 | 83.4 | 91.8 | 83.2 | 37.6 | 95.5 | 68.4 | 52.8 | 83.9 | 85.7 | 74.7
Table 5. Experimental settings for comparison with other few sample learning methods.
Domain | Target Domain | Source Domain
Dataset | ISPRS benchmark dataset | SensatUrban
Parameter type | XYZ values, intensity, echo-related parameters | XYZ values, RGB
Number of categories | 9 | 12
Base categories | Ground, vegetation, building
Novel categories | Power, low_veg, car, fence, fac, shrub
Settings of the Multi-branch Feature Extractor:
(1) Components: coordinate branch, intensity branch, and echo-related branch.
(2) Pre-training subtask: coordinate branch pre-trained using the XYZ values of the SensatUrban dataset.
(3) Domain adaptation subtask: coordinate branch adapted using the sampled training set with base categories; intensity branch and echo-related branch trained in a supervised manner using the sampled training set with base categories.
(4) Incremental learning subtask: coordinate branch, intensity branch, and echo-related branch trained incrementally using the sampled training set with base and novel categories.
Table 6. Comparison of segmentation results with other few sample learning methods. The segmentation accuracy of each category was evaluated by IoU.
Method | Power | Low_Veg | Imp_Surf | Car | Fence | Roof | Fac | Shrub | Tree | OA | mIoU
Thr-MPRNet (2024) [37] | 0.0 | 60.58 | 70.6 | 8.6 | 1.7 | 57.8 | 12.08 | 17.4 | 44.7 | 44.1 | 32.4
Dai (2025) [32] | 0.0 | 43.9 | 84.0 | 12.0 | 1.5 | 47.2 | 15.6 | 18.2 | 59.6 | 48.1 | 37.3
Ours | 51.8 | 70.3 | 84.6 | 58.8 | 27.7 | 89.1 | 44.3 | 28.0 | 67.9 | 84.2 | 58.1
Table 7. Effect of different K-nearest neighboring on the NEB measured by OA, averF1, and p-values.
Method | OA | p-Value (OA) | averF1 | p-Value (averF1)
Base | 80.1 (±0.18) | - | 62.9 (±0.74) | -
NEB (K = 4) | 80.9 (±0.26) | 0.009 | 65.6 (±0.72) | <0.0001
NEB (K = 8) | 81.2 (±0.24) | 0.0005 | 66.7 (±0.83) | 0.001
NEB (K = 16) | 81.8 (±0.28) | 0.0008 | 67.1 (±0.76) | 0.0006
NEB (K = 32) | 81.7 (±0.21) | 0.0006 | 66.53 (±0.77) | 0.005
NEB (K = 64) | 80.5 (±0.22) | 0.04 | 63.2 (±0.72) | 0.49
Table 8. Models with different architecture.
Model | Description
M1 | Model consisting of an MLP-based embedding layer, Point Transformer Block (PTB), and MLP-based classifier. The input data is a 5-dimensional vector that includes 3 coordinate values, intensity, and an echo-related parameter.
M2 | Model consisting of a three-branch feature extractor that includes a coordinate branch, an intensity branch, and an echo-related branch. Each branch consists of an MLP-based embedding layer and PTB. Features from each branch are concatenated and then input into an MLP-based classifier to obtain segmentation results.
Table 9. Ablation experiment of multi-branch architecture and single-branch architecture. The segmentation effectiveness of each category was evaluated by F1 score and is expressed in percentage.
Method | Power | Low_Veg | Imp_Surf | Car | Fence | Roof | Fac | Shrub | Tree | OA | averF1 | p-Value (OA) | p-Value (averF1)
M1 | 50.2 (±5.41) | 80.4 (±0.28) | 91.0 (±0.25) | 67.4 (±1.94) | 21.6 (±5.15) | 92.8 (±0.25) | 56.7 (±0.48) | 46.7 (±0.43) | 80.5 (±0.28) | 82.4 (±0.28) | 65.3 (±0.82) | - | -
M2 | 67.1 (±4.82) | 82.3 (±0.34) | 91.2 (±0.23) | 70.5 (±2.17) | 33.6 (±3.24) | 94.5 (±0.21) | 60.7 (±0.52) | 49.3 (±0.38) | 80.8 (±0.31) | 83.7 (±0.26) | 69.8 (±0.78) | 0.0008 | 0.0007
Table 10. Time complexity measured by running time of M1 and M2 on the testing set of the ISPRS dataset.
Method | Maximum (Second) | Minimum (Second) | Average (Second)
M1 | 28.7 | 31.9 | 30.7
M2 | 55.3 | 61.4 | 58.6
Table 11. Comparison of different domain adaptation methods. The segmentation results of each category were evaluated by F1 score and are expressed in percentage.
Method | Power | Imp_Surf | Car | Fence | Roof | Tree | OA | p-Value (OA)
FT | 50.4 (±5.89) | 97.9 (±0.14) | 66.3 (±1.91) | 48.9 (±5.41) | 90.7 (±0.21) | 84.1 (±0.25) | 90.3 (±0.28) | <0.0001
MADA | 57.1 (±4.76) | 98.3 (±0.17) | 78.6 (±2.41) | 55.2 (±4.68) | 97.5 (±0.12) | 93.7 (±0.18) | 95.8 (±0.24) | <0.0001
MDAN (Ours) | 76.9 (±3.46) | 98.5 (±0.11) | 82.6 (±1.81) | 61.4 (±4.32) | 98.0 (±0.15) | 93.9 (±0.2) | 96.5 (±0.17) | -
Table 12. Ablation experiments of KTM and FSM. The segmentation effectiveness of each category was evaluated by F1 and is expressed in percentage.
Method | OA (C_b) | Low_Veg (C_n) | Fac (C_n) | Shrub (C_n) | OA (C_n) | OA | p-Value (OA)
Base | 96.5 | - | - | - | - | - | -
KTM + Fine-tuning | 91.7 (±0.25) | 74.5 (±0.37) | 42.8 (±0.92) | 41.6 (±1.27) | 62.7 (±0.45) | 82.2 (±0.34) | <0.0001
FSM | 81.0 (±0.19) | 84.0 (±0.28) | 69.5 (±0.86) | 65.1 (±0.93) | 78.1 (±0.38) | 80.1 (±0.23) | 0.0001
Ours (KTM + FSM) | 91.7 (±0.25) | 83.2 (±0.19) | 64.2 (±0.73) | 52.1 (±0.84) | 73.2 (±0.26) | 85.5 (±0.25) | -
Table 13. Comparative experiments of SALM and the proposed FSM. The segmentation accuracy of each category was evaluated by F1 score and is expressed in percentage.
Method | Low_Veg (C_n) | Fac (C_n) | Shrub (C_n) | OA (C_n) | p-Value (OA)
SALM | 80.4 (±0.81) | 58.3 (±1.45) | 56.8 (±0.96) | 77.1 (±0.9) | -
FSM | 84.0 (±0.28) | 69.5 (±0.86) | 65.1 (±0.93) | 78.1 (±0.38) | 0.008
Table 14. The impact of random sampling. OA and averF1 are expressed in percentage.
Proportion | Metric | Maximum | Minimum | Average
10% | OA | 85.7 | 85.2 | 85.5
10% | averF1 | 74.3 | 72.1 | 73.8
1% | OA | 84.1 | 83.4 | 83.8
1% | averF1 | 67.1 | 69.2 | 68.4
0.1% | OA | 81.9 | 80.7 | 81.6
0.1% | averF1 | 65.6 | 62.1 | 64.1

