Highlights
What are the main findings?
- We propose GTA, the first weakly-supervised domain generalization framework for cross-platform (ground-to-altitude) point cloud segmentation.
- We develop a progressive domain-aware augmentation strategy with cross-scale semantic alignment to effectively mitigate domain shifts.
What are the implications of the main findings?
- We eliminate the dependency on annotated aerial point clouds by establishing a practical weakly supervised paradigm for ground-to-aerial collaborative perception.
- We significantly enhance cross-platform generalization capability in complex large-scale outdoor scenarios with diverse urban layouts.
Abstract
Collaborative sensing between low-altitude remote sensing and ground-based mobile mapping lays the theoretical foundation for multi-platform 3D data fusion. However, point clouds collected from Airborne Laser Scanners (ALSs) remain scarce due to high acquisition and annotation costs. In contrast, while autonomous driving datasets are more accessible, dense annotation remains a significant bottleneck. To address this, we propose Ground to Altitude (GTA), a weakly supervised domain generalization (DG) framework. GTA leverages sparse autonomous driving data to learn robust representations, enabling reliable segmentation on airborne point clouds under zero-label conditions. Specifically, we tackle cross-platform discrepancies through progressive domain-aware augmentation (PDA) and cross-scale semantic alignment (CSA). For PDA, we design a distance-guided dynamic upsampling strategy to approximate airborne point density and a cross-view augmentation scheme to model viewpoint variations. For CSA, we impose cross-domain feature consistency and contrastive regularization to enhance robustness against perturbations. A progressive training pipeline is further employed to maximize the utility of limited annotations and abundant unlabeled data. Our study reveals the limitations of existing DG methods in cross-platform scenarios. Extensive experiments demonstrate that GTA achieves state-of-the-art (SOTA) performance. Notably, under the challenging 0.1% supervision setting, our method achieves a 6.36% improvement in mIoU over the baseline on the SemanticKITTI → DALES benchmark, demonstrating significant gains across diverse categories beyond just structural objects.
1. Introduction
Three-dimensional LiDAR point cloud semantic segmentation is a fundamental task in scene understanding and plays a crucial role in various domains such as autonomous driving, human–computer interaction, and remote sensing. In recent years, advances in sensor acquisition technologies and data synthesis techniques have led to the emergence of several high-quality public datasets, such as SemanticKITTI [1] and SynLiDAR [2], which have significantly accelerated the progress of 3D point cloud semantic segmentation research.
However, most existing LiDAR datasets are collected from ground vehicles with a near-horizontal viewing perspective, resulting in data distributions and structural characteristics that are highly dependent on ground-level scenes. For typical aerial 3D applications—such as large-scale land-cover classification, building footprint extraction and reconstruction, vegetation and forest inventory analysis, agricultural field monitoring, disaster assessment, and city- or region-level geographic mapping—large-scale benchmark datasets remain scarce. Moreover, models trained on ground-view data typically exhibit limited generalization ability when transferred to aerial platforms due to substantial differences in viewpoint, point density distribution, and scene scale.
To improve model generalization, inspired by domain transfer studies in 2D vision tasks, several recent works in 3D point cloud semantic segmentation have begun exploring unsupervised domain adaptation (UDA) and domain generalization (DG) approaches. When handling large-scale outdoor datasets, fully annotating source domain data is prohibitively labor-intensive and often infeasible. Therefore, to address the challenge of cross-domain adaptation without significantly increasing data acquisition and labeling efforts, it is imperative to develop a cross-platform, weakly supervised 3D point cloud semantic segmentation framework that ensures stable scene perception across different sensing platforms. Such a capability not only facilitates multi-platform 3D data fusion but also lays a solid foundation for future intelligent systems to operate robustly in diverse and complex environments.
UDA methods typically rely on accessing target-domain data during training, which limits their generalization ability to only those domains observed during the training process. In contrast, DG aims to enhance model robustness in previously unseen domains without requiring any target-domain data. To achieve stable performance across heterogeneous platforms, this study adopts a DG-based research paradigm, with the objective of learning domain-invariant point cloud representations across data collected from different sensor platforms. However, even for the same semantic category, point clouds captured from different platforms may exhibit substantial discrepancies in both distribution and structural characteristics. As illustrated in Figure 1, airborne LiDAR typically preserves upper-layer geometry and global infrastructure while lacking interior and lateral surfaces, whereas ground-based autonomous driving datasets emphasize side-view structural details. These inherent differences pose significant challenges for constructing truly domain-invariant representations.
Figure 1.
Comparison of datasets captured from different perspectives. For better visualization, different colors denote distinct semantic categories. Point clouds from the airborne dataset DALES exhibit uniform density and typically preserve upper-layer geometry and overall infrastructure (top). In contrast, point clouds from the autonomous driving dataset SemanticKITTI decrease in density with distance and tend to emphasize side-view structures while underrepresenting upper-layer geometry (bottom).
To address this issue, we propose a cross-platform, weakly supervised 3D point cloud semantic segmentation framework based on DG. From a data-centric perspective, to reduce the model’s dependency on target-domain data, we employ progressive domain-aware augmentation (PDA), which applies stochastic transformations to source-domain point clouds based on a basic augmentation strategy. This encourages the model to learn more generalizable representations across varying samples without relying on domain-specific details, thereby enhancing generalization in the absence of target-domain data. Furthermore, we design platform-specific augmentation strategies tailored to the disparities between vehicle-mounted and airborne LiDAR data, improving the model’s adaptability to different platforms. From a label-centric perspective, weak supervision provides only a limited number of annotated points. To further enhance semantic refinement, our framework therefore leverages the large number of unlabeled points without disturbing the learning of labeled point features. Specifically, we introduce cross-scale semantic alignment (CSA) to enforce semantic consistency between original and augmented samples across multiple feature scales.
Finally, we conduct a systematic evaluation and analysis of the proposed strategy. Experimental results on the SemanticKITTI, SynLiDAR, and DALES [3] datasets demonstrate that our method, even when trained under weak supervision, consistently outperforms all baseline approaches, thereby validating the effectiveness of the proposed framework.
To summarize, our key contributions are the following:
- We propose the first DG-based cross-platform weakly supervised 3D point cloud semantic segmentation framework, which effectively learns domain-invariant features from multi-platform data.
- We design a progressive data augmentation strategy, simulating airborne LiDAR data to enhance cross-platform generalization.
- We introduce CSA to fully leverage unlabeled points, enabling richer contextual feature learning and further improving generalizable 3D semantic scene understanding.
2. Related Work
2.1. LiDAR Semantic Segmentation
LiDAR semantic segmentation (LSS) aims to assign per-point semantic labels to LiDAR point clouds, serving as a crucial step for comprehensive 3D scene understanding. Depending on the amount of annotated data required, existing LSS approaches can be categorized into fully supervised, unsupervised, and weakly supervised methods.
2.1.1. Fully Supervised LiDAR Semantic Segmentation
With the increasing availability of datasets, fully supervised LSS has achieved remarkable progress. Existing approaches can generally be categorized into three groups according to their data processing strategies: projection-based, voxel-based, and point-based methods. Projection-based methods [4,5,6,7,8,9,10] project 3D points onto 2D images for subsequent semantic learning; however, both the projection and data processing may lead to the loss of geometric details. Voxel-based methods [11,12,13,14,15] convert point clouds into voxel grids and apply efficient sparse convolutional operations, yet the voxelization process inevitably obscures the inherent data characteristics of raw point clouds. Point-based methods [16,17,18,19,20,21,22], starting from the seminal PointNet [23], directly process raw point sets with multi-layer perceptrons (MLPs) and have since inspired a wide range of follow-up work. More recently, transformer-based architectures have emerged as powerful backbones. For instance, Hierarchical Point Cloud Transformer (HPCT) [24] introduces a hierarchical point cloud transformer to extract multi-scale semantic features for vegetation segmentation, while Point Transformer V3 (PTv3) [25] prioritizes simplicity and scale, achieving state-of-the-art efficiency and accuracy on large-scale datasets. However, despite their strong representation capabilities, fully supervised methods attain their high-precision results only at the cost of extensive annotation.
2.1.2. Unsupervised LiDAR Semantic Segmentation
Unsupervised methods eliminate the need for labeled data, and their research in image semantic segmentation [26,27,28,29] has become increasingly widespread and mature. In the 3D domain, several studies have attempted to transfer this concept to point cloud segmentation tasks. PointDC [30] integrates cross-modal distillation with super-voxel clustering to infer semantic structures in a fully label-free manner. GrowSP (Growing Superpoint) [31] progressively expands superpoints to discover semantic elements in 3D space. U3DS3 [32] performs iterative training based on superpoint and spatial clustering and improves voxelized features by exploiting the invariance and equivariance of volumetric representations, enabling more robust feature learning. AdaCo [33] introduces a cross-modal label generation module (CLGM) and leverages the strong reasoning capability of visual foundation models (VFMs) to provide cross-modal supervision. Despite their advantage of substantially reducing annotation costs, the segmentation accuracy of unsupervised approaches often remains insufficient for practical applications.
2.1.3. Weakly Supervised LiDAR Semantic Segmentation
Weakly supervised methods can substantially reduce annotation costs while preserving reasonable segmentation accuracy, and research in this field has been extensively advanced in recent years. According to the strategies used to leverage training labels, existing methods can be categorized into four groups, namely two-dimensional label-based methods, cross-modal methods, pseudo-3D label-based methods, and limited 3D label-based methods.
Two-dimensional label-based methods rely exclusively on more readily available 2D annotations to guide the training of 3D models. Inspired by 3D reconstruction approaches based on 2D projections such as CAPNET [34] and 3D image analysis methods leveraging graph convolution such as DGCNN [20], Wang et al. [35] proposed the first weakly supervised point cloud semantic segmentation method, which projects the segmentation results of truncated 3D point clouds onto a 2D plane and employs 2D ground truth images for supervised training. Subsequently, Wang et al. [36] extended the original framework by incorporating a decoder capable of capturing visual information; however, this approach only considers local projection cues, limiting its ability to model global semantics. To mitigate excessive interference from 2D information, 3DSS-VLG [37] explores the use of only 2D semantic cues during the training phase to assist 3D weakly supervised semantic segmentation tasks, representing the first work in this field to leverage semantic information from category text labels.
Cross-modal methods further leverage the complementarity between different types of data, introducing additional supervisory information on the basis of weakly annotated point clouds. BPNet [38] enables bidirectional interaction between 2D and 3D information across multiple structural levels, effectively combining the strengths of both modalities for improved scene understanding. Kweon and Yoon [39] leverage 2D class activation maps as self-supervision to enhance 3D semantic perception, while employing a point cloud feature similarity matrix to guide the training of an image classifier. MIT [40] adopts dual encoders to extract self-attention features from 3D point clouds and 2D multi-view images, respectively, and introduces a decoder to perform interleaved 2D–3D cross-attention, thereby achieving implicit feature fusion across the two modalities. Sun et al. [41] argue that cameras are commonly available in LiDAR-equipped scenarios and thus can synchronously capture both 2D and 3D information. By leveraging complementary cues from unlabeled images, they design a dual-branch network equipped with an active annotation strategy to directly enable cross-modal knowledge transfer between the two domains.
Pseudo-3D label-based methods do not rely on annotations from other modalities; instead, they generate pseudo labels for unlabeled points to replace ground-truth labels during training. Inspired by the Class Activation Map (CAM) [42], Wei et al. [43] introduced the Point Class Activation Map (PCAM) to facilitate pseudo-label generation and, for the first time, exploited subcloud-level labels for weakly supervised training. SSPC-Net [44] generates pseudo labels by leveraging superpoints together with a dynamic label propagation strategy. Liu et al. [45] constructed super-voxels from point clouds and adopted a graph propagation module for both training and label propagation. HybridCR [46] represents the first framework to simultaneously integrate point consistency, contrastive regularization, and pseudo-labeling for segmentation. To further optimize supervisory information and simplify label propagation, Liu et al. [47] designed a training strategy combining active learning with self-training, while Wu et al. [48] refined the selection of reliable pseudo labels based on predicted confidence and uncertainty. Additionally, Deng et al. [49] proposed an image-assisted pseudo-label generator to enhance pseudo-label quality.
Limited 3D label-based methods combine partially annotated point clouds with multiple constraints to train segmentation models. Xu and Lee [50] introduced three additional constraints to enhance the features of unlabeled points. Wei et al. [51] proposed a dense supervision propagation approach that transfers supervision information from labeled points to unlabeled ones. Observing significant intra-class variations in 3D data, Su et al. [52] designed a multi-prototype classifier along with two constraints to further discover subclasses within each semantic category. Cheng et al. [53] noted limitations of existing random sampling strategies in autonomous driving scenarios and proposed a novel Polar Cylinder Balanced Random Sampling method to achieve a more balanced point cloud distribution. To address large-scale scene segmentation, Zhang et al. [54] leveraged predictive consistency between perturbed and original branches to propagate information between labeled and unlabeled points. Lee et al. [55] further introduced the concept of cognitive uncertainty to filter reliable features prior to information propagation.
However, despite their strong representation capabilities, these LSS methods lack specific domain alignment mechanisms, making them susceptible to significant performance degradation when directly applied to cross-platform airborne domains.
2.2. Unsupervised Domain Adaptation
Research on LSS has achieved substantial progress; however, directly applying trained models across different domains often results in significant performance degradation. To address this challenge, researchers have explored UDA, which aims to transfer knowledge from labeled source-domain data to unlabeled target domains. To mitigate inter-domain discrepancies, SqueezeSegV2 [56] and Yi et al. [57] transformed point clouds into depth maps and 2D range images, enabling the use of 2D UDA techniques for model training. LiDAR-UDA [58] reduced inter-sensor differences by combining LiDAR beam subsampling with cross-frame aggregation, while MoDA [59] leveraged motion prior knowledge from video sequences to facilitate cross-domain alignment for segmentation tasks. To generate high-quality pseudo labels, Bian et al. [60] introduced a multi-level feature consistency framework, and Zhao et al. [61] designed a pseudo-label refinement network (PRN) to select highly reliable labels. VFMSeg [62] utilized a vision foundation model to encode prior knowledge and produce more accurate labels for unlabeled target-domain data, while CLIP2UDA [63] further incorporated cross-modal learning using the pre-trained Contrastive Language–Image Pre-training (CLIP) model to enrich supervisory signals. Mixing-based strategies have shown promise in bridging domain gaps. Compositional Semantic Mix (CoSMix) [64] introduces a cloud-mixing strategy for UDA by mixing source and target point clouds, while UniMix [65] proposes a unified mixing framework to create intermediate domains for both UDA and DG tasks. While effective for clear-to-adverse weather adaptation or similar viewpoints, these UDA methods primarily focus on mixing samples within similar sensor perspectives and often fail to address the severe geometric deformations caused by the drastic ground-to-air viewpoint shift.
2.3. Domain Generalization
Unlike UDA, DG places greater emphasis on the model’s ability to generalize to unseen target domains. It aims to achieve strong performance even when the model cannot access target domain data during the training phase, thereby partially overcoming the limitations of UDA. Data augmentation is a common strategy to enhance generalization. PointDR [66] was the first to apply DG to 3D semantic segmentation, proposing a domain randomization technique to study adaptation from normal-weather point clouds to adverse-weather point clouds. DGLSS [67] simulated target-domain data by randomly subsampling LiDAR scans and learned generalized representations through sparse invariant feature consistency and semantic-relevant consistency. Similarly, cross-modal learning can improve model generalization. BEV-DG [68] designed a density-preserving vector modeling scheme from a bird’s-eye view to extract domain-invariant features, while LiDOG [69] leveraged semantic priors from 2D BEV views in conjunction with 3D data for training. Autonomous driving tasks can further exploit multi-frame sequence information to enhance performance. LiDomAug [70] augmented data by combining ego-motion information and sequential frames and generating augmented samples via randomized LiDAR sensor configurations. Sanchez et al. [71] performed label propagation using multi-frame aggregation and object motion information. In addition, Kim et al. [72] proposed density-aware sparse convolution and a self-supervised density prediction auxiliary task to enhance the model’s density-awareness. However, these approaches focus solely on adaptation across different autonomous driving datasets, overlooking the critical need for cross-platform generalization across LiDAR-equipped platforms in real-world scenarios. Our work addresses this gap: considering the cross-platform differences from autonomous driving to airborne point clouds, as well as the labeling challenges of existing datasets, we propose Ground to Altitude (GTA).
3. Method
In this section, we first introduce the overall framework of the proposed method in Section 3.1. We then elaborate on the PDA and the CSA in Section 3.2 and Section 3.3, respectively. Finally, the detailed loss functions are presented in Section 3.4.
3.1. Overview
The overview of GTA is illustrated in Figure 2. To mitigate the challenges introduced by domain shift, the proposed framework is designed to enhance weakly supervised 3D point cloud semantic segmentation. It integrates two complementary components: PDA and CSA. Working jointly, these modules improve robustness against distributional variations and enforce semantic consistency across different feature scales and domains.
Figure 2.
The architecture of GTA. The original point cloud $P$ is first processed by PDA to generate an augmented point cloud, $\tilde{P}$. Both $P$ and $\tilde{P}$ are then fed into the 3D semantic segmentation network to extract their corresponding features, $F$ and $\tilde{F}$. These features are passed through a series of multilayer perceptrons to produce initial semantic predictions, which are subsequently used to compute the CSA loss $\mathcal{L}_{\mathrm{CSA}}$, integrating contrastive regularization and consistency supervision. This loss enforces semantic consistency across scales and domains. Meanwhile, only the labeled points contribute to the cross-entropy segmentation loss $\mathcal{L}_{\mathrm{seg}}$, guiding the network to learn accurate predictions from sparse annotations.
First, the PDA module applies stage-wise domain-aware transformations to the input point clouds, including basic augmentations—mirroring, random rotation, and jittering—as well as perceptual augmentations, such as distance-guided upsampling and geometry-aware selection. Unlike conventional random augmentation strategies that may distort inherent spatial structures, PDA dynamically adjusts the augmentation intensity based on the training stage. During early training, the model focuses on learning fundamental geometric patterns under mild perturbations; as training progresses, it is gradually exposed to more complex geometric variations and density shifts. The distance-guided upsampling and geometry-aware selection operations increase point density while preserving top points, facade points, and a portion of original samples, thereby enriching training diversity without compromising structural integrity. This progressive strategy facilitates the learning of feature representations that are robust to domain variations while maintaining structural and semantic consistency.
Next, with the features extracted, MLPs are applied to generate initial semantic predictions. These predictions are further refined by the CSA module, which enforces semantic consistency across multiple spatial scales. Specifically, CSA constrains both coarse- and fine-grained semantic responses to maintain local coherence while preserving high-frequency structural details. This strategy effectively mitigates semantic drift caused by the significant geometric perturbations introduced by PDA, enabling the network to exploit complementary contextual cues and ensuring reliable cross-scale alignment.
Finally, the initial predictions at weakly annotated points are leveraged to compute the segmentation loss, which is jointly optimized with the cross-scale alignment loss. Taken together, these objectives constitute a unified augmentation–alignment learning paradigm that progressively improves DG and semantic stability under weak supervision.
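To make this augmentation–alignment paradigm concrete, the following minimal Python sketch outlines one training iteration under the assumptions above. The callables `model`, `augment`, and the entries of `losses`, as well as the weight `lam`, are illustrative placeholders, not the released implementation.

```python
def gta_training_step(points, labels, labeled_mask, model, augment, losses,
                      epoch, max_epochs, lam=0.5):
    """One GTA-style iteration: augment, run both branches, combine losses.

    points       : (N, 3) source-domain point cloud
    labels       : (N,) semantic labels, valid only where labeled_mask is True
    labeled_mask : (N,) boolean mask of weakly annotated points
    model        : callable, points -> (per-point class probabilities, embeddings)
    augment      : PDA-style callable, (points, progress) -> augmented points
    losses       : dict with 'seg' (weak cross-entropy) and 'csa' (alignment) callables
    """
    progress = epoch / max_epochs             # drives the progressive schedule

    aug_points = augment(points, progress)    # progressive domain-aware augmentation

    probs, feats = model(points)              # original branch
    aug_probs, aug_feats = model(aug_points)  # augmented branch

    seg = losses["seg"](probs, labels, labeled_mask)         # labeled points only
    csa = losses["csa"](probs, aug_probs, feats, aug_feats)  # consistency + contrastive
    return seg + lam * csa
```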
3.2. Progressive Domain-Aware Augmentation
Autonomous driving and aerial LiDAR point clouds exhibit significant discrepancies in feature distributions due to differences in acquisition viewpoints, sensor configurations, and scanning trajectories. Traditional data augmentation techniques fail to effectively simulate the structured geometric variations that naturally occur across domains. To address this challenge, we propose a PDA strategy. Built upon conventional geometric perturbations, PDA gradually increases perturbation complexity and diversity, generating multiple semantically consistent point cloud variants. This progressive scheduling enables the model to smoothly transition from learning local perturbation invariance to global domain shift robustness, thereby effectively narrowing the cross-domain gap.
Specifically, PDA consists of two stages: a basic augmentation stage and a domain-aware augmentation stage. In the early training phase, lightweight geometric perturbations such as mirroring, rotation, and jittering are applied to improve robustness against local spatial noise while preserving the overall topological layout. As training proceeds, PDA progressively activates domain-aware augmentations that mimic characteristics commonly observed in aerial LiDAR, including uniform density distribution, vertical view-angle bias, and geometric discontinuities caused by occlusions.
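As a reference point, the basic augmentation stage can be sketched as below in numpy; the perturbation ranges (mirroring probability, jitter magnitude and its growth with training progress) are illustrative assumptions rather than the paper’s exact settings.

```python
import numpy as np

def basic_augment(points, progress, rng=None):
    """Early-stage PDA perturbations: mirroring, z-axis rotation, and jittering.

    points   : (N, 3) array of xyz coordinates
    progress : training progress in [0, 1]; jitter strength grows with it
    """
    rng = np.random.default_rng() if rng is None else rng
    pts = points.copy()
    # Random mirroring about the x- and/or y-axis.
    if rng.random() < 0.5:
        pts[:, 0] *= -1.0
    if rng.random() < 0.5:
        pts[:, 1] *= -1.0
    # Random rotation about the vertical (z) axis.
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    pts = pts @ rot.T
    # Gaussian jitter whose magnitude follows the curriculum (mild early on).
    sigma = 0.01 + 0.04 * progress
    pts = pts + rng.normal(scale=sigma, size=pts.shape)
    return pts
```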
Motivated by the observation that autonomous driving point clouds become sparser with increasing distance, while aerial point clouds generally maintain uniform density, we introduce a distance-aware density compensation strategy. For each point $p_i = (x_i, y_i, z_i)$, its ground-projected distance is computed as
$$d_i = \sqrt{x_i^2 + y_i^2},$$
from which a relative compensation weight $w_i$ is derived and applied to enhance density in distant regions. Furthermore, we define a vertical structural saliency term $s_i$ based on local height differences, which captures geometric discontinuities on building rooftops and façades, thereby reinforcing occlusion patterns present from aerial viewpoints. To avoid semantic distortion, ground surfaces are preserved while façade structures are sparsely retained.
We unify density compensation, structural saliency, and training scheduling into a single augmentation formulation in which each point $p_i$ is duplicated with a point-wise duplication factor $n_i$ for local density augmentation. The duplication factor is modulated by a monotonic curriculum coefficient $\lambda(t)$, which increases the domain-shift intensity in the later training stages, and by weights $\alpha$ and $\beta$ that control the contributions of density and saliency, respectively. In addition, a façade saliency mask $M_i$ introduces sparse discontinuities in structurally significant regions.
This unified formulation enables joint geometric modeling of domain discrepancies: the compensation weight $w_i$ simulates uniform density at long ranges, the saliency term $s_i$ captures vertical structural deviations, and the mask $M_i$ introduces façade sparsity. The exponential scheduling term $\lambda(t)$ ensures a smooth transition from mild to strong domain perturbations, preventing early training instability. After applying PDA, the original point cloud $P$ and the augmented point cloud $\tilde{P}$ are fed into the network to extract features. The obtained representations correspond to structurally consistent yet geometrically diverse inputs, significantly improving spatial robustness under cross-domain scenarios. By progressively exposing the model to increasingly challenging geometric variations, PDA enhances generalization while preserving semantic stability, laying a solid foundation for subsequent cross-domain feature alignment.
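Since the exact piecewise formulation could not be reproduced here, the sketch below only illustrates the described ingredients—ground-projected distance $d_i$, compensation weight $w_i$, saliency $s_i$, façade mask $M_i$, and an exponential curriculum coefficient $\lambda(t)$—using assumed functional forms (e.g., a median-height saliency proxy and the thresholds for the mask). It is a hypothetical approximation, not the authors’ implementation.

```python
import numpy as np

def distance_guided_upsample(points, progress, alpha=1.0, beta=0.5, rng=None):
    """Sketch of distance-aware density compensation with a facade saliency mask.

    points   : (N, 3) xyz coordinates of a vehicle-mounted scan
    progress : training progress in [0, 1]
    """
    rng = np.random.default_rng() if rng is None else rng
    xyz = np.asarray(points, dtype=float)

    # Ground-projected distance d_i and relative compensation weight w_i.
    d = np.sqrt(xyz[:, 0] ** 2 + xyz[:, 1] ** 2)
    w = d / (d.max() + 1e-6)

    # Vertical structural saliency s_i, here approximated by the deviation
    # from the median height (a stand-in for local height differences).
    s = np.abs(xyz[:, 2] - np.median(xyz[:, 2]))
    s = s / (s.max() + 1e-6)

    # Facade saliency mask M_i: sparsely retain structurally salient points.
    mask = (s > 0.6) & (rng.random(len(xyz)) < 0.5)

    # Exponential curriculum coefficient lambda(t), ramping from 0 to 1.
    lam = np.expm1(progress) / np.expm1(1.0)

    # Extra copies per point: denser in distant and structurally salient regions.
    n_extra = np.floor(lam * (alpha * w + beta * s * mask)).astype(int)
    copies = np.repeat(xyz, n_extra, axis=0)
    copies = copies + rng.normal(scale=0.02, size=copies.shape)  # avoid exact duplicates

    return np.concatenate([xyz, copies], axis=0)
```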
3.3. Cross-Scale Semantic Alignment
While PDA effectively increases cross-domain geometric diversity, it inevitably introduces feature mismatches between the original and augmented samples. Under weak supervision, where point-level annotations are limited, local geometric perturbations are prone to induce semantic drift and accumulate in the feature space, ultimately degrading class discriminability. Moreover, features at different scales exhibit varying sensitivities to local structures and global semantics, which can further exacerbate potential alignment bias.
To address these challenges, we propose a CSA module that enforces semantic consistency and contrastive regularization across multiple feature scales to achieve robust alignment under diverse geometric perturbations. Specifically, lower-level features capture fine-grained local geometry, while higher-level features focus on global semantic stability. In addition, contrastive regularization explicitly pulls semantically consistent samples closer and pushes away potentially misaligned instances, effectively mitigating distributional deviation introduced by PDA.
With CSA, the network learns semantically invariant representations across geometry-diverse domains, thereby substantially improving the generalization and discriminative capability of the feature space, and providing robust support for weakly supervised scenarios lacking dense annotations.
3.3.1. Semantic Consistency Constraint
Directly imposing supervision on augmented point clouds may lead the model to overfit perturbation-induced biases, resulting in degraded segmentation performance. To address this issue, we introduce a semantic consistency constraint that enforces alignment between the predictive distributions of the original and augmented point clouds, thereby enhancing robustness against structural perturbations.
Given the softmax distributions $p_i$ of the original prediction and $q_i$ of the augmented prediction for point $i$, the cross-cloud consistency loss is defined as
$$\mathcal{L}_{\mathrm{cons}} = \frac{1}{N} \sum_{i=1}^{N} D_{\mathrm{JS}}\left(p_i \,\|\, q_i\right),$$
where $N$ denotes the number of points and $D_{\mathrm{JS}}$ represents the Jensen–Shannon divergence, computed as
$$D_{\mathrm{JS}}\left(p_i \,\|\, q_i\right) = \tfrac{1}{2} D_{\mathrm{KL}}\left(p_i \,\|\, m_i\right) + \tfrac{1}{2} D_{\mathrm{KL}}\left(q_i \,\|\, m_i\right), \qquad m_i = \tfrac{1}{2}\left(p_i + q_i\right),$$
where $D_{\mathrm{KL}}$ denotes the Kullback–Leibler divergence:
$$D_{\mathrm{KL}}\left(p_i \,\|\, m_i\right) = \sum_{c} p_i(c) \log \frac{p_i(c)}{m_i(c)}.$$
The computation of $D_{\mathrm{KL}}\left(q_i \,\|\, m_i\right)$ is analogous, with $p_i(c)$ replaced by $q_i(c)$.
Since the model is highly sensitive to noisy perturbations during early training, we apply a parameterized multi-stage scheduling strategy that gradually activates the consistency objective through a time-dependent weight $\gamma(t)$, where $t$ denotes the current epoch, $T$ is the maximum number of epochs, and the three scheduling coefficients of $\gamma(t)$ are tunable hyper-parameters. Consistent with the multi-stage PDA pipeline, we assign weaker consistency strength in early phases and gradually increase it as training stabilizes, effectively suppressing distributional drift introduced by aggressive geometry perturbations.
The final semantic consistency loss is defined as
$$\mathcal{L}_{\mathrm{sc}} = \gamma(t)\, \mathcal{L}_{\mathrm{cons}}.$$
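A compact sketch of this objective is given below. The Jensen–Shannon computation follows the definitions above, while the sigmoid-shaped ramp in `consistency_weight` is only one plausible instantiation of the multi-stage schedule, with `w_max`, `t0`, and `k` as assumed hyper-parameters.

```python
import numpy as np

def js_consistency_loss(p, q, eps=1e-8):
    """Mean Jensen-Shannon divergence between original (p) and augmented (q)
    softmax predictions, both of shape (N, num_classes)."""
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * (np.log(p + eps) - np.log(m + eps)), axis=1)
    kl_qm = np.sum(q * (np.log(q + eps) - np.log(m + eps)), axis=1)
    return float(np.mean(0.5 * kl_pm + 0.5 * kl_qm))

def consistency_weight(t, T, w_max=1.0, t0=0.3, k=10.0):
    """Sigmoid-shaped ramp-up of the consistency term over training: weak in
    early epochs, approaching w_max once training has stabilised."""
    progress = t / T
    return w_max / (1.0 + np.exp(-k * (progress - t0)))
```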
3.3.2. Contrastive Regularization
Relying solely on consistency constraints in the prediction space may cause feature representations of structurally distinct points to collapse over time, leading to inter-class confusion under aggressive geometric perturbations. To enhance class separability in the embedding space, we further introduce a point-level contrastive regularization term.
Given the original embedding features $Z = \{z_i\}_{i=1}^{N}$ and the augmented embedding features $\tilde{Z} = \{\tilde{z}_i\}_{i=1}^{N}$, we apply $\ell_2$ normalization along the feature dimension for both. For each anchor embedding $z_i$, its corresponding positive sample is the augmented counterpart $\tilde{z}_i$, while all other embeddings are considered as negatives. Based on a temperature parameter $\tau$, the point-level contrastive regularization loss is formulated as
$$\mathcal{L}_{\mathrm{cr}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\!\left(\mathrm{sim}\left(z_i, \tilde{z}_i\right)/\tau\right)}{\sum_{j=1}^{N} \exp\!\left(\mathrm{sim}\left(z_i, \tilde{z}_j\right)/\tau\right)},$$
where $\mathrm{sim}(\cdot,\cdot)$ denotes the cosine similarity, $\tau$ is the temperature hyperparameter, and $N$ is the number of sampled points.
This regularization term encourages anchor embeddings to be pulled closer to their corresponding positive samples while suppressing similarity to negatives in the embedding space, thus complementing the semantic consistency constraint imposed in the prediction space. We jointly define the cross-domain semantic alignment loss as
$$\mathcal{L}_{\mathrm{CSA}} = \mathcal{L}_{\mathrm{sc}} + \mathcal{L}_{\mathrm{cr}}.$$
By simultaneously optimizing output-space consistency and embedding-space discrimination, the model effectively mitigates feature collapse, strengthens structural robustness, and yields domain-invariant 3D semantic representations.
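Below is a minimal numpy sketch of the point-level contrastive term. The temperature value and the use of all augmented embeddings as the negative set in the denominator follow the description above, but the exact sampling scheme is an assumption.

```python
import numpy as np

def contrastive_regularization(z, z_aug, tau=0.1):
    """InfoNCE-style point-level contrastive loss between L2-normalised
    original embeddings z and augmented embeddings z_aug, both (N, D).
    For anchor i, the augmented counterpart i is the positive; every other
    augmented embedding acts as a negative."""
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-8)
    z_aug = z_aug / (np.linalg.norm(z_aug, axis=1, keepdims=True) + 1e-8)
    sim = (z @ z_aug.T) / tau                    # (N, N) scaled cosine similarities
    sim = sim - sim.max(axis=1, keepdims=True)   # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))    # positives on the diagonal
```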
3.4. Loss Functions
Based on the aforementioned module designs, we propose GTA, a weakly supervised domain-generalizable LSS framework. The framework jointly optimizes the weakly supervised segmentation loss together with the semantic alignment regularization introduced by PDA and CSA, thereby improving classification accuracy and cross-domain robustness. The overall loss function is defined as
$$\mathcal{L} = \mathcal{L}_{\mathrm{seg}} + \lambda\, \mathcal{L}_{\mathrm{CSA}},$$
where $\lambda$ controls the contribution of the semantic alignment regularization term.
The weakly supervised segmentation loss is computed by applying cross-entropy between the predicted probability and the weak annotations, defined as
$$\mathcal{L}_{\mathrm{seg}} = -\frac{1}{|\Omega|} \sum_{i \in \Omega} \sum_{c=1}^{K} y_{i,c} \log \hat{y}_{i,c},$$
where $y_{i,c}$ denotes the weak supervision signal, $\hat{y}_{i,c}$ denotes the model prediction for class $c$ at point $i$, and $\Omega$ represents the set of weakly annotated points.
During training, the model further optimizes semantic consistency in the prediction space and semantic alignment in the embedding space, effectively mitigating distributional discrepancies caused by domain shift. Consequently, the model achieves enhanced generalization capability and more stable semantic discrimination across diverse scenarios.
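For completeness, a short sketch of how the weak segmentation loss over the annotated set and the alignment term might be combined is shown below; variable names and the default weight `lam` are illustrative assumptions.

```python
import numpy as np

def weak_ce_loss(probs, labels, labeled_mask, eps=1e-8):
    """Cross-entropy restricted to the weakly annotated subset.
    probs: (N, K) softmax outputs, labels: (N,) class indices, labeled_mask: (N,) bool."""
    idx = np.where(labeled_mask)[0]
    picked = probs[idx, labels[idx]]             # predicted probability of the true class
    return float(-np.mean(np.log(picked + eps)))

def total_loss(seg_loss, csa_loss, lam=0.5):
    """Overall objective: weak segmentation loss plus the weighted alignment term."""
    return seg_loss + lam * csa_loss
```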
4. Experimental Results
To evaluate the semantic segmentation performance of our model under a DG setting, we restricted the training process to a single annotated autonomous-driving dataset as the source domain and subsequently deployed the trained model on previously unseen airborne LiDAR data for target-domain testing. Using the DALES dataset, we constructed two DG benchmark test sets to rigorously assess how the proposed GTA framework facilitates cross-platform generalization. In addition, we compared our approach against several state-of-the-art (SOTA) point cloud segmentation generalization networks to provide a comprehensive performance analysis.
4.1. Implementation Details
We selected seven shared evaluation categories based on the semantic taxonomy of the DALES dataset for both DG benchmarks. Since SemanticKITTI and SynLiDAR contain more fine-grained class definitions, several categories were merged accordingly, and all dataset annotations were remapped to a unified label space. For SemanticKITTI and SynLiDAR, we adopted point-level weak supervision by randomly selecting a fixed proportion of points as labeled samples. Specifically, we tailored the labeled ratio to the distinct characteristics of each source domain. For SemanticKITTI, we strictly adhered to the standard protocol of the baseline Semantic Query Network (SQN) [73] by fixing the labeled ratio at 0.1% to ensure a fair comparison and isolate the contribution of our domain-aware modules. Conversely, for SynLiDAR, we adopted a 1% labeled setting. This adjustment compensated for the lower point density of the synthetic dataset compared to that of real-world LiDAR, ensuring sufficient supervision signals for effective learning. All experiments were conducted on a workstation equipped with an Intel Core™ i9-12900KF CPU and an NVIDIA RTX 3090 Ti GPU with 24 GB of memory.
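The point-level weak supervision described above can be reproduced with a simple random mask, as sketched below; the seed handling is an assumption, since the paper does not specify it.

```python
import numpy as np

def sample_weak_labels(num_points, ratio=0.001, seed=0):
    """Randomly mark a fixed proportion of points as labeled (e.g., 0.1% for
    SemanticKITTI, 1% for SynLiDAR); the remaining points stay unlabeled."""
    rng = np.random.default_rng(seed)
    num_labeled = max(1, int(round(num_points * ratio)))
    mask = np.zeros(num_points, dtype=bool)
    mask[rng.choice(num_points, size=num_labeled, replace=False)] = True
    return mask
```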
4.2. Evaluation Metrics
To quantitatively evaluate and analyze the performance of the model, we performed assessments on all points of the target-domain test set, DALES. We adopted the Intersection over Union (IoU) for each segmentation class and the mean Intersection over Union (mIoU) across all classes as evaluation metrics. Let $K$ denote the total number of classes, $TP$ the number of true positives, $FP$ the number of false positives, $FN$ the number of false negatives, and $TN$ the number of true negatives [74]. The two evaluation metrics are computed as follows:
$$\mathrm{IoU}_k = \frac{TP_k}{TP_k + FP_k + FN_k}, \qquad \mathrm{mIoU} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{IoU}_k.$$
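Both metrics can be computed from a class confusion matrix, as in the short sketch below.

```python
import numpy as np

def iou_from_confusion(conf):
    """Per-class IoU and mIoU from a (K, K) confusion matrix whose rows are
    ground-truth classes and whose columns are predicted classes."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1e-8)
    return iou, float(iou.mean())
```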
4.3. Comparison with SOTA Methods
To provide a realistic evaluation of our model’s performance, we selected several closely related methods as baselines. First, we considered data augmentation-based approaches, including Pointcutmix [75]—which generates augmented samples by mixing two point clouds selected from the training set—and Mix3D [76], which constructs hybrid scenes by concatenating points and labels from different scenes. Second, we included domain-adaptive methods such as RayCast [77] and fully supervised DG approaches for all-weather scenarios like PointDR [66], which employs a domain randomization strategy during segmentation training. Finally, we also evaluated the generalization performance of SQN, a weakly supervised method designed for large-scale 3D point cloud semantic segmentation. Notably, the implementations of the baseline methods were based on the relevant code from the LiDOG [69] project.
4.3.1. Evaluation on SemanticKITTI → DALES
SemanticKITTI is a densely annotated autonomous-driving dataset collected by the Mobile Laser Scanning (MLS) platform at the University of Bonn, Germany. It comprises 45.49 million points and 28 semantic categories, which we further consolidated into seven categories for our study.
We trained the model on the SemanticKITTI dataset and evaluated its generalization performance on the DALES dataset, with the results summarized in Table 1. Under a fully supervised training regime, the data augmentation method Mix3D achieved the best generalization performance, reaching 22.61% mIoU. The DG method for all-weather scenarios, PointDR, performed poorly, achieving only 14.71% mIoU, with the lowest performance observed on the ground category. The weakly supervised method SQN attained 20.54% mIoU, surpassing Pointcutmix, RayCast, and PointDR. Notably, GTA achieved the SOTA segmentation performance. Using only 0.1% of the training data, it attained 26.9% mIoU, outperforming PointCutMix and Mix3D by 7.23% and 4.29%, respectively, and surpassing RayCast by 9.02%. Furthermore, compared to SQN under the same 0.1% annotation setting, our method yielded a significant performance gain of 6.36%. These results demonstrate GTA’s capability to learn more robust representations when addressing cross-platform domain generalization tasks.
Table 1.
Performance comparison on SemanticKITTI → DALES.
The qualitative results of transferring from SemanticKITTI to DALES using only 0.1% annotations are illustrated in Figure 3. The first column shows the raw input point clouds, which are visualized using a Scalar Field Visualization technique since the DALES dataset does not provide RGB information. The second column presents the ground-truth labels, while the third column displays the predictions generated by our method. The last column shows the predictions of SQN [73], which serves as the backbone network of our framework. Regions where our method achieves superior predictions are highlighted in purple. As can be observed, our method yields substantial improvements over SQN on the DALES dataset, particularly in the more accurate identification of buildings and cars, thereby demonstrating the effectiveness of the proposed approach.
Figure 3.
Visualization results of cross-DG from SemanticKITTI to DALES. From left to right, we show the raw point cloud, the ground-truth annotations, the predictions of our method, and those of the comparison method. Regions where our approach outperforms SQN are highlighted with purple circles. Note that the ground truth (GT) is provided as a reference to highlight the significant structural improvements of GTA compared to the baseline (SQN), particularly in recovering geometric details.
4.3.2. Evaluation on SynLiDAR → DALES
SynLiDAR is a large-scale synthetic dataset collected and annotated by Nanyang Technological University from diverse virtual environments with rich scene content and layouts. It contains over 19 billion points across 32 semantic categories, which we further consolidated into seven test categories for evaluation.
We trained the model solely on the SynLiDAR dataset and evaluated its target-domain performance on the DALES dataset, with the results summarized in Table 2. Under a fully supervised training regime, the data augmentation method Pointcutmix achieved the best generalization performance, attaining 22.19% mIoU, surpassing Mix3D, RayCast, and PointDR. The weakly supervised 3D point cloud segmentation method SQN performed poorly, achieving only 14.34% mIoU, and struggled particularly with distinguishing the vegetation category.
Table 2.
Performance comparison on SynLiDAR → DALES.
The proposed method, GTA, achieved the highest segmentation performance, reaching 24.02% mIoU using only 1% of the training data. This corresponded to improvements of 1.83% and 8.61% over the data augmentation methods Pointcutmix and Mix3D, respectively, and an increase of 7.51% compared to the domain-adaptive method RayCast. Furthermore, compared to SQN, which also used 1% labeled data, GTA achieved a 9.68% higher mIoU. These results demonstrate the ability of GTA to learn more robust representations for cross-platform tasks and for generalizing from synthetic to real-world data. Notably, the performance gap of SQN between synthetic and autonomous-driving datasets highlights the inherent difficulty of synthetic-to-real generalization, whereas GTA maintained strong performance, confirming its effectiveness.
The qualitative results of transferring from SynLiDAR to DALES using only 1% annotations are presented in Figure 4. The first column shows the raw input point clouds, which were visualized using the Scalar Field Visualization technique for the DALES dataset. The second column depicts the ground-truth labels, while the third column displays the predictions produced by our method. The last column shows the predictions of SQN, which serves as the backbone network of our framework. Regions where our method achieves superior predictions are highlighted in purple. As observed, our method exhibits substantial improvements over SQN on the DALES dataset, particularly in the more accurate identification of buildings and vegetation, demonstrating the effectiveness of the proposed approach.
Figure 4.
Visualization results of cross-DG from SynLiDAR to DALES. From left to right, we show the raw point cloud, the ground-truth annotations, the predictions of our method, and those of the comparison method. Regions where our approach outperforms SQN are highlighted with purple circles. Note that the ground truth (GT) is provided as a reference to highlight the significant structural improvements of GTA compared to the baseline (SQN), particularly in recovering geometric details.
5. Discussion
5.1. Ablation Study
To thoroughly evaluate the individual contributions of the key components in our GTA framework, we conducted a systematic analysis combining quantitative metrics and qualitative visualization. The experiments were performed on the cross-domain SemanticKITTI → DALES scenario under the 0.1% weakly supervised setting. The quantitative results are summarized in Table 3.
Table 3.
Ablation study (removing modules) on SemanticKITTI → DALES.
5.1.1. Effectiveness of Progressive Domain-Aware Augmentation
Removing the PDA module resulted in a 7.76% decrease in mIoU. This metric drop indicates that the model struggles to adapt to the geometric characteristics of the target airborne point clouds without progressive perturbations. Operating at the input level, PDA introduces data augmentation in a staged manner to progressively align the geometric statistics of source samples with the target distribution, thereby enabling the model to learn more discriminative and structured semantic features. These ablation results align with existing theoretical analyses of DG and staged adaptation mechanisms, further validating the positive contribution of PDA in robust feature extraction.
5.1.2. Effectiveness of the Cross-Scale Semantic Alignment
Table 3 reveals that removing the CSA module precipitated the most substantial performance decline of 8.82%. This sharp drop underscores that merely utilizing PDA is insufficient to bridge the severe domain gap without explicit feature-level alignment. As corroborated by the t-SNE visualization in Figure 5, the baseline features exhibit severe entanglement, whereas the full framework incorporating CSA establishes clear decision boundaries. This confirms that CSA effectively operates on the embedding space by enforcing consistency constraints across scales. Particularly under extremely sparse supervision, CSA compensates for the semantic uncertainty and enhances class separability, playing a decisive role in constructing a discriminative semantic structure.
Figure 5.
The t-SNE visualization of feature distributions on the DALES dataset. The numbers in the legend correspond to different semantic classes. (Left): The baseline features exhibit severe entanglement between classes, indicating poor semantic discriminability. (Right): In contrast, our GTA framework produces compact intra-class clusters and clear inter-class separation, demonstrating that the PDA and CSA modules effectively refine the feature space.
5.1.3. Synergistic Effects
The complete GTA model achieved optimal performance when both PDA and CSA were incorporated, indicating a complementary relationship between the two components at different spatial levels. PDA bridges the geometric gap at the data level, while CSA bridges the semantic gap at the feature level. Comparing the visual results in Figure 5, we observe that our full framework transforms the chaotic feature space of the baseline into compact, well-separated clusters. This proves that the two modules mutually reinforce each other: PDA provides a diverse and robust input distribution, which facilitates CSA in learning more discriminative structural features, ultimately forming a closed loop for cross-domain robustness.
5.1.4. Computational Complexity Analysis
To evaluate the practical efficiency of our GTA framework, we compared the model parameters and inference speed with those of the baseline SQN. All experiments were conducted on a single NVIDIA RTX 3090 Ti GPU with a batch size of 2. As summarized in Table 4, our method introduced only a marginal increase in model size from 1.05 M to 1.11 M parameters. This indicates that the proposed PDA and CSA modules are parameter-efficient and do not impose a heavy memory burden. Significantly, the inference speed remained consistent at 10.0 FPS, comparable to that of the baseline. This demonstrates that GTA significantly improves segmentation performance without compromising the real-time capability of the network, highlighting its potential for efficient deployment in resource-constrained autonomous systems.
Table 4.
Comparison of computational complexity and performance on DALES.
5.2. Performance Analysis
5.2.1. Quantitative Error Analysis
To further investigate the limitations and failure modes of our method in the airborne scenario, we conducted a quantitative error analysis on the DALES dataset.
As shown in Figure 6 (left), the confusion matrix reveals a systematic misclassification where various structural objects are erroneously predicted as Vegetation. Specifically, the model misclassified large fractions of Buildings (81%), Trucks (64%), and Cars (56%) as Vegetation. This widespread confusion is primarily attributed to the drastic viewpoint shift. The model, trained on the ground-based LiDAR dataset SemanticKITTI, learned to recognize these classes largely by their vertical facades. However, in the airborne target domain, these objects predominantly present as roofs (horizontal or sloped planes) while vertical facades are largely occluded or missing. Consequently, the model failed to identify the characteristic vertical geometric features and erroneously categorized these elevated roof points as Vegetation.
Figure 6.
Quantitative error analysis on the DALES dataset. (Left): The normalized confusion matrix reveals a severe misclassification of structural objects as Vegetation, highlighted by the high values in the “Vegetation” column. (Right): The per-class IoU comparison demonstrates that our GTA framework achieves significant gains in structural categories compared to the baseline SQN.
However, as shown in Figure 6 (right), our method still demonstrated superior robustness compared to the baseline. The GTA framework significantly improved the IoU of structural classes, boosting Ground by 11.4% and Cars by 15.5%. These results verify that although geometric ambiguity persisted, our domain-aware modules effectively recovered a substantial portion of semantic information lost by the baseline.
5.2.2. Qualitative Visualization
Complementing the quantitative metrics, we present a visualization study to intuitively demonstrate the generalization capability of GTA. Due to the distinct geometric characteristics of different categories, generalization strategies are often challenged by specific class-level variances, particularly under the drastic viewpoint shift between domains.
As shown in Figure 7, we visualize the segmentation results of the SemanticKITTI → DALES scenario under the 0.1% labeled setting. We specifically highlight four representative categories: Vegetation, Poles, Cars, and Buildings. The red bounding boxes in raw point clouds highlight the regions of interest.
Figure 7.
Visualization results of generalizing the SemanticKITTI training set to the DALES dataset. Each group of results includes the raw point cloud, the ground-truth annotations within the red bounding boxes, the predictions of baseline methods, and the results of our approach. With only 0.1% labeled points, the baseline methods exhibit noticeable discrepancies from the ground-truth annotations, whereas the proposed GTA method produces segmentation results that are much closer to the ground truth.
It is evident that our GTA produces predictions much closer to the ground truth compared to those of the baseline SQN. A critical observation can be made in the building case: the baseline model, struggling with the top-down view, fails to recognize the building structure and misclassifies the roof points as vegetation. In contrast, GTA successfully captures the geometric consistency of the building despite the missing vertical facades. Similarly, for small objects like poles and cars, our method preserves fine-grained structural details that are lost by the baseline. This visual evidence confirms that our DG strategy effectively learns robust, viewpoint-invariant features.
6. Conclusions
In this paper, we introduced GTA, a weakly supervised DG framework for LSS that bridges the gap between autonomous driving and airborne point clouds, enabling zero-label deployment in the target domain. To accommodate distributional differences across heterogeneous LiDAR platforms, we proposed a distance-guided dynamic upsampling strategy and a cross-view augmentation scheme within the PDA module, supported by a progressive training pipeline that ensures stable optimization. Furthermore, the semantic consistency and contrastive regularization of CSA enhance robustness against perturbations and improve the discriminability of learned representations. Extensive experiments demonstrate that, although a performance gap remains compared to the ground truth (the theoretical upper bound), GTA achieves SOTA performance among weakly supervised methods for cross-platform 3D semantic segmentation. Error analysis reveals that domain viewpoint shifts lead to geometric ambiguity between structural objects and vegetation. In the absence of vertical facade features, models relying on sparse geometric cues struggle to distinguish flat rooftops from vegetation canopies. In future work, we plan to address this by incorporating cross-modal data to provide complementary texture information, or by exploring geometric-aware pre-training strategies to learn more viewpoint-invariant representations, to further close this performance gap.
Author Contributions
Conceptualization, J.W., X.X., J.L., Y.L. and C.C.; Methodology, J.W. and X.X.; Validation, J.W., J.L. and Q.L.; Formal analysis, C.C.; Investigation, J.W. and Y.L.; Resources, Y.L.; Data curation, J.L., Q.L. and C.C.; Writing – original draft, J.W.; Writing – review & editing, J.W. and X.X.; Visualization, J.W. and Q.L.; Supervision, X.X., Y.L. and C.C.; Project administration, X.X., J.L., Y.L. and C.C.; Funding acquisition, C.C. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The original data presented in this study are openly available from SemanticKITTI, SynLiDAR and DALES: www.semantic-kitti.org (accessed on 1 June 2023), https://github.com/xiaoaoran/SynLiDAR (accessed on 28 June 2021) and https://sites.google.com/a/udayton.edu/vasari1/research/earth-vision/dales (accessed on 19 June 2020) or [1,2,3].
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| GTA | Ground to Altitude; |
| MLPs | Multilayer perceptrons; |
| SOTA | State of the art; |
| CE | Cross-entropy. |
References
- Behley, J.; Garbade, M.; Milioto, A.; Quenzel, J.; Behnke, S.; Stachniss, C.; Gall, J. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9297–9307. [Google Scholar]
- Xiao, A.; Huang, J.; Guan, D.; Zhan, F.; Lu, S. Transfer learning from synthetic to real LiDAR point cloud for semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Palo Alto, CA, USA, 22 February–1 March 2022; Volume 36, pp. 2795–2803. [Google Scholar]
- Varney, N.; Asari, V.K.; Graehling, Q. DALES: A large-scale aerial LiDAR data set for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Utrecht, The Netherlands, 7–11 December 2020; pp. 186–187. [Google Scholar]
- Xiao, A.; Yang, X.; Lu, S.; Guan, D.; Huang, J. FPS-Net: A convolutional fusion network for large-scale LiDAR point cloud segmentation. Isprs J. Photogramm. Remote Sens. 2021, 176, 237–249. [Google Scholar] [CrossRef]
- Cortinhal, T.; Tzelepis, G.; Aksoy, E.E. Salsanext: Fast, uncertainty-aware semantic segmentation of LiDAR point clouds. In Proceedings of the International Symposium on Visual Computing (ISVC), San Diego, CA, USA, 5–7 October 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 207–222. [Google Scholar]
- Peng, K.; Fei, J.; Yang, K.; Roitberg, A.; Zhang, J.; Bieder, F.; Heidenreich, P.; Stiller, C.; Stiefelhagen, R. MASS: Multi-attentional semantic segmentation of LiDAR data for dense top-view understanding. IEEE Trans. Intell. Transp. Syst. 2022, 23, 15824–15840. [Google Scholar] [CrossRef]
- Ando, A.; Gidaris, S.; Bursuc, A.; Puy, G.; Boulch, A.; Marlet, R. RangeViT: Towards vision transformers for 3D semantic segmentation in autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 5240–5250. [Google Scholar]
- Cheng, H.X.; Han, X.F.; Xiao, G.Q. TransRVNet: LiDAR semantic segmentation with transformer. IEEE Trans. Intell. Transp. Syst. 2023, 24, 5895–5907. [Google Scholar] [CrossRef]
- Lin, F.; Lin, T.; Yao, Y.; Ren, H.; Wu, J.; Cai, Q. VPA-Net: A visual perception assistance network for 3D LiDAR semantic segmentation. Pattern Recognit. 2025, 158, 111014. [Google Scholar] [CrossRef]
- Li, R.; Li, S.; Chen, X.; Ma, T.; Gall, J.; Liang, J. TFNet: Exploiting temporal cues for fast and accurate LiDAR semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Seattle, WA, USA, 17–21 June 2024; pp. 4547–4556. [Google Scholar]
- Wang, X.; Feng, W.; Kong, L.; Wan, L. NUC-Net: Non-uniform cylindrical partition network for efficient LiDAR semantic segmentation. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 9090–9104. [Google Scholar] [CrossRef]
- Zhu, X.; Zhou, H.; Wang, T.; Hong, F.; Ma, Y.; Li, W.; Li, H.; Lin, D. Cylindrical and asymmetrical 3D convolution networks for LiDAR segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19 November 2021; pp. 9939–9948. [Google Scholar]
- Hou, Y.; Zhu, X.; Ma, Y.; Loy, C.C.; Li, Y. Point-to-voxel knowledge distillation for LiDAR semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 5 June 2022; pp. 8479–8488. [Google Scholar]
- Choy, C.; Gwak, J.; Savarese, S. 4D spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3075–3084. [Google Scholar]
- Graham, B.; Engelcke, M.; Van Der Maaten, L. 3D semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 9224–9232. [Google Scholar]
- Wang, Z.; Chen, H.; Liu, J.; Qin, J.; Sheng, Y.; Yang, L. Multilevel intuitive attention neural network for airborne LiDAR point cloud semantic segmentation. Int. J. Appl. Earth Obs. Geoinf. 2024, 132, 104020. [Google Scholar] [CrossRef]
- Zhou, Y.; Xie, Z.; Zhao, J.; Du, W.; Yao, R.; El Saddik, A. Multi-modal LiDAR point cloud semantic segmentation with salience refinement and boundary perception. ACM Trans. Multimed. Comput. Commun. Appl. 2024, 20, 1–20. [Google Scholar]
- Qiu, S.; Li, X.; Xue, X.; Pu, J. PC-BEV: An efficient polar-Cartesian BEV fusion framework for LiDAR semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 6612–6620. [Google Scholar]
- Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. RandLA-Net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11108–11117. [Google Scholar]
- Phan, A.V.; Le Nguyen, M.; Nguyen, Y.L.H.; Bui, L.T. DGCNN: A convolutional neural network over large-scale labeled graphs. Neural Netw. 2018, 108, 533–543. [Google Scholar] [CrossRef] [PubMed]
- Lai, X.; Liu, J.; Jiang, L.; Wang, L.; Zhao, H.; Liu, S.; Qi, X.; Jia, J. Stratified transformer for 3D point cloud segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8500–8509. [Google Scholar]
- Ma, Y.; Guo, Y.; Liu, H.; Lei, Y.; Wen, G. Global context reasoning for semantic segmentation of 3D point clouds. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 2931–2940. [Google Scholar]
- Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
- Qiang, X.; He, W.; Chen, S.; Lv, Q.; Huang, F. Hierarchical Point Cloud Transformer: A Unified Vegetation Semantic Segmentation Model for Multisource Point Clouds Based on Deep Learning. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–16. [Google Scholar] [CrossRef]
- Wu, X.; Jiang, L.; Wang, P.-S.; Liu, Z.; Liu, X.; Qiao, Y.; Ouyang, W.; He, T.; Zhao, H. Point Transformer V3: Simpler, Faster, Stronger. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 4840–4851. [Google Scholar]
- Chen, Y.; Liu, J.; Ni, B.; Wang, H.; Yang, J.; Liu, N.; Li, T.; Tian, Q. Shape self-correction for unsupervised point cloud understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 8382–8391. [Google Scholar]
- Cho, J.H.; Mall, U.; Bala, K.; Hariharan, B. PiCIE: Unsupervised semantic segmentation using invariance and equivariance in clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16794–16804. [Google Scholar]
- Hoang, C.M.; Kang, B. Pixel-level clustering network for unsupervised image segmentation. Eng. Appl. Artif. Intell. 2024, 127, 107327. [Google Scholar] [CrossRef]
- Niu, D.; Wang, X.; Han, X.; Lian, L.; Herzig, R.; Darrell, T. Unsupervised universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 22744–22754. [Google Scholar]
- Chen, Z.; Xu, H.; Chen, W.; Zhou, Z.; Xiao, H.; Sun, B.; Xie, X. PointDC: Unsupervised Semantic Segmentation of 3D Point Clouds via Cross-modal Distillation and Super-Voxel Clustering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 14290–14299. [Google Scholar]
- Zhang, Z.; Yang, B.; Wang, B.; Li, B. GrowSP: Unsupervised semantic segmentation of 3D point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17619–17629. [Google Scholar]
- Liu, J.; Yu, Z.; Breckon, T.P.; Shum, H.P.H. U3DS3: Unsupervised 3D semantic scene segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 3759–3768. [Google Scholar]
- Zou, P.; Zhao, S.; Huang, W.; Xia, Q.; Wen, C.; Li, W.; Wang, C. AdaCo: Overcoming visual foundation model noise in 3D semantic segmentation via adaptive label correction. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 11086–11094. [Google Scholar]
- Navaneet, K.L.; Mandikal, P.; Agarwal, M.; Babu, R.V. CapNet: Continuous approximation projection for 3D point cloud reconstruction using 2D supervision. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8819–8826. [Google Scholar]
- Wang, H.; Rong, X.; Yang, L.; Wang, S.; Tian, Y. Towards Weakly Supervised Semantic Segmentation in 3D Graph-Structured Point Clouds of Wild Scenes. In Proceedings of the BMVC, Cardiff, UK, 9–12 September 2019; p. 284. [Google Scholar]
- Wang, H.; Rong, X.; Yang, L.; Feng, J.; Xiao, J.; Tian, Y. Weakly supervised semantic segmentation in 3D graph-structured point clouds of wild scenes. arXiv 2020, arXiv:2004.12498. [Google Scholar] [CrossRef]
- Xu, X.; Yuan, Y.; Li, J.; Zhang, Q.; Jie, Z.; Ma, L.; Tang, H.; Sebe, N.; Wang, X. 3D weakly supervised semantic segmentation with 2D vision-language guidance. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 87–104. [Google Scholar]
- Hu, W.; Zhao, H.; Jiang, L.; Jia, J.; Wong, T.-T. Bidirectional projection network for cross dimension scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14373–14382. [Google Scholar]
- Kweon, H.; Yoon, K.-J. Joint learning of 2D-3D weakly supervised semantic segmentation. Adv. Neural Inf. Process. Syst. (NeurIPS) 2022, 35, 30499–30511. [Google Scholar]
- Yang, C.-K.; Chen, M.-H.; Chuang, Y.-Y.; Lin, Y.-Y. 2D–3D interlaced transformer for point cloud segmentation with scene-level supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 977–987. [Google Scholar]
- Sun, T.; Zhang, Z.; Tan, X.; Qu, Y.; Xie, Y. Image understands point cloud: Weakly supervised 3D semantic segmentation via association learning. IEEE Trans. Image Process. 2024, 33, 1838–1852. [Google Scholar] [CrossRef] [PubMed]
- Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929. [Google Scholar]
- Wei, J.; Lin, G.; Yap, K.H.; Hung, T.Y.; Xie, L. Multi-path region mining for weakly supervised 3D semantic segmentation on point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4384–4393. [Google Scholar]
- Cheng, M.; Hui, L.; Xie, J.; Yang, J. SSPC-Net: Semi-supervised semantic 3D point cloud segmentation network. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 1140–1147. [Google Scholar]
- Liu, Z.; Qi, X.; Fu, C.W. One thing one click: A self-training approach for weakly supervised 3D semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1726–1736. [Google Scholar]
- Li, M.; Xie, Y.; Shen, Y.; Ke, B.; Qiao, R.; Ren, B.; Lin, S.; Ma, L. HybridCR: Weakly-supervised 3D point cloud semantic segmentation via hybrid contrastive regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14930–14939. [Google Scholar]
- Liu, G.; van Kaick, O.; Huang, H.; Hu, R. Active self-training for weakly supervised 3D scene semantic segmentation. Comput. Vis. Media 2024, 10, 425–438. [Google Scholar] [CrossRef]
- Wu, Z.; Wu, Y.; Lin, G.; Cai, J. Reliability-Adaptive Consistency Regularization for Weakly-Supervised Point Cloud Segmentation. Int. J. Comput. Vis. 2024, 132, 2276–2289. [Google Scholar] [CrossRef]
- Deng, J.; Lu, J.; Zhang, T. Quantity-Quality Enhanced Self-Training Network for Weakly Supervised Point Cloud Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 3580–3596. [Google Scholar] [CrossRef] [PubMed]
- Xu, X.; Lee, G.H. Weakly supervised semantic point cloud segmentation: Towards 10× fewer labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13706–13715. [Google Scholar]
- Wei, J.; Lin, G.; Yap, K.H.; Liu, F.; Hung, T.Y. Dense supervision propagation for weakly supervised semantic segmentation on 3D point clouds. arXiv 2021, arXiv:2107.11267. [Google Scholar] [CrossRef]
- Su, Y.; Xu, X.; Jia, K. Weakly supervised 3D point cloud segmentation via multi-prototype learning. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 7723–7736. [Google Scholar] [CrossRef]
- Han, X.F.; Cheng, H.; Jiang, H.; He, D.; Xiao, G. PCB-RandNet: Rethinking random sampling for LiDAR semantic segmentation in autonomous driving scene. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 4435–4441. [Google Scholar]
- Zhang, Y.; Qu, Y.; Xie, Y.; Li, Z.; Zheng, S.; Li, C. Perturbed self-distillation: Weakly supervised large-scale point cloud semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 15520–15528. [Google Scholar]
- Lee, M.S.; Yang, S.W.; Han, S.W. GaIA: Graphical information gain based attention network for weakly supervised point cloud semantic segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 582–591. [Google Scholar]
- Wu, B.; Zhou, X.; Zhao, S.; Yue, X.; Keutzer, K. SqueezeSegV2: Improved Model Structure and Unsupervised Domain Adaptation for Road-Object Segmentation from a LiDAR Point Cloud. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 4376–4382. [Google Scholar]
- Yi, E.; Yang, J.; Kim, J. Enhanced Prototypical Learning for Unsupervised Domain Adaptation in LiDAR Semantic Segmentation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 1–7. [Google Scholar]
- Shaban, A.; Lee, J.; Jung, S.; Meng, X.; Boots, B. LiDAR-UDA: Self-ensembling Through Time for Unsupervised LiDAR Domain Adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 19784–19794. [Google Scholar]
- Pan, F.; Yin, X.; Lee, S.; Niu, A.; Yoon, S.; Kweon, I.S. MoDA: Leveraging Motion Priors from Videos for Advancing Unsupervised Domain Adaptation in Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Seattle, WA, USA, 17–21 June 2024; pp. 2649–2658. [Google Scholar]
- Bian, Y.; Xie, J.; Qian, J. Unsupervised domain adaptive point cloud semantic segmentation. In Proceedings of the Asian Conference on Pattern Recognition, Jeju Island, Republic of Korea, 9–12 November 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 285–298. [Google Scholar]
- Zhao, X.; Mithun, N.C.; Rajvanshi, A.; Chiu, H.-P.; Samarasekera, S. Unsupervised Domain Adaptation for Semantic Segmentation With Pseudo Label Self-Refinement. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 2399–2409. [Google Scholar]
- Xu, J.; Yang, W.; Kong, L.; Liu, Y.; Zhou, Q.; Zhang, R.; Li, Z.; Chen, W.-M.; Fei, B. Visual foundation models boost cross-modal unsupervised domain adaptation for 3D semantic segmentation. IEEE Trans. Intell. Transp. Syst. 2025, 26, 20287–20301. [Google Scholar] [CrossRef]
- Wu, Y.; Xing, M.; Zhang, Y.; Xie, Y.; Qu, Y. CLIP2UDA: Making frozen CLIP reward unsupervised domain adaptation in 3D semantic segmentation. In Proceedings of the 32nd ACM International Conference on Multimedia (MM), Melbourne, VIC, Australia, 28 October–1 November 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 8662–8671. [Google Scholar] [CrossRef]
- Saltori, C.; Galasso, F.; Fiameni, G.; Sebe, N.; Ricci, E.; Poiesi, F. CoSMix: Compositional Semantic Mix for Domain Adaptation in 3D LiDAR Segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 586–602. [Google Scholar]
- Zhao, H.; Zhang, J.; Chen, Z.; Zhao, S.; Tao, D. UniMix: Towards Domain Adaptive and Generalizable LiDAR Semantic Segmentation in Adverse Weather. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 14781–14791. [Google Scholar]
- Xiao, A.; Huang, J.; Xuan, W.; Ren, R.; Liu, K.; Guan, D.; El Saddik, A.; Lu, S.; Xing, E.P. 3D semantic segmentation in the wild: Learning generalized models for adverse-condition point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 9382–9392. [Google Scholar]
- Kim, H.; Kang, Y.; Oh, C.; Yoon, K.-J. Single domain generalization for LiDAR semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 17587–17598. [Google Scholar]
- Li, M.; Zhang, Y.; Ma, X.; Qu, Y.; Fu, Y. BEV-DG: Cross-modal learning under bird’s-eye view for domain generalization of 3D semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 11632–11642. [Google Scholar]
- Saltori, C.; Osep, A.; Ricci, E.; Leal-Taixé, L. Walking your LiDoG: A journey through multiple domains for LiDAR semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 196–206. [Google Scholar]
- Ryu, K.; Hwang, S.; Park, J. Instant domain augmentation for LiDAR semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–23 June 2023; pp. 9350–9360. [Google Scholar]
- Sanchez, J.; Deschaud, J.-E.; Goulette, F. Domain generalization of 3D semantic segmentation in autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 18077–18087. [Google Scholar]
- Kim, J.; Woo, J.; Shin, U.; Oh, J.; Im, S. Density-aware domain generalization for LiDAR semantic segmentation. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; pp. 9573–9580. [Google Scholar]
- Hu, Q.; Yang, B.; Fang, G.; Guo, Y.; Leonardis, A.; Trigoni, N.; Markham, A. SQN: Weakly-supervised semantic segmentation of large-scale 3D point clouds. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part XXVII. Springer: Berlin/Heidelberg, Germany, 2022; pp. 600–619. [Google Scholar]
- Hu, Q.; Yang, B.; Khalid, S.; Xiao, W.; Trigoni, N.; Markham, A. SensatUrban: Learning semantics from urban-scale photogrammetric point clouds. Int. J. Comput. Vis. 2022, 130, 316–343. [Google Scholar] [CrossRef]
- Zhang, J.; Chen, L.; Ouyang, B.; Liu, B.; Zhu, J.; Chen, Y.; Meng, Y.; Wu, D. PointCutMix: Regularization strategy for point cloud classification. Neurocomputing 2022, 505, 58–67. [Google Scholar] [CrossRef]
- Nekrasov, A.; Schult, J.; Litany, O.; Leibe, B.; Engelmann, F. Mix3D: Out-of-context data augmentation for 3D scenes. In Proceedings of the 2021 International Conference on 3D Vision (3DV), London, UK, 1–3 December 2021; IEEE: New York, NY, USA, 2021; pp. 116–125. [Google Scholar]
- Langer, F.; Milioto, A.; Haag, A.; Behley, J.; Stachniss, C. Domain transfer for semantic segmentation of LiDAR data using deep neural networks. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; IEEE: New York, NY, USA, 2020; pp. 8263–8270. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.