Article

Advancing ALS Applications with Large-Scale Pre-Training: Framework, Dataset, and Downstream Assessment

National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba 305-8560, Japan
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(11), 1859; https://doi.org/10.3390/rs17111859
Submission received: 12 April 2025 / Revised: 16 May 2025 / Accepted: 24 May 2025 / Published: 27 May 2025

Abstract

The pre-training and fine-tuning paradigm has significantly advanced satellite remote sensing applications. However, its potential remains largely underexplored for airborne laser scanning (ALS), a key technology in domains such as forest management and urban planning. In this study, we address this gap by constructing a large-scale ALS point cloud dataset and evaluating its effectiveness in downstream applications. We first propose a simple, generalizable framework for dataset construction, designed to maximize land cover and terrain diversity while allowing flexible control over dataset size. We instantiate this framework using ALS, land cover, and terrain data collected across the contiguous United States, resulting in a dataset geographically covering 17,000+ km² (184 billion points) with diverse land cover and terrain types included. As a baseline self-supervised learning model, we adopt BEV-MAE, a state-of-the-art masked autoencoder for 3D outdoor point clouds, and pre-train it on the constructed dataset. The resulting models are fine-tuned for several downstream tasks, including tree species classification, terrain scene recognition, and point cloud semantic segmentation. Our results show that pre-trained models consistently outperform their counterparts trained from scratch across all downstream tasks, demonstrating the strong transferability of the learned representations. Additionally, we find that scaling the dataset using the proposed framework leads to consistent performance improvements, whereas datasets constructed via random sampling fail to achieve comparable gains.

1. Introduction

Airborne laser scanning (ALS) is an important remote sensing technology that captures high-resolution, three-dimensional spatial data by emitting laser pulses from an airborne platform and analyzing the reflected signals. This process generates dense light detection and ranging (LiDAR) point clouds, which accurately represent the Earth’s surface including both natural and built environments. A significant advantage of ALS is its ability to penetrate vegetation and provide precise measurements, making it particularly valuable for applications such as terrain mapping [1], forest management [2], urban planning [3], and disaster management [4].
Large-scale pre-training and fine-tuning paradigms have been transformative across various artificial intelligence (AI) fields [5,6]. These paradigms involve extensive pre-training on diverse datasets, enabling models to adapt effectively to a wide range of downstream tasks through fine-tuning. Commonly referred to as foundation models [7], they leverage large-scale self-supervised/unsupervised training to learn generalizable representations. Satellite remote sensing has also greatly benefited from this trend. By pre-training on large-scale unlabeled datasets, such as Sentinel-2, remote sensing foundation models [8,9,10] achieve state-of-the-art performance on a variety of downstream tasks, including scene classification, land cover mapping, and multi-temporal cloud imputation.
However, large-scale pre-training and fine-tuning paradigms have yet to demonstrate their full impact on ALS applications. Although several large-scale datasets, such as OpenGF [1] and PureForest [2], exist, they lack the scale and land cover diversity necessary for training versatile models. Furthermore, while numerous freely available ALS LiDAR data sources, such as the United States Geological Survey 3D Elevation Program (USGS 3DEP) [11] and the Actueel Hoogtebestand Nederland (AHN) [12], provide extensive resources, there is currently no efficient method to extract data from these sources. Leveraging all available data is computationally prohibitive and often redundant. These limitations collectively hinder progress in adopting the pre-training and fine-tuning paradigm for ALS applications.
To address the aforementioned limitations, this study focuses on developing a large-scale dataset for pre-training ALS models and evaluating its effectiveness on downstream tasks (The source code and pre-trained models will be made publicly available at https://github.com/martianxiu/ALS_pretraining (accessed on 23 May 2025)). First, we introduce a simple and general framework for dataset construction. The framework takes land cover maps, digital elevation models (DEMs), and ALS point clouds as input, and performs sampling with explicit consideration of land cover and terrain diversity. Specifically, we propose a joint land cover–terrain inverse probability sampling method that enables flexible control over dataset size while ensuring diverse representation, which is essential for ALS downstream applications. We instantiate this framework using nationwide ALS, land cover, and DEM data collected across the contiguous United States, constructing a large-scale dataset tailored for pre-training. Second, to assess the utility of the proposed framework and constructed dataset, we perform pre-training and fine-tuning on various downstream tasks. We adopt BEV-MAE [13], a self-supervised learning (SSL) model designed for outdoor 3D point clouds, as our baseline due to its state-of-the-art performance and suitability for ALS data. The model is fine-tuned on several downstream applications including tree species classification, terrain scene recognition, and point cloud segmentation. Additionally, we demonstrate the effectiveness of the proposed framework by scaling up the dataset and monitoring relative performance improvements, comparing it against alternative datasets and random sampling methods.
In summary, our contributions are as follows:
  • We propose a simple and general framework for large-scale ALS dataset construction that maximizes land cover and terrain diversity while ensuring flexible control over dataset size.
  • We construct a large-scale dataset for SSL on ALS point clouds.
  • We pre-train and fine-tune models on the constructed dataset and conduct extensive experiments to evaluate the effectiveness of the proposed framework, the quality of the dataset, and the performance of the pre-trained models. Additionally, to assess performance in recognizing different terrain types, we create a terrain scene recognition dataset derived from an existing dataset originally developed for ground filtering.
The remainder of the paper is organized as follows: Section 2 reviews and summarizes related work. Section 3 presents the overall framework design, data sources, and the construction of the final dataset. In Section 4, we evaluate the quality of the constructed dataset in terms of point density, ground point distribution, and return characteristics. Section 5 details the model architecture and experimental setup. Section 6 presents and discusses the experimental results. Finally, Section 7 concludes the study.

2. Related Work

2.1. Pre-Training and Fine-Tuning Paradigms Applied to Satellite Remote Sensing

Recently, AI has experienced a major paradigm shift from the supervised learning paradigm to the pre-train–fine-tune paradigm, where a large model is pre-trained with self-supervision on large-scale data and subsequently fine-tuned on downstream or target tasks [7]. These pre-trained models, known as foundation models, are designed to be broadly applicable across a wide range of tasks with minimal adaptation. Prominent examples include GPT-3 [5], CLIP [6], and LLaVA [14].
The remote sensing community has quickly embraced this trend, with many researchers exploring the potential of foundation models in this domain. Recently, numerous vision-based foundation models for remote sensing have emerged. Early works adopted contrastive learning approaches to construct foundation models without the need for annotations [15,16,17]. These methods often extend existing approaches from computer vision while incorporating characteristics unique to satellite imagery. For example, Ayush et al. [15] adapted MoCo-v2 [18] for remote sensing data by reformulating the pretext task to utilize geolocation and temporal image pairs. SeCo [16] designed self-supervision tasks that leverage the seasonal and positional invariances in remote sensing data, acquiring representations invariant to seasonal and synthetic augmentations. Similarly, CaCo [17] introduced a novel objective function that contrasts long- and short-term changes within the same geographical regions. More recently, SkySense [8] was introduced as a billion-scale foundation model pre-trained on multi-temporal optical and SAR images. It adopts multi-granularity contrastive learning to capture representations across different modalities and spatial granularities, achieving state-of-the-art (SoTA) performance on seven downstream tasks.
In addition to contrastive learning, numerous approaches based on MAE [19] have also been proposed. For instance, SatMAE [20] incorporates temporal and spectral position embeddings to effectively utilize temporal and multispectral information. RingMo [21] introduces a novel masking strategy to better preserve dense and small objects during masking. GFM [9] leverages the strong representations learned from ImageNet-22k [22] and enhances remote sensing image representation through continual pre-training. To better use spectral information in satellite imagery, SpectralGPT [10] introduces a 3D masking strategy and spectral-to-spectral reconstruction. Meanwhile, msGFM [23] incorporates cross-sensor pre-training using four different modalities to learn unified multi-sensor representations. This approach outperforms single-sensor foundation models across four downstream datasets.
Research into vision–language models (VLMs) is also highly active, as these models enable zero-shot applications. For instance, RemoteCLIP [24] and SkyCLIP [25] adapt CLIP for remote sensing datasets, outperforming standard CLIP baselines. GRAFT [26] introduces a pre-training framework that uses ground images as intermediaries to connect text with satellite imagery, enabling pre-training without textual annotations. GeoChat [27] fine-tunes LLaVA-1.5 [28] on a proposed instruction-following dataset, demonstrating promising zero-shot performance across a wide range of tasks, including image and region captioning, visual question answering, and scene classification.
Despite the significant advancements in applying pre-training and fine-tuning paradigms to satellite imagery, to the best of our knowledge, no prior work has investigated its application to ALS data—an area we aim to explore in this study.

2.2. Datasets for 3D Geospatial Applications

With the rapid advancements in 3D acquisition technologies, the availability of outdoor point cloud datasets has grown significantly, driving progress in 3D geospatial data analysis through deep learning techniques. Existing datasets can be broadly categorized based on their data collection methods. Photogrammetric 3D datasets, such as Campus3D [29], SensatUrban [30], HRHD-HK [31], and STPLS3D [32], are generated using photogrammetry techniques but lack ground points beneath dense vegetation canopies due to the limitations of passive image capture, making them unsuitable for ALS applications requiring dense vegetation analysis. Terrestrial and mobile laser scanning (TLS/MLS) datasets, including Semantic3D [33], Paris-Lille-3D [34], SemanticKITTI [35], and Toronto-3D [36], are collected at street level and focus on roadway scene understanding. While they provide high point density and large data volumes, their limited geographic coverage and restricted diversity in land cover and terrain make them inadequate for broader ALS applications. ALS datasets, such as ISPRS Vaihingen 3D [37], DublinCity [38], LASDU [39], DALES [3], and OpenGF [1], are collected using airborne LiDAR sensors and primarily target urban classification and environmental perception by identifying common urban objects like ground, grass, fences, cars, and facades. However, their limited scale and coverage restrict their utility for training versatile models across diverse ALS tasks.
In contrast to existing datasets that are designed for specific applications, this study proposes a general framework for constructing a large-scale dataset for pre-training, aiming to tackle various ALS applications.

2.3. SSL Methods for 3D Point Clouds

2.3.1. SSL Methods for General 3D Point Clouds

SSL enables neural networks to learn from unlabeled data, making it ideal for 3D point clouds where annotations are costly. We mainly focus on masked autoencoding-based SSL methods as they are most relevant to this study. Masked autoencoding methods learn meaningful representations by reconstructing randomly masked input portions, such as image patches or text tokens, capturing structural and contextual information.
Early works focus on generalizing BERT [40]’s masked language modeling to point clouds [41,42,43]. A representative example is Point-BERT [41], which trains a transformer encoder to predict masked dVAE [44]-generated tokens. Following the proposal of MAE, this idea was extended to point clouds. For example, Point-MAE [45] applies the concept by treating local point neighborhoods as patches for reconstruction. MaskPoint [46] introduces a masked discrimination task, replacing reconstruction with real/noise discrimination to improve robustness against sampling variance. Point-M2AE [47] incorporates a hierarchical masking strategy for multi-scale pre-training, while PointGPT [48] adopts GPT-style generative pre-training with a decoder tasked to generate point patches. MaskFeat3D [49] instead reconstructs surface properties, like normals, to learn higher-level features.
Some methods explore multi-modal approaches to enhance representation quality. I2P-MAE [50] incorporates 2D-guided masking and reconstruction using knowledge from pre-trained 2D models. Joint-MAE [51] performs joint masking and reconstruction of 2D and 3D data with shared encoders and decoders. RECON [52] combines masked autoencoding and contrastive learning, leveraging their respective strengths while handling inputs from points, images, and text.
Extensive work has focused on autonomous driving. Unlike indoor or synthetic point clouds, outdoor point clouds are sparse and have varying density. Traditional masked point modeling strategies often create overlapping patches, discarding important points. To address this, Voxel-MAE [53] uses a voxel-based masking strategy, predicting point coordinates, point counts per voxel, and voxel occupancy to better capture outdoor data distributions. Geo-MAE [54] improves further by predicting centroids, surface normals, and curvatures. GD-MAE [55] introduces a generative decoder, eliminating the need for complex decoders or masking strategies. Recently, BEV-MAE [13] explicitly focuses on learning BEV representations, achieving superior and efficient performance.
In this work, we focus on the impact of pre-training on downstream tasks for ALS data rather than designing a new network. We adopt BEV-MAE as it suits ALS data well, with details provided in Section 5.1.

2.3.2. SSL Methods for ALS 3D Point Clouds

SSL has recently been applied to ALS. Ref. [56] uses Barlow Twins [57] to improve semantic segmentation, especially for under-represented categories. Ref. [58] proposes a deep clustering and contrastive learning approach for unsupervised change detection, outperforming traditional methods. Ref. [59] pre-trains customized transformers under the MAE framework for 3D roof reconstruction, surpassing general MAE-based methods like Point-MAE and Point-M2AE. HAVANA [60] enhances contrastive learning by improving negative sample quality through AbsPAN.
Although current SSL methods for ALS show promise, they often fall short in performance or struggle to adapt to different downstream tasks. In this study, we address these limitations by proposing a framework to build a large-scale dataset, enabling scalable pre-training and the development of general-purpose models for ALS applications.

3. Framework and Instantiation

In this section, we propose a general framework for constructing large-scale ALS datasets for pre-training. The framework is guided by the principle of maximizing data diversity while maintaining sampling feasibility and downstream applicability. We then instantiate this framework using publicly available United States datasets, including 3DEP point clouds, land cover maps from the National Land Cover Database (NLCD), and DEMs. An overview of the framework is illustrated in Figure 1.

3.1. Framework Design

Since ALS is commonly used to extract surface objects and terrain information, it is essential that the pre-training dataset captures a wide variety of land cover and terrain types. To address this need, we propose a framework specifically designed to effectively incorporate and manage this diversity while being able to flexibly handle dataset size.

3.1.1. Land Cover and Terrain Information

The first input to the framework is a land cover map—a geospatial raster product that classifies the Earth’s surface into categories such as forest, urban areas, water bodies, and barren land. Each pixel in the map is assigned a discrete label, typically derived from satellite imagery using classification algorithms. Land cover products like the NLCD can be directly used as input, as they are already formatted as classification maps suitable for further processing.
The second input is a DEM, which provides information about terrain elevation. However, since each DEM pixel only records local elevation, it fails to directly reflect terrain complexity. This makes it difficult to jointly analyze land cover and terrain characteristics. To address this, we propose using a derived slope map, a classification map based on local elevation gradients, to represent terrain complexity.
Given a DEM $Z(x, y)$, the slope angle $\theta(x, y)$ in degrees is calculated as

$\theta(x, y) = \arctan\!\left( \sqrt{ \left( \frac{\partial Z}{\partial x} \right)^{2} + \left( \frac{\partial Z}{\partial y} \right)^{2} } \right) \cdot \frac{180}{\pi} .$
Here, the partial derivatives are approximated using central differences. The final slope classification map is obtained by discretizing the slope values using pre-defined thresholds (details provided in Section 3.2.2). By using a slope map instead of the raw DEM, our framework can effectively integrate both land cover and terrain complexity.
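For illustration, a minimal NumPy sketch of this slope derivation is given below; the 30 m pixel size and the two class thresholds are placeholders standing in for the actual DEM resolution and the classification scheme in Table 2.

```python
import numpy as np

def slope_class_map(dem, pixel_size=30.0, thresholds=(8.0, 30.0)):
    """Convert a DEM (2D array of elevations in meters) into a slope
    classification map. Threshold values are placeholders; the actual
    classes follow the USDA-based scheme in Table 2."""
    # Central-difference gradients of elevation along y and x.
    dz_dy, dz_dx = np.gradient(dem, pixel_size)
    # Slope angle in degrees.
    slope_deg = np.degrees(np.arctan(np.sqrt(dz_dx**2 + dz_dy**2)))
    # Discretize into classes, e.g., 0 = Flat, 1 = Sloped, 2 = Steep.
    return np.digitize(slope_deg, bins=list(thresholds))
```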

3.1.2. Joint Land Cover–Terrain Inverse Probability Sampling

Due to the highly imbalanced distribution of land surface types and the large volume of LiDAR data, sampling is necessary to build a diverse yet manageable dataset for pre-training. However, random sampling tends to over-represent common landscapes while under-representing rare features, leading to biased datasets.
To overcome this, we adopt an inverse probability sampling strategy based on the joint distribution of land cover and terrain (slope) classes. Let $C = \{c_1, c_2, \ldots, c_m\}$ denote the set of land cover classes and $T = \{t_1, t_2, \ldots, t_n\}$ the set of slope (terrain) classes. Each sampling unit (point cloud tile) is assigned a tuple $(c_i, t_j)$ based on its dominant land cover and terrain classes. Let $P(c_i, t_j)$ denote the empirical joint distribution over all tiles. We define the inverse probability weight for each combination as

$w(c_i, t_j) = \frac{1}{P(c_i, t_j) + \varepsilon},$

where $\varepsilon$ is a small constant added for numerical stability. The normalized sampling probability is then given by

$\pi(c_i, t_j) = \frac{w(c_i, t_j)}{\sum_{k=1}^{m} \sum_{l=1}^{n} w(c_k, t_l)} .$
Sampling is performed by drawing tiles according to the distribution π ( c i , t j ) . This encourages the inclusion of rare combinations of land cover and terrain while down-weighting dominant types, promoting balance and diversity. A key advantage of this sampling strategy is its scalability: users can draw as many samples as needed—subject to storage or training constraints—while maintaining class balance. This enables flexible dataset construction without compromising representativeness.
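In code, this sampling step reduces to a weighted draw over tiles, where each tile carries the inverse frequency of its (land cover, slope) combination. The sketch below is a simplified illustration of this idea, not the exact implementation.

```python
import numpy as np
from collections import Counter

def inverse_probability_sample(tile_labels, n_samples, eps=1e-6, seed=0):
    """tile_labels: list of (land_cover, slope_class) tuples, one per tile.
    Returns indices of sampled tiles, drawn with probability inversely
    proportional to the empirical frequency of each label combination."""
    counts = Counter(tile_labels)
    total = len(tile_labels)
    # Each tile inherits the inverse of its combination's empirical probability.
    weights = np.array([1.0 / (counts[lab] / total + eps) for lab in tile_labels])
    # Normalize to per-tile sampling probabilities.
    probs = weights / weights.sum()
    rng = np.random.default_rng(seed)
    return rng.choice(total, size=n_samples, replace=False, p=probs)
```

Because every tile is weighted by the inverse frequency of its combination, rare (land cover, slope) pairs are drawn roughly as often as common ones, which is the balancing behavior described above.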

3.2. Data Source

3.2.1. LiDAR Point Cloud

We use LiDAR data from the USGS 3DEP [61] as the primary data source for building a large-scale dataset for pre-training. The 3DEP collaborative program is designed to accelerate the collection of three-dimensional elevation data across the United States to meet a wide variety of needs [11].
The 3DEP data are well-suited for this study for two reasons: (1) its base specifications are designed for consistent data acquisition and the production of derived products, allowing the entire collection to be treated as a unified “3DEP” dataset; and (2) it captures the diverse land cover and varied terrain of the United States, making it an excellent foundation for constructing robust and versatile models with broad applicability across a wide range of geospatial and environmental contexts.
The point cloud data are accessible via AWS [62], allowing for programmatic downloads. Additionally, each LiDAR point cloud includes boundary data, enabling users to easily define and select their area of interest. The point cloud boundaries used in this study are depicted in Figure 2.

3.2.2. Land Cover and Terrain Information

We use the NLCD as the source of land cover information. The NLCD product suite provides comprehensive data on nationwide land cover and changes over two decades (2001–2021), offering detailed, long-term insights into land surface dynamics. We utilize the latest NLCD2021 release, which includes land cover maps for 2001, 2004, 2006, 2008, 2011, 2013, 2016, 2019, and 2021. The NLCD2021 follows the same protocols and procedures as previous releases, ensuring compatibility with the 2019 database. As a result, analysis conducted for NLCD 2019 can be useful for understanding NLCD2021. For instance, the validation report for the 2019 release [63] is used to understand classification accuracy of the land cover classes, which will later inform the selection of reliable land covers for data downloads.
The NLCD land cover product uses an adapted version of the Anderson Level II classification system, which includes 16 land cover classes (excluding those specific to Alaska). We use the Level I classification system by merging Level II classes, addressing the moderate per-class accuracy reported for the Level II system [63]. The Level I and Level II classification systems are presented in Table 1, with detailed definitions available in [64].
For DEM data, we utilize the seamless digital elevation models provided by the National Map. Specifically, we use the 1 arc-second DEM to ensure consistency in spatial resolution with the NLCD land cover maps, enabling accurate joint analysis. The DEM is converted into a slope classification map, which is then discretized based on the classification system outlined in Table 2, adapted from the USDA slope classification scheme [65]. The dataset is downloaded programmatically using the py3dep Python package [66].

3.3. Framework Instantiation

3.3.1. Data Processing and Sampling Details

We first extract land cover maps from NLCD2021 by aligning the map years with the point cloud capture years. If an exact match is unavailable, the map from the closest year is used. For each point cloud project in 3DEP, the land cover map and DEM are projected to the local Universal Transverse Mercator (UTM) coordinate reference system (CRS). The point cloud boundary file is then used to crop both maps, reducing computational complexity.
Next, each cropped region is divided into 500 m × 500 m tiles, and the most frequent land cover and slope classes within a tile are assigned as dominant labels. Then, we sample point cloud tiles following the sampling strategy described in Section 3.1.2. For land cover, we focus exclusively on the “Developed” and “Forest” classes, as their combinations with slope classes effectively address common ALS downstream tasks and they are among the most reliable land cover classifications [67]. We record the bounding box coordinates for each tile, facilitating programmatic downloads of selected point cloud tiles from the 3DEP database hosted on AWS. Geospatial operations such as reprojection are conducted using the rasterio and GeoPandas Python libraries, while PDAL is used for downloading data from the 3DEP server. The examples of the land cover map, slope map, slope classification map, and the sampling results are shown in Figure 3.
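The tiling and dominant-label assignment described above can be sketched as follows, assuming the land cover and slope rasters have already been reprojected, cropped, and aligned; the array names and the 30 m pixel size are illustrative.

```python
import numpy as np

def dominant_labels_per_tile(landcover, slope_cls, pixel_size=30.0, tile_m=500.0):
    """landcover, slope_cls: aligned 2D rasters of non-negative integer class
    codes, cropped to the project boundary (same shape, same CRS). Returns a
    dict mapping tile indices (row, col) to the dominant (land cover, slope)
    class pair of each tile."""
    px = int(round(tile_m / pixel_size))  # pixels per 500 m tile edge
    n_rows, n_cols = landcover.shape
    labels = {}
    for i in range(0, n_rows - px + 1, px):
        for j in range(0, n_cols - px + 1, px):
            lc_patch = landcover[i:i + px, j:j + px].ravel()
            sl_patch = slope_cls[i:i + px, j:j + px].ravel()
            # The most frequent class within the tile becomes its dominant label.
            lc_dom = np.bincount(lc_patch).argmax()
            sl_dom = np.bincount(sl_patch).argmax()
            labels[(i // px, j // px)] = (int(lc_dom), int(sl_dom))
    return labels
```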

3.3.2. Dataset Statistics and Comparison

We construct the dataset using the proposed framework and the data sources described above. We limit the number of LiDAR tiles per LiDAR project to 40 due to resource constraints. The distribution of tiles across classes is summarized in Table 3.
A total of 73,762 tiles are sampled, with a relatively balanced representation of Developed and Forest land cover. Flat areas dominate the dataset, while Sloped and Steep areas are less represented due to real-world geographic constraints—Forests are prevalent in steeper terrains, whereas Developed areas are mostly found in Flat regions.
Table 4 presents a comparison of representative ALS datasets. As shown, our dataset stands out as the largest in both geographical coverage and number of points. While most existing datasets are tailored for specific tasks and thus have limited land cover types (e.g., urban or forest), our dataset encompasses a broader range of categories, including Developed areas such as metropolitan regions and villages, as well as mountainous and Forested landscapes. Furthermore, our sampling strategy ensures a diverse range of terrain types, from Flat to Steep, at scale, whereas other datasets often lack such terrain diversity. Although the OpenGF and ISPRS filtertest datasets include conceptually similar land cover and terrain types to ours, their geographical coverage and number of points are significantly smaller, making them less suitable for large-scale pre-training and fine-tuning applications.
Additionally, random samples of the extracted point cloud tiles are presented in Figure 4. Developed tiles include built-up areas such as cities and villages, while Forest tiles primarily consist of vegetation. Furthermore, the increasing slope classes illustrate the growing complexity of terrain conditions, transitioning from Flat to Steep regions.

4. Dataset Validation

In this section, we analyze key characteristics of the constructed dataset, including point density per square meter, ground point standard deviation, and return characteristics, in order to validate its quality. Due to the dataset’s large size, we conduct this analysis on a subset created through random sampling. Specifically, we randomly select 30% of the dataset (22,129 tiles).

4.1. Point Density

Table 5 presents the mean and standard deviation (mean/std) of density values across land cover types (Developed and Forest) and slope categories (Flat, Sloped, and Steep). Forested areas consistently demonstrate higher mean densities and greater variability compared to Developed areas in both the Flat and Sloped categories. This is a natural outcome of multiple returns caused by vegetation layers, such as tree canopies, compared to the smoother, engineered surfaces typically found in Developed areas. The high mean and variability observed in the Developed and Steep class may be attributed to the following factors: (1) although labeled as Developed, a majority of the tiles in this class primarily contain vegetation with minimal artificial structures; and (2) the presence of high-density LiDAR projects, which elevate both the mean and standard deviation values. Overall, Forested areas exhibit higher densities than Developed areas across all topographies, highlighting the impact of vegetation and natural irregularities. Additionally, the observed increase in density from Flat to Steep terrain aligns with growing terrain complexity, validating the dataset’s alignment with real-world characteristics.

4.2. Ground Point Standard Deviation

Table 6 summarizes the mean and standard deviation (mean/std) of ground point standard deviation across land cover types and topographic categories. The results show that Forested areas consistently exhibit greater variability in ground elevation compared to Developed areas across all topographies. This disparity is expected, as Forested areas often feature irregular terrain, whereas Developed areas are typically engineered for smoothness. Moreover, the observed increase in variability from Flat to Steep terrain aligns with expectations, indicating meaningful distinctions between land cover and topographic categories.

4.3. Return Characteristics

Table 7 summarizes the return characteristics for the Developed class. The distribution is dominated by “First” and “Last” returns, with “Single” returns also comprising a substantial portion. Intermediate and higher-order returns contribute minimally, while “First of Many” and “Last of Many” together account for roughly 26%. This pattern reflects the simpler geometric structures typically found in Developed areas.
Table 8 presents the return characteristics of the Forest class. “Last” and “First” returns dominate, each representing approximately 69%, while “Single” returns account for 49.32%. Intermediate returns, such as “First of Many” and “Second”, suggest significant interactions with the canopy, whereas higher-order returns collectively contribute minimally. Similar to the Developed class, this distribution reflects the dominance of single and terminal returns from the canopy and ground, with limited deeper returns, aligning with the vertical structure characteristic of Forested environments.
As such, the dataset’s density, ground variability, and return characteristics exhibit expected trends across various land cover types and topographic categories, confirming the validity of the proposed framework and constructed dataset.
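For reference, the per-tile quantities analyzed in this section can be derived directly from the LAS/LAZ tiles; the sketch below uses the laspy library and assumes that ASPRS class 2 marks ground points and that each tile covers 500 m × 500 m.

```python
import numpy as np
import laspy

def tile_statistics(path, tile_area_m2=500.0 * 500.0):
    """Compute point density, ground elevation spread, and return-type
    fractions for a single LAS/LAZ tile."""
    las = laspy.read(path)
    density = len(las.points) / tile_area_m2  # points per square meter
    # Standard deviation of ground (class 2) elevations.
    ground_z = np.asarray(las.z)[np.asarray(las.classification) == 2]
    ground_std = float(ground_z.std()) if ground_z.size else float("nan")
    # Return-type fractions from return number / number of returns.
    rn = np.asarray(las.return_number)
    nr = np.asarray(las.number_of_returns)
    returns = {
        "single": float(np.mean(nr == 1)),
        "first_of_many": float(np.mean((rn == 1) & (nr > 1))),
        "last_of_many": float(np.mean((rn == nr) & (nr > 1))),
    }
    return density, ground_std, returns
```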

5. Experiments

In this section, we describe the model design and provide details of the pre-training experiments. Next, we introduce the downstream tasks, datasets, and fine-tuning architectures employed for these tasks. Finally, we describe the evaluation metrics used in this study.

5.1. Model Architecture and Pre-Training

5.1.1. Model Description

We adopt BEV-MAE [13] as our pre-training method. BEV-MAE is a state-of-the-art pre-training method for 3D point clouds that was originally proposed for autonomous driving; it is designed for outdoor data and processes point clouds from a Bird’s-Eye-View (BEV) perspective. BEV-MAE inherits its design from masked autoencoders (MAEs) [19] in 2D image processing, an SSL technique that learns representations by masking out a large portion of the data and reconstructing it. This form of pre-training encourages the model to learn high-level representations by forcing it to reconstruct the invisible parts.
BEV-MAE follows a processing pipeline similar to MAE’s: masking, backbone encoding, and reconstruction. An element of the mask (corresponding to a masked patch in image processing) in BEV-MAE is a 3D pillar (or BEV cell) that encapsulates a volume of 3D points. Specifically, given defined x and y ranges, all points that fall within a selected range are removed. All masked elements are replaced by the same learnable mask token. After masking, the remaining point clouds, together with the mask tokens, are fed into the encoder network. The encoder is a 3D sparse CNN [71,72], a memory-efficient variant of voxel-based 3D CNNs. It learns a multi-scale representation of the point cloud through successive convolutions and downsampling. After the encoder, the masked point clouds are reconstructed from the corresponding mask tokens with a series of convolutions. In addition, the average number of points in each masked region is also reconstructed so that the model learns about local density.
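To make the masking step concrete, the following sketch masks points by BEV cell on a raw coordinate array; the cell size and masking ratio are illustrative values rather than the exact BEV-MAE configuration.

```python
import numpy as np

def mask_bev_cells(points, cell_size=4.8, mask_ratio=0.7, seed=0):
    """points: (N, 3) array of x, y, z coordinates.
    Assigns each point to a BEV cell on the x-y plane, randomly selects a
    fraction of the occupied cells, and returns a boolean mask marking the
    points that fall inside the masked cells."""
    cells = np.floor(points[:, :2] / cell_size).astype(np.int64)
    # Unique occupied BEV cells and the cell index of every point.
    uniq, inverse = np.unique(cells, axis=0, return_inverse=True)
    rng = np.random.default_rng(seed)
    n_masked = int(mask_ratio * len(uniq))
    masked_cells = rng.choice(len(uniq), size=n_masked, replace=False)
    # True for points whose BEV cell is masked (to be replaced by mask tokens).
    return np.isin(inverse, masked_cells)
```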
We consider BEV-MAE to be suitable for this study for the following reasons. Unlike other MAEs for point clouds such as MaskPoint [46] and Point-MAE [45], BEV-MAE is explicitly designed for handling outdoor data. Second, BEV-MAE adopts sparse CNN as its backbone, which is more scalable to large-scale point clouds compared to point-based methods. Last but not least, it is explicitly designed to utilize BEV perspective that is natural for ALS as it fits the way ALS point clouds are captured.

5.1.2. Implementation Details for Pre-Training

For pre-training, we use the AdamW optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.99$, a batch size of 16, and a one-cycle cosine annealing scheduler with a maximum learning rate of $10^{-2}$. The models are trained for 50 epochs, with each epoch exposing the entire training dataset to the model once. We increase the number of parameters of the model to 60 M by increasing the channel widths without changing the depths.
The input point cloud is a square tile with a side length of 500 m. During training, we randomly crop smaller 144 m × 144 m tiles from the original tile. Each cropped tile is voxelized with a voxel size of 0.6 m , and up to 200,000 voxels are sampled, with each voxel containing a maximum of 5 points. For the ground truth of point coordinate reconstruction, the point cloud is voxelized using a voxel size of 4.8 m × 4.8 m × 288 m to create BEV voxels, where the maximum number of voxels is set to 200,000 and each voxel can contain up to 30 points. The ground truth density is computed on the fly from the BEV voxels. Several common data augmentations including random flipping, scaling, and translation are performed for the input point clouds. The input features consist of the 3D coordinates of the points. All pre-training runs were conducted on the AI Bridging Cloud Infrastructure (ABCI) 2.0 using up to 4 NVIDIA V100 GPUs.
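For reference, the optimizer and schedule described above correspond roughly to the following PyTorch setup; the steps-per-epoch value is a placeholder, and the actual training code may differ in detail.

```python
import torch

def build_optimizer_and_scheduler(model, steps_per_epoch, epochs=50):
    # AdamW with beta_1 = 0.9, beta_2 = 0.99, as in the pre-training setup.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2, betas=(0.9, 0.99))
    # One-cycle schedule with cosine annealing and a maximum learning rate of 1e-2.
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer,
        max_lr=1e-2,
        epochs=epochs,
        steps_per_epoch=steps_per_epoch,
        anneal_strategy="cos",
    )
    return optimizer, scheduler
```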

5.2. Task

5.2.1. Tree Species Classification

Tree species classification is a vital task for managing forests. The identification of tree species supports public policies for forest management and helps mitigate the impact of climate change on forests. To validate the effectiveness of the approach for tree species classification, we use the PureForest [2] dataset, a comprehensive collection tailored for analyzing forest environments. The dataset comprises 135,569 tiles, each measuring 50 m × 50 m , and covers a total area of 339 km 2 across 449 distinct closed forests located in 40 departments in southern France. It includes 18 tree species, categorized into 13 semantic classes: Deciduous oak, Evergreen oak, Beech, Chestnut, Black locust, Maritime pine, Scotch pine, Black pine, Aleppo pine, Fir, Spruce, Larch, and Douglas. The dataset provides two modalities: colored ALS point clouds (40 points/m²) and aerial images (spatial resolution of 0.2 m ). The task is to classify a tile into one of the thirteen semantic categories. We use only the 3D point coordinates as input. Since each tile measures 50 m × 50 m , we use the entire tile as input during both training and testing. Following the pre-training settings, we limit the number of voxels after voxelization to 200,000. All the other settings remain the same as the pre-training ones.
The architecture used for this task is illustrated in Figure 5 (middle). With a point cloud tile as input, the model’s 3D encoder transforms the 3D coordinates into deep features. Receiving these deep features, the model outputs a classification label representing the tree species of the input tile. To achieve this, the encoder’s output is spatially pooled to form a global vector that summarizes the input point cloud. We concatenate the average-pooled and max-pooled vectors to emphasize both sharp and smooth features. The resulting global descriptor is then classified through a series of fully connected layers.
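A minimal PyTorch sketch of this pooling and classification head is shown below; it assumes the encoder output has been gathered into a dense (batch, channels, points) tensor, and the channel and hidden dimensions are placeholders.

```python
import torch
import torch.nn as nn

class GlobalClassificationHead(nn.Module):
    """Concatenates average- and max-pooled encoder features and classifies
    the resulting global descriptor with fully connected layers."""
    def __init__(self, in_channels=256, num_classes=13):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * in_channels, 256), nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, feats):          # feats: (B, C, N) encoder features
        avg = feats.mean(dim=2)        # smooth, global structure
        mx = feats.max(dim=2).values   # sharp, discriminative structure
        return self.mlp(torch.cat([avg, mx], dim=1))
```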

5.2.2. Terrain Scene Recognition

Three-dimensional terrain scene recognition is crucial for understanding and classifying landforms, which plays a significant role in geography-related research areas such as digital terrain analysis and ecological environment studies [73]. Therefore, it is vital to validate the effectiveness of the pre-trained model on terrain scene classification. However, there are very few publicly available datasets for this task. While a prior study [73] exists, the data used in the study remain private.
To address this, we develop our own dataset to evaluate our model on terrain scene recognition. We base our dataset on OpenGF [1], which was originally designed for ground filtering. In OpenGF, the authors divided the data into four prime terrain types—Metropolis, Small City, Village, and Mountain—consisting of 160 500 m × 500 m point cloud tiles for training and validation. These terrain types are further subdivided into nine scenes: Metropolis is divided into regions with large roofs (S1) and dense roofs (S2); Small City is divided into tiles with flat ground (S3), locally undulating ground (S4), and rugged ground (S5); Village consists only of tiles with scattered buildings (S6); Mountain is divided into tiles with gentle slopes and dense vegetation (S7), steep slopes and sparse vegetation (S8), and steep slopes and dense vegetation (S9). Given the well-defined scene categories and the large size of the dataset, we create a dataset for terrain scene recognition based on OpenGF.
To construct the dataset, we first combine the training and validation tiles from OpenGF. We then split the combined tiles into training, validation, and test sets, assigning 106 tiles (about 66%) to training, 27 tiles (about 17%) to validation, and 27 tiles (about 17%) to testing. For the training tiles, we divide each 500 m × 500 m tile into smaller 100 m × 100 m tiles using a sliding window cropping algorithm with an overlap of 50 m . For validation and test tiles, no overlap is applied. This process results in 10,591 tiles for training, 675 tiles for validation, and 675 tiles for testing. Examples of the dataset are shown in Figure 6.
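The sliding-window cropping used to generate the training tiles can be expressed compactly; the sketch below is a simplified version that crops a point array into 100 m windows with a configurable stride (a 50 m stride gives the 50 m overlap used for training; a 100 m stride gives non-overlapping validation and test tiles).

```python
import numpy as np

def sliding_window_crops(points, window=100.0, stride=50.0):
    """points: (N, 3) array covering one 500 m x 500 m OpenGF tile.
    Yields point crops of size window x window, stepping by stride."""
    xy_min = points[:, :2].min(axis=0)
    xy_max = points[:, :2].max(axis=0)
    xs = np.arange(xy_min[0], xy_max[0] - window + stride, stride)
    ys = np.arange(xy_min[1], xy_max[1] - window + stride, stride)
    for x0 in xs:
        for y0 in ys:
            in_win = (
                (points[:, 0] >= x0) & (points[:, 0] < x0 + window)
                & (points[:, 1] >= y0) & (points[:, 1] < y0 + window)
            )
            if in_win.any():
                yield points[in_win]
```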
The task is to classify a tile into one of the nine semantic categories (S1–S9). Similar to the tree species classification task, we use only 3D coordinates as input. During training, the entire tile is fed into the network. Following the pre-training settings, we limit the number of voxels after voxelization to 200,000. All the other settings remain the same as the pre-training ones. For the fine-tuning architecture, we use the same architecture as that used for tree species classification.

5.2.3. Point Cloud Semantic Segmentation

Urban point cloud semantic segmentation provides vital information about ground objects for urban modeling. It involves classifying points in 3D data into meaningful categories such as buildings, roads, vegetation, and other urban features. In this work, we use the Dayton Annotated Laser Earth Scan (DALES) dataset [3], an aerial LiDAR dataset with nearly half a billion points spanning 10 square kilometers, to evaluate the performance of the pre-trained models. DALES consists of 40 scenes of dense, labeled aerial data covering multiple scene types, including urban, suburban, rural, and commercial. The dataset is hand-labeled by expert LiDAR technicians into eight semantic categories: ground, vegetation, cars, trucks, poles, power lines, fences, and buildings. While sensor intensity and return information are available, we use only the x, y, and z features as input. Each tile measures 500 m × 500 m , and the task is to classify points into one of the semantic classes. Similar to the pre-training setup, we sample 144 m × 144 m tiles from the 500 m × 500 m tiles during training. The maximum number of voxels is set to 200,000 during training and 1,000,000 during testing to ensure that all points are classified. All the other settings remain the same as the pre-training ones.
The architecture used for this task is shown in Figure 5 (bottom). Upon receiving the input tile, the 3D encoder generates multi-scale features through non-linear transformations and downsampling operations. To obtain the per-point class labels, we append a decoder to the pre-trained BEV-MAE encoder to recover the full resolution of the points. The decoder receives the downsampled high-level representations from the encoder and gradually transforms and upsamples the point cloud until it reaches full resolution. The resulting architecture is a U-Net [74]-style 3D CNN, which connects the encoder and decoder using skip connections. After passing through the decoder, each point is classified using a series of fully connected layers.
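As a rough illustration of one decoder stage, the sketch below upsamples coarse features and fuses them with an encoder skip connection; it uses dense 3D convolutions as a stand-in for the sparse convolutions employed in the actual model, and the channel sizes are placeholders.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """One U-Net-style decoder stage: upsample coarse features, concatenate
    the encoder skip connection, and refine with a 3D convolution. A dense
    stand-in for the sparse-convolution decoder used in practice."""
    def __init__(self, coarse_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose3d(coarse_ch, out_ch, kernel_size=2, stride=2)
        self.refine = nn.Sequential(
            nn.Conv3d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, coarse, skip):   # coarse: features at half the skip resolution
        x = self.up(coarse)            # upsample to the skip connection's resolution
        x = torch.cat([x, skip], dim=1)
        return self.refine(x)
```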

5.3. Evaluation Metrics

We use two commonly adopted metrics to evaluate the performance of the models: Mean Intersection over Union (mIoU) and Overall Accuracy (OA).
mIoU evaluates a model’s performance by measuring the overlap between predicted and ground truth points or point clouds. The IoU for each class i is defined as
$\mathrm{IoU}_i = \frac{\mathrm{TP}_i}{\mathrm{TP}_i + \mathrm{FP}_i + \mathrm{FN}_i},$

where $C$ is the total number of classes, $\mathrm{TP}_i$ represents the true positives for class $i$, and $\mathrm{FP}_i$ and $\mathrm{FN}_i$ denote the false positives and false negatives, respectively. IoU is considered a stricter and more comprehensive metric compared to metrics like precision and recall, as it penalizes both over-prediction ($\mathrm{FP}_i$) and under-prediction ($\mathrm{FN}_i$). The mIoU is then calculated as the average IoU across all classes:

$\mathrm{mIoU} = \frac{1}{C} \sum_{i=1}^{C} \mathrm{IoU}_i .$
mIoU averages performance across classes, ensuring that both major and minor classes are equally weighted. Consequently, a high mIoU requires the model to perform well on all classes, regardless of their prevalence in the dataset.
OA measures the proportion of correctly classified points or point clouds across all classes. Mathematically, OA is defined as
$\mathrm{OA} = \frac{\sum_{i=1}^{C} \mathrm{TP}_i}{\#\ \text{all points or point clouds}} .$
OA does not account for class imbalance, meaning that a high OA can be achieved by performing well on major classes, even if performance on minor classes is poor.
These two metrics complement each other: mIoU emphasizes the model’s ability to accurately predict each class, providing a balanced evaluation of model performance, while OA captures the overall classification accuracy across all classes.
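Both metrics can be computed from a single confusion matrix; a minimal sketch is given below.

```python
import numpy as np

def miou_and_oa(conf):
    """conf: (C, C) confusion matrix with rows = ground truth, cols = prediction.
    Returns (mIoU, OA) as defined above."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp   # predicted as class i but belonging elsewhere
    fn = conf.sum(axis=1) - tp   # belonging to class i but predicted elsewhere
    iou = tp / (tp + fp + fn + 1e-12)
    miou = iou.mean()
    oa = tp.sum() / conf.sum()
    return miou, oa
```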

6. Results

In this section, we present the pre-training results through the reconstruction performance of BEV-MAE. We then discuss the results of various downstream tasks, including tree species classification, terrain scene recognition, and point cloud semantic segmentation.

6.1. Pre-Training Results

To validate the quality of the learned representations, we visualize the reconstructed coordinates and densities in this section. We feed unseen point cloud tiles (not used during pre-training) into the model and visualize the reconstructed coordinates and densities.
As shown in Figure 7, our first observation is that the model effectively reconstructs the overall patterns of the point clouds. The reconstructed surface objects, including buildings, trees, and the ground, align roughly well with the ground truth point clouds. However, a significant amount of detail is missing in the reconstructions, suggesting that detailed shape information is not fully recovered. Specifically, points within individual BEV cells often reconstruct as simple plane-like shapes, failing to capture the fine-grained details of the objects. This indicates that the network primarily learns abstract shapes of the point clouds rather than their fine-grained geometry. Furthermore, this limitation suggests the model may struggle to recognize smaller objects, as their relatively small size and detailed shape information are more easily obscured during the masked autoencoding process.
Figure 8 shows the results of density reconstruction. In general, the model achieves high-quality reconstruction. The “Recon. + visible” outputs often closely mimic the ground truth. The error maps reveal that errors are frequently concentrated in regions where density is higher. For instance, in the first row of the figure, errors are primarily concentrated around trees, whereas errors at ground or building points are minimal. Similarly, the third row shows that regions with highly variable topography under dense vegetation (upper right) pose challenges for density reconstruction.

6.2. Downstream Application Results

6.2.1. Tree Species Classification

As shown in Table 9, the pre-trained model outperforms the scratch model by 3.4% in mIoU and 0.3% in OA, highlighting the effectiveness of the dataset and pre-training for tree species classification. Specifically, in Table 10, the pre-trained model demonstrates either slight or substantial improvements over the scratch model across nearly all categories. This suggests that pre-training enables the model to learn generalizable shape-related features beneficial for distinguishing between tree species.
The most notable improvements are observed in the Black locust and Douglas classes, with performance gains of 8.5% and 14.5%, respectively. We hypothesize that these significant improvements stem from these species being native to the United States and likely included in the pre-training dataset. Furthermore, the improvement in the Douglas class may also be attributed to its small sample size (as shown in the Tiles column), which limits the scratch model’s ability to learn transferable features effectively.
Moreover, our models, both scratch and pre-trained, largely outperform the baseline models reported by [2], even though the baselines use additional features such as color. Apart from the pre-training, this large gap can be partially attributed to differences in input data handling. For instance, while the baselines subsampled the point clouds with a voxel size of 0.25 m, we used 0.06 m, which yields much higher resolution and preserves much finer details. Therefore, we expect our model to learn better geometric features.

6.2.2. Terrain Scene Recognition

The overall results of terrain scene recognition are presented in Table 11. As shown, BEV-MAE pre-trained on the 3DEP dataset significantly outperforms the scratch model in terms of both mIoU and OA. This suggests that pre-training offers valuable improvements in the quality of representation and generally enhances terrain scene recognition. The detailed classification results for each class are presented in Table 12. In general, the pre-trained BEV-MAE performs better on terrain scenes such as Metropolis, Village, and Mountain, while showing lower performance on Small City terrain scenes. For Metropolis classes, the pre-trained model demonstrates significantly better recognition for the S1 class compared to the scratch model, indicating a stronger semantic understanding of urban scenes, particularly buildings with large roofs. Moreover, the pre-trained model achieves substantially better performance across all subclasses of the Mountain terrain scene, highlighting its ability to understand varying terrains with vegetation. While S8 and S9 share similar steep terrain characteristics, the pre-trained model effectively differentiates between sparse and dense vegetation, performing significantly better than the scratch model. This result highlights the pre-trained model’s capability to capture fine-grained semantic differences across different forest scenes where complex terrain and surface object interactions occur.
We also compare our results with other point-based and voxel-based methods. Our results, detailed in Table 11 and Table 12, show that while BEV-MAE trained from scratch underperformed PointNet++, our pre-trained model achieved comparable or superior results to both point-based and voxel-based networks. PointNet++ demonstrated strong performance, likely due to its effective modeling of geometric details by processing individual points without discretization. However, this advantage leads to higher memory consumption and training costs compared to voxel-based methods, limiting its scalability.

6.2.3. Point Cloud Semantic Segmentation for Developed Areas

As shown in Table 13, pre-training results in only a slight increase in average mIoU, suggesting that it has a limited impact on learning useful features for the segmentation of Developed areas. We hypothesize that, as discussed in Section 6.1, the model struggles to capture fine-grained geometric details of objects in Developed areas, which are crucial for tasks like semantic segmentation. Regarding the model architecture, the inherent discretization of our voxel-based approach limits the level of detail the model can learn. Specifically, it struggles to capture details smaller than the designated voxel length of 0.6 m used in this study. To quantitatively investigate misclassifications arising from data characteristics, the normalized confusion matrix is shown in Figure 9. As anticipated, small-scale objects are misclassified relatively more often than larger objects. For instance, differentiating between car and truck classes, which can have very similar shapes, requires modeling more fine-grained geometric details. To qualitatively investigate the error patterns, we visualize the segmentation maps in Figure 10. First, we observe that misclassification frequently occurs around boundaries. Since boundaries are inherently very fine-grained, they can pose a challenge for methods that employ some form of data discretization. Second, misclassifications often happen between geometrically and semantically similar objects. For example, fences and lower vegetation, which are frequently located near each other (2nd column of Figure 10), are barely distinguishable based solely on geometry. On the other hand, misclassifications, such as those between ground and lower vegetation, could potentially be mitigated by providing additional radiometric input.
To further investigate the effectiveness of pre-training, we trained the model on datasets of varying scales and compared it against several dataset variants. First, we created dataset variants where the proposed geospatial sampling method was replaced with random sampling, meaning sampling was performed without selectively considering land covers or topographies. Second, we pre-trained the model using OpenGF [1], a large-scale dataset designed for ground filtering. OpenGF was included for comparison due to its diverse land covers and topographies, which are similar in design (but not scale) to our 3DEP dataset.
Interestingly, when the number of samples per project is relatively small (10 samples/project), the pre-trained model performed slightly worse than the scratch model. We hypothesize that insufficient data further hamper the model’s ability to learn local details. However, as the dataset scale increased with our geospatial sampling method, the model’s performance steadily improved, surpassing the scratch model after reaching 20 samples per project. This demonstrates that transferable features for urban scenes can be effectively learned with an increasing number of samples.
Conversely, the model pre-trained on OpenGF achieved the lowest performance, despite its inclusion of areas such as Metropolis, Small City, and Village. These results suggest that transferable features for urban scenes cannot be effectively learned when the dataset size is limited, highlighting the critical role of dataset scale in successful pre-training.
On the other hand, the randomly sampled datasets failed to provide meaningful improvements, even as the dataset size increased. Their performance consistently remained below that of the scratch model. This reveals that both dataset scale and the sampling strategy are crucial for effective pre-training, validating the importance of a well-designed geospatial sampling method.

7. Conclusions

In this study, we investigate large-scale pre-training for airborne laser scanning (ALS) applications. To support this, we propose a simple yet effective framework and construct a large-scale dataset tailored for pre-training. The framework takes land cover maps, DEMs, and ALS point clouds as input, and applies joint land cover–terrain inverse probability sampling to ensure diversity while allowing flexible control over dataset size. We pre-train BEV-MAE on the constructed dataset and evaluate its performance across three downstream tasks. Our results show that the pre-trained model consistently outperforms models trained from scratch. Moreover, we validate that scaling the dataset using the proposed framework leads to consistent performance improvements, in contrast to naive random sampling.
While we observe significant improvement in tree species classification and terrain scene recognition tasks, the performance gains in point cloud segmentation remain limited. We hypothesize that this could be due to the model’s inability to capture fine-grained details, as discussed in Section 5.2.3 and Section 6.1, which hinders its ability to model small objects and boundaries. Additionally, we believe that another contributing factor to the problem is that the baseline model, originating from the general computer vision domain, may not be fully adapted to address specific challenges in ALS data, such as the drastic variations in object scales within a single scene. Furthermore, given the varying density across point cloud projects, it remains uncertain how this density variability influences the final fine-tuning performance.
One possible direction for improvement is to focus on generating more detailed reconstructions. This could be achieved by extending the loss function, such as incorporating perceptual loss [81], or by adopting multi-scale SSL architectures [82]. Another promising direction is to develop tailored SSL methods specifically designed for ALS data, capable of handling both large-scale and small-scale objects simultaneously. For example, this could involve inventing new masking strategies [21] that better address the unique challenges of ALS data. For data analysis, the effect of pre-training data density could be further investigated by, for example, including only high-density point clouds in dataset construction. Regarding further applications, the effectiveness of the pre-trained model could be validated in time-sensitive scenarios, such as disaster response, using multi-temporal LiDAR point clouds.
We believe that the developed framework, constructed dataset, and pre-trained models can serve as a strong baseline for future research on large-scale ALS dataset development. Additionally, our findings offer valuable insights for advancing the application of the pre-training and fine-tuning paradigm in ALS-related tasks.

Author Contributions

Conceptualization, H.X., X.L., T.K. and K.-S.K.; methodology, H.X. and K.-S.K.; software, H.X.; validation, H.X.; formal analysis, H.X., X.L., T.K. and K.-S.K.; investigation, H.X.; resources, X.L. and K.-S.K.; data curation, H.X.; writing—original draft preparation, H.X., X.L., T.K. and K.-S.K.; writing—review and editing, H.X., X.L., T.K. and K.-S.K.; visualization, H.X.; supervision, X.L. and K.-S.K.; project administration, X.L. and K.-S.K.; funding acquisition, K.-S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by AIST policy-based budget project R&D on Generative AI Foundation Models for the Physical Domain.

Data Availability Statement

Code for experiments and data generation is available at https://github.com/martianxiu/ALS_pretraining (accessed on 23 May 2025).

Acknowledgments

NLCD, 3DEP DEMs, and 3DEP ALS are courtesy of the United States Geological Survey.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Qin, N.; Tan, W.; Ma, L.; Zhang, D.; Li, J. OpenGF: An ultra-large-scale ground filtering dataset built upon open ALS point clouds around the world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 1082–1091. [Google Scholar]
  2. Gaydon, C.; Roche, F. PureForest: A Large-scale Aerial Lidar and Aerial Imagery Dataset for Tree Species Classification in Monospecific Forests. arXiv 2024, arXiv:2404.12064. [Google Scholar]
  3. Varney, N.; Asari, V.K.; Graehling, Q. DALES: A large-scale aerial LiDAR data set for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Virtual, 14–19 June 2020; pp. 186–187. [Google Scholar]
  4. Xiu, H.; Liu, X.; Wang, W.; Kim, K.S.; Shinohara, T.; Chang, Q.; Matsuoka, M. DS-Net: A dedicated approach for collapsed building detection from post-event airborne point clouds. Int. J. Appl. Earth Obs. Geoinf. 2023, 116, 103150. [Google Scholar] [CrossRef]
  5. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  6. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  7. Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. On the opportunities and risks of foundation models. arXiv 2021, arXiv:2108.07258. [Google Scholar]
  8. Guo, X.; Lao, J.; Dang, B.; Zhang, Y.; Yu, L.; Ru, L.; Zhong, L.; Huang, Z.; Wu, K.; Hu, D.; et al. Skysense: A multi-modal remote sensing foundation model towards universal interpretation for earth observation imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 27672–27683. [Google Scholar]
  9. Mendieta, M.; Han, B.; Shi, X.; Zhu, Y.; Chen, C. Towards geospatial foundation models via continual pretraining. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 16806–16816. [Google Scholar]
  10. Hong, D.; Zhang, B.; Li, X.; Li, Y.; Li, C.; Yao, J.; Yokoya, N.; Li, H.; Ghamisi, P.; Jia, X.; et al. SpectralGPT: Spectral remote sensing foundation model. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5227–5244. [Google Scholar] [CrossRef]
  11. Stoker, J.; Miller, B. The accuracy and consistency of 3d elevation program data: A systematic analysis. Remote Sens. 2022, 14, 940. [Google Scholar] [CrossRef]
  12. Actueel Hoogtebestand Nederland. Actueel Hoogtebestand Nederland (AHN), n.d. Available online: https://www.ahn.nl/ (accessed on 26 December 2024).
  13. Lin, Z.; Wang, Y.; Qi, S.; Dong, N.; Yang, M.H. BEV-MAE: Bird’s Eye View Masked Autoencoders for Point Cloud Pre-training in Autonomous Driving Scenarios. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 3531–3539. [Google Scholar]
  14. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 2024, 36. [Google Scholar]
  15. Ayush, K.; Uzkent, B.; Meng, C.; Tanmay, K.; Burke, M.; Lobell, D.; Ermon, S. Geography-aware self-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 10181–10190. [Google Scholar]
  16. Manas, O.; Lacoste, A.; Giró-i Nieto, X.; Vazquez, D.; Rodriguez, P. Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, virtual, 11–17 October 2021; pp. 9414–9423. [Google Scholar]
  17. Mall, U.; Hariharan, B.; Bala, K. Change-aware sampling and contrastive learning for satellite images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 5261–5270. [Google Scholar]
  18. Chen, X.; Fan, H.; Girshick, R.; He, K. Improved baselines with momentum contrastive learning. arXiv 2020, arXiv:2003.04297. [Google Scholar]
  19. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 16000–16009. [Google Scholar]
  20. Cong, Y.; Khanna, S.; Meng, C.; Liu, P.; Rozi, E.; He, Y.; Burke, M.; Lobell, D.; Ermon, S. Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery. Adv. Neural Inf. Process. Syst. 2022, 35, 197–211. [Google Scholar]
  21. Sun, X.; Wang, P.; Lu, W.; Zhu, Z.; Lu, X.; He, Q.; Li, J.; Rong, X.; Yang, Z.; Chang, H.; et al. RingMo: A remote sensing foundation model with masked image modeling. IEEE Trans. Geosci. Remote Sens. 2022, 61, 1–22. [Google Scholar] [CrossRef]
  22. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255. [Google Scholar]
  23. Han, B.; Zhang, S.; Shi, X.; Reichstein, M. Bridging remote sensors with multisensor geospatial foundation models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 27852–27862. [Google Scholar]
  24. Liu, F.; Chen, D.; Guan, Z.; Zhou, X.; Zhu, J.; Ye, Q.; Fu, L.; Zhou, J. Remoteclip: A vision language foundation model for remote sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5622216. [Google Scholar] [CrossRef]
  25. Wang, Z.; Prabha, R.; Huang, T.; Wu, J.; Rajagopal, R. Skyscript: A large and semantically diverse vision-language dataset for remote sensing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 5805–5813. [Google Scholar]
  26. Mall, U.; Phoo, C.P.; Liu, M.K.; Vondrick, C.; Hariharan, B.; Bala, K. Remote sensing vision-language foundation models without annotations via ground remote alignment. arXiv 2023, arXiv:2312.06960. [Google Scholar]
  27. Kuckreja, K.; Danish, M.S.; Naseer, M.; Das, A.; Khan, S.; Khan, F.S. Geochat: Grounded large vision-language model for remote sensing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 27831–27840. [Google Scholar]
  28. Liu, H.; Li, C.; Li, Y.; Lee, Y.J. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 26296–26306. [Google Scholar]
  29. Li, X.; Li, C.; Tong, Z.; Lim, A.; Yuan, J.; Wu, Y.; Tang, J.; Huang, R. Campus3d: A photogrammetry point cloud benchmark for hierarchical understanding of outdoor scene. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 238–246. [Google Scholar]
  30. Hu, Q.; Yang, B.; Khalid, S.; Xiao, W.; Trigoni, N.; Markham, A. Sensaturban: Learning semantics from urban-scale photogrammetric point clouds. Int. J. Comput. Vis. 2022, 130, 316–343. [Google Scholar] [CrossRef]
  31. Li, M.; Wu, Y.; Yeh, A.G.; Xue, F. HRHD-HK: A benchmark dataset of high-rise and high-density urban scenes for 3D semantic segmentation of photogrammetric point clouds. In Proceedings of the 2023 IEEE International Conference on Image Processing Challenges and Workshops (ICIPCW), Kuala Lumpur, Malaysia, 8–11 October 2023; Volume 1, pp. 3714–3718. [Google Scholar]
  32. Chen, M.; Hu, Q.; Yu, Z.; Thomas, H.; Feng, A.; Hou, Y.; McCullough, K.; Ren, F.; Soibelman, L. Stpls3d: A large-scale synthetic and real aerial photogrammetry 3d point cloud dataset. arXiv 2022, arXiv:2203.09065. [Google Scholar]
  33. Hackel, T.; Savinov, N.; Ladicky, L.; Wegner, J.D.; Schindler, K.; Pollefeys, M. Semantic3d. net: A new large-scale point cloud classification benchmark. arXiv 2017, arXiv:1704.03847. [Google Scholar]
  34. Roynard, X.; Deschaud, J.E.; Goulette, F. Paris-Lille-3D: A large and high-quality ground-truth urban point cloud dataset for automatic segmentation and classification. Int. J. Robot. Res. 2018, 37, 545–557. [Google Scholar] [CrossRef]
  35. Behley, J.; Garbade, M.; Milioto, A.; Quenzel, J.; Behnke, S.; Stachniss, C.; Gall, J. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9297–9307. [Google Scholar]
  36. Tan, W.; Qin, N.; Ma, L.; Li, Y.; Du, J.; Cai, G.; Yang, K.; Li, J. Toronto-3D: A large-scale mobile LiDAR dataset for semantic segmentation of urban roadways. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 202–203. [Google Scholar]
  37. Rottensteiner, F.; Sohn, G.; Gerke, M.; Wegner, J.D.; Breitkopf, U.; Jung, J. Results of the ISPRS benchmark on urban object detection and 3D building reconstruction. ISPRS J. Photogramm. Remote Sens. 2014, 93, 256–271. [Google Scholar] [CrossRef]
  38. Zolanvari, S.; Ruano, S.; Rana, A.; Cummins, A.; Da Silva, R.E.; Rahbar, M.; Smolic, A. DublinCity: Annotated LiDAR point cloud and its applications. arXiv 2019, arXiv:1909.03613. [Google Scholar]
  39. Ye, Z.; Xu, Y.; Huang, R.; Tong, X.; Li, X.; Liu, X.; Luan, K.; Hoegner, L.; Stilla, U. Lasdu: A large-scale aerial lidar dataset for semantic labeling in dense urban areas. ISPRS Int. J. Geo-Inf. 2020, 9, 450. [Google Scholar] [CrossRef]
  40. Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  41. Yu, X.; Tang, L.; Rao, Y.; Huang, T.; Zhou, J.; Lu, J. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 19313–19322. [Google Scholar]
  42. Fu, K.; Gao, P.; Liu, S.; Qu, L.; Gao, L.; Wang, M. Pos-bert: Point cloud one-stage bert pre-training. Expert Syst. Appl. 2024, 240, 122563. [Google Scholar] [CrossRef]
  43. Fu, K.; Yuan, M.; Liu, S.; Wang, M. Boosting point-bert by multi-choice tokens. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 438–447. [Google Scholar] [CrossRef]
  44. Rolfe, J.T. Discrete Variational Autoencoders. arXiv 2017, arXiv:1609.02200. [Google Scholar]
  45. Pang, Y.; Wang, W.; Tay, F.E.; Liu, W.; Tian, Y.; Yuan, L. Masked autoencoders for point cloud self-supervised learning. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 604–621. [Google Scholar]
  46. Liu, H.; Cai, M.; Lee, Y.J. Masked discrimination for self-supervised learning on point clouds. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 657–675. [Google Scholar]
  47. Zhang, R.; Guo, Z.; Gao, P.; Fang, R.; Zhao, B.; Wang, D.; Qiao, Y.; Li, H. Point-m2ae: Multi-scale masked autoencoders for hierarchical point cloud pre-training. Adv. Neural Inf. Process. Syst. 2022, 35, 27061–27074. [Google Scholar]
  48. Chen, G.; Wang, M.; Yang, Y.; Yu, K.; Yuan, L.; Yue, Y. Pointgpt: Auto-regressively generative pre-training from point clouds. Adv. Neural Inf. Process. Syst. 2023, 36, 29667–29679. [Google Scholar]
  49. Yan, S.; Yang, Y.; Guo, Y.; Pan, H.; Wang, P.s.; Tong, X.; Liu, Y.; Huang, Q. 3d feature prediction for masked-autoencoder-based point cloud pretraining. arXiv 2023, arXiv:2304.06911. [Google Scholar]
  50. Zhang, R.; Wang, L.; Qiao, Y.; Gao, P.; Li, H. Learning 3d representations from 2d pre-trained models via image-to-point masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 21769–21780. [Google Scholar]
  51. Guo, Z.; Zhang, R.; Qiu, L.; Li, X.; Heng, P.A. Joint-mae: 2D-3D joint masked autoencoders for 3d point cloud pre-training. arXiv 2023, arXiv:2302.14007. [Google Scholar]
  52. Qi, Z.; Dong, R.; Fan, G.; Ge, Z.; Zhang, X.; Ma, K.; Yi, L. Contrast with reconstruct: Contrastive 3d representation learning guided by generative pretraining. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 28223–28243. [Google Scholar]
  53. Hess, G.; Jaxing, J.; Svensson, E.; Hagerman, D.; Petersson, C.; Svensson, L. Masked autoencoder for self-supervised pre-training on lidar point clouds. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 350–359. [Google Scholar]
  54. Tian, X.; Ran, H.; Wang, Y.; Zhao, H. Geomae: Masked geometric target prediction for self-supervised point cloud pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 13570–13580. [Google Scholar]
  55. Yang, H.; He, T.; Liu, J.; Chen, H.; Wu, B.; Lin, B.; He, X.; Ouyang, W. GD-MAE: Generative decoder for MAE pre-training on lidar point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 9403–9414. [Google Scholar]
  56. Carós, M.; Just, A.; Seguí, S.; Vitrià, J. Self-Supervised Pre-Training Boosts Semantic Scene Segmentation on LiDAR data. In Proceedings of the 2023 18th International Conference on Machine Vision and Applications (MVA), Hamamatsu, Japan, 23–25 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6. [Google Scholar]
  57. Zbontar, J.; Jing, L.; Misra, I.; LeCun, Y.; Deny, S. Barlow twins: Self-supervised learning via redundancy reduction. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 18–24 July 2021; pp. 12310–12320. [Google Scholar]
  58. de Gélis, I.; Saha, S.; Shahzad, M.; Corpetti, T.; Lefèvre, S.; Zhu, X.X. Deep unsupervised learning for 3d als point clouds change detection. ISPRS Open J. Photogramm. Remote Sens. 2023, 9, 100044. [Google Scholar] [CrossRef]
  59. Yang, H.; Huang, S.; Wang, R.; Wang, X. Self-Supervised Pre-Training for 3D Roof Reconstruction on LiDAR Data. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6500405. [Google Scholar]
  60. Zhang, Y.; Yao, J.; Zhang, R.; Wang, X.; Chen, S.; Fu, H. HAVANA: Hard Negative Sample-Aware Self-Supervised Contrastive Learning for Airborne Laser Scanning Point Cloud Semantic Segmentation. Remote Sens. 2024, 16, 485. [Google Scholar] [CrossRef]
  61. U.S. Geological Survey. What is 3DEP? 2024. Available online: https://www.usgs.gov/3d-elevation-program/what-3dep (accessed on 2 December 2024).
  62. U.S. Geological Survey. USGS 3DEP LiDAR Point Clouds. 2024. Available online: https://registry.opendata.aws/usgs-lidar/ (accessed on 2 December 2024).
  63. Wickham, J.; Stehman, S.V.; Sorenson, D.G.; Gass, L.; Dewitz, J.A. Thematic accuracy assessment of the NLCD 2019 land cover for the conterminous United States. Gisci. Remote Sens. 2023, 60, 2181143. [Google Scholar] [CrossRef] [PubMed]
  64. Multi-Resolution Land Characteristics (MRLC) Consortium. National Land Cover Database Class Legend and Description. 2024. Available online: https://www.mrlc.gov/data/legends/national-land-cover-database-class-legend-and-description (accessed on 3 December 2024).
  65. Pamela, P.; Yukni, A.; Imam, S.A.; Kartiko, R.D. The selective causative factors on landslide susceptibility assessment: Case study Takengon, Aceh, Indonesia. In Proceedings of the AIP Conference Proceedings, Maharashtra, India, 5–6 July 2018; AIP Publishing: Melville, NY, USA, 2018; Volume 1987. [Google Scholar]
  66. Chegini, T.; Li, H.Y.; Leung, L.R. HyRiver: Hydroclimate Data Retriever. J. Open Source Softw. 2021, 6, 1–3. [Google Scholar] [CrossRef]
  67. U.S. Geological Survey. Thematic Accuracy Assessment of NLCD 2019 Land Cover for the Conterminous United States. 2024. Available online: https://www.usgs.gov/publications/thematic-accuracy-assessment-nlcd-2019-land-cover-conterminous-united-states (accessed on 2 December 2024).
  68. Melekhov, I.; Umashankar, A.; Kim, H.J.; Serkov, V.; Argyle, D. ECLAIR: A High-Fidelity Aerial LiDAR Dataset for Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 7627–7637. [Google Scholar]
  69. Graves, S.; Marconi, S. IDTReeS 2020 Competition Data. 2020. Available online: https://zenodo.org/records/3700197 (accessed on 23 May 2025).
  70. Sithole, G.; Vosselman, G. Experimental comparison of filter algorithms for bare-Earth extraction from airborne laser scanning point clouds. ISPRS J. Photogramm. Remote Sens. 2004, 59, 85–101. [Google Scholar] [CrossRef]
  71. Graham, B.; Engelcke, M.; Van Der Maaten, L. 3d semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 9224–9232. [Google Scholar]
  72. Choy, C.; Gwak, J.; Savarese, S. 4D spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3075–3084. [Google Scholar]
  73. Qin, N.; Hu, X.; Dai, H. Deep fusion of multi-view and multimodal representation of ALS point cloud for 3D terrain scene recognition. ISPRS J. Photogramm. Remote Sens. 2018, 143, 205–212. [Google Scholar] [CrossRef]
  74. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  75. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  76. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 2017, 30, 5105–5114. [Google Scholar]
  77. Chen, Y.; Liu, J.; Zhang, X.; Qi, X.; Jia, J. VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  78. Thomas, H.; Qi, C.R.; Deschaud, J.E.; Marcotegui, B.; Goulette, F.; Guibas, L.J. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6411–6420. [Google Scholar]
  79. Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. Randla-net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020; pp. 11108–11117. [Google Scholar]
  80. Yoo, S.; Jeong, Y.; Jameela, M.; Sohn, G. Human vision based 3d point cloud semantic segmentation of large-scale outdoor scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 6577–6586. [Google Scholar]
  81. Tukra, S.; Hoffman, F.; Chatfield, K. Improving visual representation learning through perceptual understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 14486–14495. [Google Scholar]
  82. Reed, C.J.; Gupta, R.; Li, S.; Brockman, S.; Funk, C.; Clipp, B.; Keutzer, K.; Candido, S.; Uyttendaele, M.; Darrell, T. Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 4088–4099. [Google Scholar]
Figure 1. Overview of the proposed framework: The framework begins by using land cover data, DEMs, and point cloud boundaries to download point cloud tiles from a data server. A proposed geospatial sampling strategy is then applied to ensure that the resulting dataset is balanced across land cover and terrain types, while also allowing flexible control over the dataset size. The point clouds are visualized with elevation-based coloring, where cooler colors represent lower elevations and warmer colors indicate higher elevations.
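To make the balancing idea in Figure 1 concrete, below is a minimal, illustrative sketch of stratified sampling over land-cover and slope strata. It is not the paper’s exact sampling procedure; the tile metadata keys (land_cover, slope_class) and the per-stratum quota are assumptions used only for illustration, with the quota acting as the knob that controls dataset size.

```python
import random
from collections import defaultdict

def stratified_tile_sample(tiles, per_stratum, seed=0):
    """Illustrative stratified sampling of point cloud tiles.

    tiles: iterable of dicts with hypothetical keys 'land_cover'
    (e.g., 'Developed', 'Forest') and 'slope_class' (e.g., 'Flat',
    'Sloped', 'Steep').
    per_stratum: number of tiles drawn from each (land cover, slope)
    stratum, which directly controls the final dataset size.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for tile in tiles:
        strata[(tile["land_cover"], tile["slope_class"])].append(tile)

    sampled = []
    for _, members in sorted(strata.items()):
        k = min(per_stratum, len(members))   # a stratum may hold fewer tiles
        sampled.extend(rng.sample(members, k))
    return sampled
```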
Figure 2. LiDAR point cloud boundaries used in this study are shown with randomly assigned colors for the boundary polygons. Each boundary represents a single LiDAR project. The boundary data were downloaded from [62] on 27 June 2024.
Figure 3. The upper left figure displays the land cover map derived from the Anderson Level 1 classification system, while the upper right figure shows the slope derived from the DEM. The lower left figure presents the slope classification map, and the lower right figure illustrates the locations of the sampled tiles based on our sampling strategy.
Figure 4. Random samples of the dataset. Top: point cloud tiles labeled as “Developed”. Bottom: point cloud tiles labeled as “Forest”. From left to right: point cloud tiles labeled as “Flat”, “Sloping”, “Steep”.
Figure 5. Overview of the pre-training and fine-tuning using BEV-MAE. The details of the pre-training architecture and workflow are presented in Section 5.1.1. More elaborate descriptions of the fine-tuning architectures are presented in Section 5.2.
Figure 6. Some examples from the terrain scene recognition dataset developed based on OpenGF. S1 and S2 refer to metropolitan areas with large and dense roofs, respectively. S3 corresponds to a small city with flat ground, while S4 and S5 represent small cities with locally undulating ground and rugged ground, respectively. S6 denotes village areas, S7 corresponds to mountain areas with gentle slopes and dense vegetation, S8 represents mountain areas with steep slopes and sparse vegetation, and S9 indicates steep slopes with dense vegetation. The points are color-coded by elevation.
Figure 7. Results of coordinate reconstruction. GT represents the ground truth, while GT (masked) displays only the ground truth point clouds within the masked regions.
Figure 8. Density reconstruction results: Points in the Reconstruction, Recon. + Visible, and Ground Truth columns are color-coded by density (darker = lower, lighter = higher). In the Error column, points are color-coded by error (cooler = lower, warmer = higher). In the Point Cloud column, points are color-coded by elevation (warmer = higher, cooler = lower).
Figure 9. Row-normalized confusion matrix for point cloud semantic segmentation. Results are from the best run of BEV-MAE (Ours, 40 samples/project).
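As a reference for how the reported metrics relate to a confusion matrix such as the one in Figure 9, the following is a small numpy sketch (not the authors’ evaluation code) that row-normalizes a confusion matrix and derives overall accuracy (OA) and mean IoU (mIoU) from it.

```python
import numpy as np

def summarize_confusion(cm):
    """cm: (C, C) confusion matrix, rows = ground truth, columns = prediction."""
    cm = np.asarray(cm, dtype=np.float64)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp          # predicted as class c but belonging elsewhere
    fn = cm.sum(axis=1) - tp          # class-c points predicted as another class

    oa = tp.sum() / cm.sum()                                  # overall accuracy
    iou = tp / np.maximum(tp + fp + fn, 1e-12)                # per-class IoU
    row_norm = cm / np.maximum(cm.sum(axis=1, keepdims=True), 1e-12)
    return {"OA": oa, "mIoU": iou.mean(), "IoU": iou, "row_normalized": row_norm}
```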
Figure 10. Qualitative results of point cloud semantic segmentation. Shown are predictions (top row), ground truth (middle row), and the resulting error map (bottom row). Results are from the best run of BEV-MAE (Ours, 40 samples/project).
Table 1. Level I and Level II land cover classification system for NLCD2021.
Level I Class | Level II Class
Water | Open Water; Perennial Ice/Snow
Developed | Developed, Open Space; Developed, Low Intensity; Developed, Medium Intensity; Developed, High Intensity
Barren | Barren Land (Rock/Sand/Clay)
Forest | Deciduous Forest; Evergreen Forest; Mixed Forest
Shrubland | Dwarf Scrub (Alaska only); Shrub/Scrub
Herbaceous | Grassland/Herbaceous; Sedge/Herbaceous (Alaska only); Lichens (Alaska only); Moss (Alaska only)
Planted/Cultivated | Pasture/Hay; Cultivated Crops
Wetlands | Woody Wetlands; Emergent Herbaceous Wetlands
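If the Level II to Level I aggregation of Table 1 is needed in code, it can be expressed as a simple lookup. The mapping below merely restates the table (Alaska-only qualifiers dropped from the keys) and is illustrative; it is not taken from the paper’s code, and any use with NLCD rasters would additionally require the numeric NLCD class codes, which are omitted here.

```python
# Illustrative grouping of NLCD Level II class names (Table 1) into Level I classes.
LEVEL_II_TO_LEVEL_I = {
    "Open Water": "Water",
    "Perennial Ice/Snow": "Water",
    "Developed, Open Space": "Developed",
    "Developed, Low Intensity": "Developed",
    "Developed, Medium Intensity": "Developed",
    "Developed, High Intensity": "Developed",
    "Barren Land (Rock/Sand/Clay)": "Barren",
    "Deciduous Forest": "Forest",
    "Evergreen Forest": "Forest",
    "Mixed Forest": "Forest",
    "Dwarf Scrub": "Shrubland",
    "Shrub/Scrub": "Shrubland",
    "Grassland/Herbaceous": "Herbaceous",
    "Sedge/Herbaceous": "Herbaceous",
    "Lichens": "Herbaceous",
    "Moss": "Herbaceous",
    "Pasture/Hay": "Planted/Cultivated",
    "Cultivated Crops": "Planted/Cultivated",
    "Woody Wetlands": "Wetlands",
    "Emergent Herbaceous Wetlands": "Wetlands",
}
```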
Table 2. Slope classification with degree and percentage ranges.
Slope Class | Degree | Percentage
Flat | 0–5° | 0–8.7%
Sloped | 5–17° | 8.7–30.6%
Steep | ≥17° | ≥30.6%
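For reference, the thresholds in Table 2 can be applied as in the sketch below (the function and argument names are ours, not from the paper); the degree-to-percent conversion, percent = 100·tan(degrees), reproduces the ranges in the table.

```python
import math

def classify_slope(slope_degrees):
    """Map a slope in degrees to the Flat/Sloped/Steep classes of Table 2."""
    if slope_degrees < 5.0:
        return "Flat"
    elif slope_degrees < 17.0:
        return "Sloped"
    return "Steep"

def degrees_to_percent(slope_degrees):
    """Slope percentage = 100 * tan(slope); e.g., 5 deg -> ~8.7%, 17 deg -> ~30.6%."""
    return 100.0 * math.tan(math.radians(slope_degrees))
```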
Table 3. Number of tiles for each land cover and terrain class.
Terrain Class | Developed | Forest | All
Flat | 28,523 | 18,071 | 46,594
Sloped | 3774 | 16,021 | 19,795
Steep | 308 | 7065 | 7373
All | 32,605 | 41,157 | 73,762
Table 4. Specifications of representative geospatial datasets.
Dataset | Year | Coverage | # Points | Land Cover Type
ISPRS [37] | 2014 | – | 1.2 M | Developed
DublinCity [38] | 2019 | 2 × 10⁶ m² | 260 M | Developed
LASDU [39] | 2020 | 1.02 × 10⁶ m² | 3.12 M | Developed
DALES [3] | 2020 | 10 × 10⁶ m² | 505 M | Developed
ECLAIR [68] | 2024 | 10.3 × 10⁶ m² | 582 M | Developed
IDTReeS [69] | 2021 | 3440 m² | 0.02 M | Forest
PureForest [2] | 2024 | 339 × 10⁶ m² | 15 B | Forest
ISPRS filtertest [70] | – | 1.1 × 10⁶ m² | 0.4 M | Developed, Forest
OpenGF [1] | 2021 | 47.7 × 10⁶ m² | 542.1 M | Developed, Forest
3DEP (Ours) | – | 17,691 × 10⁶ m² | 184 B | Developed, Forest
Table 5. Density per square meter for each land cover and terrain class. The average density and its standard deviation are reported (average/std).
Terrain Class | Developed | Forest | All
Flat | 7.7/9.9 | 11.1/15.3 | 9.0/12.4
Sloped | 11.2/16.0 | 11.9/15.5 | 11.8/15.6
Steep | 28.0/39.6 | 18.0/19.2 | 18.4/20.5
All | 8.2/11.6 | 12.6/16.3 | 10.7/14.6
Table 6. Standard deviation of ground points for each land cover and terrain class. The average and standard deviation are reported (average/std).
Terrain Class | Developed | Forest | All
Flat | 2.5/3.4 | 4.9/4.1 | 3.5/3.8
Sloped | 12.7/6.1 | 14.7/8.7 | 14.4/8.3
Steep | 36.1/18.8 | 43.6/19.1 | 43.3/19.1
All | 4.0/6.1 | 15.3/16.8 | 10.4/14.4
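As a reference for how per-tile statistics such as those summarized in Tables 5 and 6 can be computed, here is a minimal numpy sketch under our own assumptions: points is an (N, 3) array of x, y, z coordinates for one tile, ground_mask is a boolean array marking ground-classified points, and the tile area is approximated by the 2D bounding box.

```python
import numpy as np

def tile_statistics(points, ground_mask):
    """Toy per-tile statistics: point density (pts/m^2) and ground-point spread.

    points: (N, 3) array of x, y, z coordinates of one tile.
    ground_mask: (N,) boolean array, True for ground-classified points.
    """
    xy_min = points[:, :2].min(axis=0)
    xy_max = points[:, :2].max(axis=0)
    area = np.prod(xy_max - xy_min)          # bounding-box area in m^2
    density = len(points) / max(area, 1e-6)  # points per square meter

    ground_z = points[ground_mask, 2]
    ground_std = ground_z.std() if ground_z.size else 0.0
    return density, ground_std
```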
Table 7. Return characteristics of the Developed class.
Return Number | Sum Point Count | Percent (%) of Total
Single | 13,036,774,317 | 67.92
First | 15,495,147,301 | 80.73
First of many | 2,446,338,594 | 12.74
Second | 2,491,508,834 | 12.98
Third | 839,493,704 | 4.37
Fourth | 255,303,252 | 1.33
Fifth | 71,200,968 | 0.37
Sixth | 28,103,502 | 0.15
Seventh | 14,103,705 | 0.07
Last | 15,594,824,929 | 81.24
Last of many | 2,559,449,009 | 13.33
Table 8. Return characteristics of the Forest class.
Return Number | Sum Point Count | Percent (%) of Total
Single | 27,509,188,124 | 49.32
First | 38,520,832,462 | 69.07
First of many | 10,992,636,342 | 19.71
Second | 11,082,071,467 | 19.87
Third | 4,272,395,407 | 7.66
Fourth | 1,369,446,735 | 2.46
Fifth | 364,742,990 | 0.65
Sixth | 115,577,887 | 0.21
Seventh | 47,218,274 | 0.08
Last | 38,817,185,224 | 69.60
Last of many | 11,310,467,287 | 20.28
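For completeness, return-number categories like those in Tables 7 and 8 can be tallied from standard LAS point attributes. The sketch below assumes the laspy library and a caller-supplied file path; reading .laz files additionally requires a LAZ backend such as lazrs. It is an illustrative recipe, not the script used to produce the tables.

```python
import numpy as np
import laspy

def return_statistics(path):
    """Count return-number categories (cf. Tables 7 and 8) for one LAS/LAZ file."""
    las = laspy.read(path)
    rn = np.asarray(las.return_number)
    nr = np.asarray(las.number_of_returns)

    counts = {
        "Single": int(np.sum(nr == 1)),
        "First": int(np.sum(rn == 1)),
        "First of many": int(np.sum((rn == 1) & (nr > 1))),
        "Last": int(np.sum(rn == nr)),
        "Last of many": int(np.sum((rn == nr) & (nr > 1))),
    }
    # Intermediate returns (Second, Third, ...).
    for i, name in enumerate(
        ["Second", "Third", "Fourth", "Fifth", "Sixth", "Seventh"], start=2
    ):
        counts[name] = int(np.sum(rn == i))

    total = len(rn)
    return {k: (v, 100.0 * v / total) for k, v in counts.items()}
```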
Table 9. Results of tree species classification. Our scores are the average of three runs. Bold text shows the best performance.
Method | mIoU (%) | OA (%)
Lidar (Baseline) [2] | 55.1 | 80.3
Lidar + RGBI [2] | 53.6 | 79.1
Lidar + Elevation [2] | 57.2 | 83.6
Aerial Imagery [2] | 50.0 | 73.1
BEV-MAE (Scratch) | 72.2 | 86.8
BEV-MAE (Ours) | 75.6 | 87.1
Table 10. Detailed results of tree species classification. The scores are IoUs. We show the best run among three runs for ours and scratch. Bold scores show the best performance.
Category | # Tiles | Lidar (Baseline) | Scratch | Ours
Deciduous oak | 48,055 | 73.4 | 78.3 | 78.5
Evergreen oak | 22,361 | 59.4 | 63.2 | 63.7
Beech | 12,670 | 88.8 | 93.1 | 92.2
Chestnut | 3684 | 56.5 | 65.6 | 62.0
Black locust | 2303 | 58.1 | 73.7 | 82.2
Maritime pine | 7568 | 62.9 | 96.2 | 97.5
Scotch pine | 18,265 | 58.6 | 86.7 | 88.2
Black pine | 7226 | 46.2 | 74.3 | 79.0
Aleppo pine | 4699 | 39.3 | 87.9 | 93.7
Fir | 840 | 0.0 | 0.0 | 0.0
Spruce | 4074 | 85.8 | 94.2 | 93.3
Larch | 3294 | 50.6 | 81.7 | 85.6
Douglas | 530 | 36.5 | 78.8 | 93.3
Mean | – | 55.1 | 74.9 | 77.6
Table 11. Results of terrain scene recognition. The scores are the average of three runs. Bold text shows the best performance.
Method | mIoU (%) | OA (%)
PointNet [75] | 65.0 | 77.6
PointNet++ [76] | 87.6 | 93.1
VoxelNeXt [77] | 86.6 | 92.6
BEV-MAE (Scratch) | 86.6 | 92.6
BEV-MAE (Ours) | 87.4 | 93.1
Table 12. Detailed results of terrain scene recognition. The scores represent the IoUs from the best model among three runs. Bold text indicates the best performance among the compared methods. (S1–S2: Metropolis; S3–S5: Small City; S6: Village; S7–S9: Mountain.)
Method | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | S9 | Mean
PointNet | 42.6 | 65.1 | 69.5 | 50.0 | 97.4 | 64.4 | 89.5 | 62.7 | 70.1 | 67.9
PointNet++ | 77.9 | 76.5 | 98.7 | 93.6 | 100.0 | 92.5 | 100.0 | 78.7 | 82.4 | 88.9
VoxelNeXt | 74.1 | 80.5 | 98.7 | 86.8 | 100.0 | 86.1 | 100.0 | 80.3 | 84.3 | 87.8
BEV-MAE (Scratch) | 75.0 | 81.7 | 97.3 | 93.4 | 100.0 | 89.3 | 98.7 | 77.3 | 80.6 | 88.2
BEV-MAE (Ours) | 77.8 | 81.2 | 97.3 | 90.8 | 100.0 | 90.4 | 100.0 | 81.6 | 85.2 | 89.4
Table 13. Results of point cloud semantic segmentation. Bold values indicate the best performance. Random denotes the 3DEP dataset constructed with random sampling instead of the proposed geospatial sampling.
Method | mIoU (%) | OA (%)
PointNet++ [76] | 68.3 | 95.7
KPConv [78] | 81.1 | 97.8
RandLA [79] | 79.3 | 97.1
EyeNet [80] | 79.6 | 97.2
BEV-MAE (Scratch) | 77.9 | 97.3
BEV-MAE (OpenGF [1]) | 77.3 | 97.2
BEV-MAE (Random, 10 samples/project) | 77.6 | 97.3
BEV-MAE (Random, 20 samples/project) | 77.8 | 97.2
BEV-MAE (Random, 40 samples/project) | 77.7 | 97.3
BEV-MAE (Ours, 10 samples/project) | 77.7 | 97.3
BEV-MAE (Ours, 20 samples/project) | 78.0 | 97.3
BEV-MAE (Ours, 40 samples/project) | 78.2 | 97.3
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
