Large-Scale Airborne LiDAR Point Cloud Building Extraction Based on Improved Voxelized Deep Learning Network

Xue, Bai; Song, Yanru; Ai, Pi; Li, Hongzhou; Liu, Shuhan; Guo, Li

doi:10.3390/buildings16071450


by Bai Xue ¹, Yanru Song ²,*, Pi Ai ¹, Hongzhou Li ¹, Shuhan Liu ¹ and Li Guo ¹

¹ Land Satellite Remote Sensing Application Center, MNR, Beijing 100048, China

² China Aero Geophysical Survey and Remote Sensing Center for Nature Resources, Beijing 100083, China

* Author to whom correspondence should be addressed.

Buildings 2026, 16(7), 1450; https://doi.org/10.3390/buildings16071450

Submission received: 26 February 2026 / Revised: 23 March 2026 / Accepted: 1 April 2026 / Published: 7 April 2026

(This article belongs to the Section Construction Management, and Computers & Digitization)


Abstract

High-precision 3D building data are pivotal for smart city development, urban planning, and disaster management. However, large-scale building extraction from airborne LiDAR point clouds remains challenging due to semantic ambiguity, uneven point density, and complex architectural structures. To address these limitations, we propose a novel framework integrating geometric topology perception with cross-dimensional attention mechanisms within a Sparse Voxel Convolutional Neural Network (SPVCNN). The key contributions include: (1) an enhanced LaserMix++ multi-scale hybrid augmentation strategy featuring cross-scene block replacement, ground normal–constrained rotation, and non-uniform scaling; (2) a dual-branch SPVCNN architecture embedding a collaborative module of Geometric Self-Attention (GSA) and Cross-Space Residual Attention (CSRA) to preserve topological consistency and enable cross-dimensional feature interaction; and (3) a Boundary Enhancement Module (BEM) specifically designed to resolve boundary ambiguity and overlapping predictions. Evaluated on a 177 km² dataset covering Washington, D.C., our method significantly outperforms the baseline SPVCNN, improving accuracy by 12.04 percentage points (0.8212 to 0.9416) and Intersection over Union (IoU) by 9.96 percentage points (0.866 to 0.9656). Furthermore, it surpasses mainstream networks such as Cylinder3D and MinkResNet by over 50% in absolute accuracy gain. These results demonstrate the effectiveness of synergistically combining geometric perception with adaptive attention for robust building extraction from large-scale LiDAR data.

Keywords:

LiDAR point cloud; deep learning; attention mechanisms; building extraction; GSA; CSRA; BEM

1. Introduction

With accelerating urbanization in the United States, annual new residential and commercial construction is projected to reach 3.25–3.76 billion square feet by 2026, characterized by a rise in compact single-family homes and mixed-use complexes [1]. This rapid expansion underscores the increasing complexity of 3D urban spatial structures and poses unprecedented challenges for dynamic building stock monitoring. High-precision 3D building spatial data serve as a foundational element for smart city development, critical for decision-making in urban planning, disaster management, and emerging applications such as digital twins and carbon neutrality assessments [2]. However, traditional methods relying on satellite or UAV imagery face inherent limitations—including illumination constraints, shadow interference, and vegetation occlusion—which hinder rapid, high-precision 3D structure acquisition at large scales [3]. In contrast, airborne LiDAR technology has emerged as a transformative solution, leveraging its unique capacity for vegetation penetration and efficient large-area 3D point cloud acquisition [4]. Despite its advantages, airborne LiDAR continues to encounter significant technical challenges in real-world applications, particularly semantic ambiguity in complex urban environments and irregular point cloud density distribution [5]. Consequently, developing efficient and precise methods for urban LiDAR point cloud building extraction remains a critical research priority.

Current building extraction methods using airborne LiDAR data fall into four major categories: geometry- and feature-based analysis methods, deep learning-based methods, multi-source data fusion methods, and graph-structured methods.

Geometry and feature-based analysis methods primarily leverage intrinsic point cloud features (such as normal vectors, curvature, and density) combined with traditional techniques (e.g., region growing, morphological filtering, and Delaunay triangulation) to achieve building segmentation. For instance, regarding neighborhood distribution and projection optimization, Wu et al. [1] proposed an entropy model to determine optimal neighborhood sizes, constructing initial projection planes via Principal Component Analysis (PCA) to extract boundary points. Similarly, Park et al. [2] employed multi-scale supervoxel segmentation integrated with azimuthal distribution models to enhance contour accuracy using neighborhood geometric features. In terms of plane segmentation and regularization, Cao et al. [3] and Gilani et al. [4] utilized normal vector clustering combined with Delaunay triangulation and region growing to extract and regularize roof planes effectively. Alternatively, morphological and elevation-based methods focus on elevation differences; Rottensteiner et al. [5] and Karsli et al. [6] demonstrated this by generating building masks using DSM-DTM differences and octree structures to separate ground points and regularize contours. Furthermore, histogram- and threshold-based methods emphasize feature statistics for category separation. Liu et al. [7] and Guo et al. [8] applied normal vector histograms and multi-scale differential analysis (e.g., Difference of Normals) to distinguish buildings from vegetation. Despite their robustness in regular scenarios, these methods rely heavily on manually preset parameters (e.g., neighborhood radius, curvature thresholds) and exhibit limited adaptability to complex geometries, such as curved roofs or overlapping structures. Consequently, they are prone to mis-segmentation and contour fracture under conditions of uneven density or noise. 
Therefore, overcoming these parameter limitations to develop more adaptive approaches for complex urban scenes remains a critical challenge.

Deep learning-based methods primarily leverage existing point cloud semantic segmentation frameworks to extract building instances. For instance, Hu et al. [9] enhanced PointNet++ by integrating an attention mechanism to better capture discriminative building features. To mitigate data scarcity, Yang et al. [10] proposed a self-supervised learning (SSL) framework to improve edge point recognition without relying on extensive labeled data. Similarly, Kong et al. [11] employed a generative adversarial network (GAN) to refine building contours derived from binary segmentation masks. Despite their success in certain scenarios, these approaches often suffer from three key limitations: strong dependence on high-quality annotated data, high computational complexity, and degraded performance on small-scale structures or regions with low point density.

Multi-source data fusion methods integrate LiDAR point clouds with optical imagery to leverage complementary geometric and spectral information. For instance, Tian et al. [12] proposed a double P-Snake model that fuses LiDAR-derived elevation cues with image texture to refine building contours. While such approaches achieve high accuracy in conventional urban scenes, they often suffer from significant missed detections in regions with sparse point cloud coverage. To address this, Wang et al. [13] mapped 2D image-derived edge lines into 3D space, integrating them with RANSAC-based optimization to suppress false segmentation. However, these methods remain susceptible to interference from high-density urban elements, such as billboards and vegetation, which introduce spurious edges and degrade contour fidelity.

Graph-structured methods analyze building geometry through topological modeling. For example, Jiang et al. [14] proposed a cascaded framework in which corner points are first detected and their interconnections subsequently modeled using a graph neural network (GNN). While this approach achieves high precision for regular buildings, it exhibits significant errors in modeling curved roofs, as initial corner localization inaccuracies propagate through the reconstruction pipeline. Although the method demonstrates robustness in complex scenarios, such as handling facade openings, its computational complexity grows exponentially with point cloud size. This renders it impractical for large-scale urban scenes or structures with irregular geometries.

In this work, we propose a novel framework that integrates geometric topology perception with cross-dimensional attention to address key challenges in large-scale LiDAR building extraction. Our main contributions are: (1) A multi-scale hybrid augmentation strategy (LaserMix++) that enriches training data diversity while preserving building geometry [15];

(2) A dual-branch SPVCNN architecture embedding a collaborative Geometric Self-Attention (GSA) and Cross-Space Residual Attention (CSRA) mechanism for topology-preserving feature encoding and cross-dimensional interaction [16,17,18];

(3) A Boundary Enhancement Module (BEM) to improve boundary delineation in overlapping or complex building structures [19].

This paper is organized into six sections. Section 1 introduces the research background, reviews the current status of related work, and clarifies the contributions of this study. Section 2 describes the experimental data in detail, including dataset composition and collection sources. Section 3 presents the core methodology, detailing the preprocessing pipeline, the proposed model, its theoretical basis, architecture design, and implementation specifics. Section 4 focuses on experimental verification, evaluating the algorithm’s performance through comparative experiments. Section 5 provides an in-depth discussion of the results, analyzing implications and limitations. Finally, Section 6 summarizes the key findings and outlines potential directions for future improvement.

2. Experimental Data

This study uses a 2018 airborne LiDAR dataset (LAS format) covering 177 km² of Washington D.C. [20]. The dataset was acquired using a Leica ALS80 sensor (Leica Geosystems AG, Heerbrugg, Switzerland) mounted on a Piper Navajo PA31-50 aircraft (Piper Aircraft, Inc., Vero Beach, FL, USA; registration N62912), flying at an average altitude of 9700 feet. The survey included 24 flight lines (22 primary and 2 crosslines), with the ALS80 chosen specifically to comply with Washington D.C.’s strict flight regulations while minimizing altitude changes. All LiDAR systems used onboard GPS positioning, supported by 31 ground control points (6 GCPs, 20 NAD checkpoints, and 5 VVA checkpoints). The coordinate system used was Maryland State Plane (FIPS1900) with NAD1983/NAD83 (2011) horizontal datums and NAVD88 (GEOID12B) vertical datum. The dataset achieved exceptional elevation accuracy with 0.027 m RMSE (0.051 m at 95% confidence level). The point cloud has an average density of 6.16 points/m² (mean spacing of 0.40 m) across the study area. The classified data includes 11 semantic categories: unclassified points, bare ground, low/medium/high vegetation, buildings, low/high noise, water, bridge decks, and ignored ground points. Some areas were intentionally omitted due to U.S. Secret Service security requirements. This high-precision dataset is now publicly available on Amazon Web Services, serving as a benchmark for urban 3D modeling in Washington D.C. municipal planning.

As shown in Figure 1, the classification label rendering and intensity value rendering results (including 3D and planar views) of the point cloud data demonstrate continuous and homogeneous coverage with distinct intensity distribution, meeting the data requirements for algorithmic research.

3. Method

The proposed framework employs SPVCNN as its core backbone, achieving precise building point cloud extraction through a three-stage design. First, we propose an enhanced LaserMix++ multi-scale hybrid augmentation strategy, which employs cross-scene point cloud block replacement with probability-driven sampling, coupled with ground normal–constrained rotation matrices and non-uniform scaling. This strategy significantly enhances data diversity while preserving the topological relationships of building structures. Second, a collaborative mechanism of GSA and CSRA is embedded within the SPVCNN dual-branch architecture. Topological preservation encoding of building geometric features is achieved via dynamic voxel granularity adjustment and the GSA module, while the channel–space dual-path CSRA module is designed to establish cross-dimensional interaction of multi-scale features. Finally, we introduce a BEM to effectively resolve segmentation challenges in highly overlapping structures and mitigate boundary ambiguity issues. The specific implementation details and comparative experiments validating the method are presented in the following sections.

3.1. Point Cloud Preprocessing

The original point cloud exhibits complete spatial coverage, consistent intensity values, and minimal outlier noise. To further enhance the diversity and representativeness of the training data, preprocessing includes cross-scene semantic fusion and spatial distribution optimization. An improved LaserMix++ data augmentation strategy is applied, as illustrated in Figure 2. LaserMix++ performs cross-scene block replacement, constrained rotations, and non-uniform scaling, while largely preserving local geometric structures of buildings. This approach increases the variability of training samples and supports more effective feature learning. Although extremely sparse regions or atypical building types may still pose challenges for fully capturing structural variations, subsequent stages of the network are designed to compensate for such cases, ensuring reliable extraction performance.

3.1.1. Cross-Scene Semantic Fusion

First, the original point cloud P = \{p_{i} \mid p_{i} \in \mathbb{R}^{N_{i} \times (3 + I)}\} (where I is the intensity value) is cropped into cubes at a given scale:

B^{s} = \mathrm{CropCube}(P, s)

(1)

where s is the cube edge length. Then, blocks from the original scene, B_{s}^{s_{k}}, and from the target scene, B_{t}^{s_{k}}, are randomly selected for probabilistic replacement:

\tilde{P}_{t} = \bigcup_{k = 1}^{K} \left( \mathbb{I}_{\{r_{k} < a\}} \cdot B_{s}^{s_{k}} \cup \mathbb{I}_{\{r_{k} \geq a\}} \cdot B_{t}^{s_{k}} \right)

(2)

where a is the mixing ratio, r_{k} \sim U(0, 1), \mathbb{I} is the indicator function, and s_{k} is the randomly selected spatial scale. In this work, the mixing ratio a is set to 0.5. In our implementation, this augmentation is applied to each batch with a probability of 0.5. The augmentation is performed on the raw point clouds before voxelization, ensuring that the structural and geometric relationships are preserved during the subsequent voxelization stage.
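The block replacement in Equations (1) and (2) can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: it assumes both scenes share one coordinate frame, uses a single fixed edge length s instead of a randomly selected s_k, and the function names `crop_cubes` and `cross_scene_mix` are hypothetical.

```python
import numpy as np

def crop_cubes(points, s):
    """Group points into axis-aligned cubes of edge length s (Eq. 1).

    points: (N, 3 + I) array with xyz in the first three columns.
    Returns a dict mapping integer cube index -> row indices of its points.
    """
    keys = np.floor(points[:, :3] / s).astype(np.int64)
    uniq, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # shape differs across NumPy versions
    return {tuple(k): np.flatnonzero(inverse == i) for i, k in enumerate(uniq)}

def cross_scene_mix(source, target, s, a=0.5, rng=None):
    """Per cube, keep source points if r_k < a, otherwise target points (Eq. 2)."""
    rng = np.random.default_rng() if rng is None else rng
    src_cubes = crop_cubes(source, s)
    tgt_cubes = crop_cubes(target, s)
    mixed = []
    for key in sorted(set(src_cubes) | set(tgt_cubes)):
        r_k = rng.uniform(0.0, 1.0)            # r_k ~ U(0, 1)
        scene, cubes = (source, src_cubes) if r_k < a else (target, tgt_cubes)
        idx = cubes.get(key)
        if idx is not None:                    # the chosen scene may be empty here
            mixed.append(scene[idx])
    if not mixed:
        return np.empty((0, source.shape[1]))
    return np.concatenate(mixed, axis=0)
```

With a = 0.5, each cube is drawn from either scene with equal probability, which matches the mixing ratio stated above.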

3.1.2. Spatial Distribution Optimization

First, the rotation matrix R is applied to the point cloud obtained in the previous section to constrain the ground normal vector:

R = R_{z}(ϕ) \cdot R_{xy}(θ), \quad θ = \arccos(n_{g} \cdot z)

(3)

where ϕ \sim U(0, 2π) and R_{xy} aligns the ground normal vector n_{g} with the vertical direction z, while the random orientation of the ground within the xoy plane is preserved. Then, a non-uniform scaling strategy is adopted to maintain the building structure and increase the robustness of training:

\mathrm{Scale}(x, y, z) = (λ_{x}, λ_{y}, 0.5 (λ_{x} + λ_{y})), \quad λ_{x}, λ_{y} \sim U(0.8, 1.2)

(4)

where λ_{x} and λ_{y} are the independent scaling coefficients in the X and Y directions, and the Z coefficient is their mean.
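A minimal NumPy sketch of Equations (3) and (4) is given below. It assumes that R_xy is the Rodrigues rotation taking n_g onto the z-axis and reads Equation (4) as per-axis scale factors applied to each point; the helper names `align_to_z`, `rot_z`, and `augment` are hypothetical.

```python
import numpy as np

def align_to_z(n_g):
    """Rotation R_xy that maps the ground normal n_g onto the z-axis."""
    n = np.asarray(n_g, dtype=float)
    n = n / np.linalg.norm(n)
    z = np.array([0.0, 0.0, 1.0])
    axis = np.cross(n, z)
    sin_t = np.linalg.norm(axis)
    cos_t = float(np.dot(n, z))                # theta = arccos(n_g . z)
    if sin_t < 1e-12:
        # Already vertical, or pointing straight down (rotate 180 deg about x).
        return np.eye(3) if cos_t > 0 else np.diag([1.0, -1.0, -1.0])
    k = axis / sin_t
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    # Rodrigues' formula for a rotation by theta about unit axis k.
    return np.eye(3) + sin_t * K + (1.0 - cos_t) * (K @ K)

def rot_z(phi):
    """Rotation R_z(phi) about the vertical axis."""
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def augment(xyz, n_g, rng=None):
    """Apply R = R_z(phi) R_xy(theta), then non-uniform scaling."""
    rng = np.random.default_rng() if rng is None else rng
    phi = rng.uniform(0.0, 2.0 * np.pi)        # phi ~ U(0, 2*pi)
    R = rot_z(phi) @ align_to_z(n_g)
    lam_x, lam_y = rng.uniform(0.8, 1.2, size=2)
    scale = np.array([lam_x, lam_y, 0.5 * (lam_x + lam_y)])
    return (xyz @ R.T) * scale
```

Because the z factor is the mean of the two horizontal factors, building heights stay roughly proportional to their footprints after scaling.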

3.2. Improved SPVCNN

This study proposes an improved SPVCNN model, whose core innovation lies in organically integrating GSA and CSRA into a dual-branch feature learning framework. As shown in Figure 3, the overall architecture includes a five-stage progressive feature encoding and decoding process, with specific improvement steps as follows:

3.2.1. Dual Branch Feature Encoder

Voxelization is a fundamental technique that discretizes 3D space into uniform volumetric units (voxels). Originating from medical imaging in the 1980s (e.g., CT/MRI 3D reconstruction), this concept was adapted for organ tissue modeling via voxel grids. With advancements in computer vision, voxelization was introduced to 3D point cloud processing in the early 21st century, transforming disordered points into structured formats to accommodate traditional convolutional operations (e.g., Voxel Grid filtering). After 2015, its adoption surged with the rise of deep learning architectures such as VoxNet and SPVCNN. By balancing geometric precision with computational efficiency, voxelization has significantly enhanced performance in 3D object detection and segmentation tasks.

However, as LiDAR scanning density increases, higher voxel resolutions are required to maintain precision, leading to expanded convolution sizes and excessive GPU memory consumption. For instance, in PVCNN, a voxel-based network structure running on a single GPU (24 GB memory) supports a limited voxel capacity (e.g., approximately 1.6 million voxels for a 100 m × 100 m × 10 m scene at 0.8 m × 0.8 m × 0.1 m resolution). In such scenarios, smaller structures occupy fewer voxel grids, making it difficult to extract discriminative features and resulting in reduced performance. Furthermore, the computational time for standard voxel convolution via sliding windows often exceeds practical limits (e.g., 35 s per scene).

To address these challenges, researchers have proposed sparse convolution. Sparse convolution matrices are optimized for sparse data (such as 3D point clouds with low non-zero voxel ratios), computing only within effective activation regions. By utilizing hash tables or coordinate mappings to skip zero-value areas, this approach significantly improves computational efficiency. TorchSparse, a PyTorch extension library designed for sparse 3D data, enables efficient sparse convolution and pooling operations while supporting dynamic sparse tensor storage and GPU acceleration. It has been widely applied in point cloud segmentation and detection tasks. In this study, we adopt the tensor format of TorchSparse to represent voxels, which not only ensures model training with lower memory requirements but also maintains high resolution, allowing the system to capture detailed information under limited computational conditions.
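The difference between dense and sparse voxel storage can be illustrated with a short NumPy sketch. It does not use the TorchSparse API; it only mimics the idea of storing occupied voxels as integer coordinates plus features. The helper names `dense_voxel_count` and `sparse_voxelize`, and the choice of mean pooling per voxel, are assumptions made for illustration.

```python
import numpy as np

def dense_voxel_count(extent_m, voxel_m):
    """Number of cells in a dense grid covering the given extent."""
    ratio = np.asarray(extent_m, dtype=float) / np.asarray(voxel_m, dtype=float)
    # Round first so floating-point noise does not add a spurious cell.
    return int(np.prod(np.ceil(np.round(ratio, 9))))

def sparse_voxelize(points, voxel_m):
    """Keep only occupied voxels.

    points: (N, C) array with xyz in the first three columns.
    Returns (occupied integer coords, mean feature per voxel, point-to-voxel index).
    """
    coords = np.floor(points[:, :3] / np.asarray(voxel_m, dtype=float)).astype(np.int64)
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    sums = np.zeros((len(uniq), points.shape[1]))
    np.add.at(sums, inverse, points)           # accumulate features per voxel
    counts = np.bincount(inverse, minlength=len(uniq))
    return uniq, sums / counts[:, None], inverse

# Dense grid for the 100 m x 100 m x 10 m scene at 0.8 m x 0.8 m x 0.1 m:
# 125 * 125 * 100 = 1,562,500 cells, i.e. roughly 1.6 million voxels.
n_dense = dense_voxel_count((100, 100, 10), (0.8, 0.8, 0.1))
```

In a sparse representation the memory cost scales with the number of occupied voxels returned by `sparse_voxelize`, not with `n_dense`.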

The preprocessed point cloud \tilde{P}_{t} of each scene from the previous section is input for multi-scale voxelization:

V^{s} = \mathrm{Voxelize}(\tilde{P}_{t}, r^{s})

(5)

where r^{s} represents the resolution parameter. To avoid redundant computations or information loss caused by fixed resolution, we adopt a dynamic voxel granularity that adjusts voxel size based on local point cloud density and geometric complexity. This approach reduces computational load in flat areas (e.g., walls) using coarse-grained voxels while preserving structural details in intricate regions (e.g., window frames), achieving an optimal balance between computational efficiency and accuracy:

r^{s} = \mathrm{BaseRes} \cdot (1 + β \cdot \mathrm{Entropy}(\tilde{P}_{local}))

(6)

where β is a hyperparameter controlling the influence of local entropy on voxel size (set to 0.5 in our experiments) and Entropy is the local entropy of \tilde{P}. Let \tilde{P}_{local} = \{p_{1}, p_{2}, \dots, p_{n}\} be the set of points in a local voxel neighborhood. The local entropy is computed as:

\mathrm{Entropy}(\tilde{P}_{local}) = - \sum_{i = 1}^{n} p_{i} \log(p_{i})

(7)

where, inside the sum, p_{i} denotes the normalized probability associated with point p_{i} in a local feature distribution (e.g., density of points or voxel occupancy). This metric captures the geometric complexity of the local region: higher entropy indicates more irregular or complex local structures. In general, the resolution ranges from approximately 0.25 to 2 times the average point spacing; for the experimental dataset used in this work, the voxel resolution r^{s} varies between 0.1 m and 0.8 m. Resolutions that are too low can cause small or narrow structures to be missed and result in a loss of geometric detail, while resolutions that are too high can lead to excessive computational cost and memory consumption without significant improvement in feature extraction.
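Equations (6) and (7) can be evaluated with a few lines of NumPy. The sketch below makes several assumptions that the text leaves open: the local distribution is taken to be a histogram of point counts over sub-cells of the neighborhood, the logarithm is natural, and the result is clamped to the 0.1 to 0.8 m range reported for this dataset. The formula is applied literally as written, so r^s grows with entropy. The function names are hypothetical.

```python
import numpy as np

def local_entropy(counts):
    """Shannon entropy of a normalized occupancy histogram (Eq. 7)."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total == 0:
        return 0.0
    p = counts[counts > 0] / total             # skip empty bins: 0 * log(0) := 0
    return float(-(p * np.log(p)).sum())

def dynamic_resolution(counts, base_res, beta=0.5, r_min=0.1, r_max=0.8):
    """r^s = BaseRes * (1 + beta * Entropy), clamped to [r_min, r_max] (Eq. 6)."""
    r = base_res * (1.0 + beta * local_entropy(counts))
    return float(np.clip(r, r_min, r_max))
```

For example, a neighborhood whose points all fall into one sub-cell has zero entropy and keeps the base resolution, whereas four equally occupied sub-cells give an entropy of log 4 and a resolution of BaseRes · (1 + 0.5 log 4).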

GSA. Building point clouds usually have regular structural features (such as flat walls, sharp edges) and complex geometric distribution (such as door and window details). Traditional voxel convolution (such as MinkUNet) tends to lose these geometric details due to regular grid processing. The GSA module employs voxel-centered coordinates as geometric priors to dynamically adjust feature aggregation weights, assuming that local voxel blocks exhibit a regular geometric structure that can guide attention weighting. This mechanism enables smooth processing of flat wall areas and sharpening of edge/corner regions, thereby preserving and enhancing geometry-sensitive feature representations in residual connections. Dynamic voxelization adjusts voxel sizes based on local point density and structural complexity to balance computation and detail preservation. This design significantly improves the network’s recognition accuracy for architectural structures (such as facade continuity and roof contours), particularly mitigating resolution loss and geometric blurring caused by voxelization in large-scale scenarios, though it may be less effective for highly irregular structures or extremely low-density voxel regions. The specific improvements include inserting the GSA module after the original MinkUNet convolutional layer to achieve ensemble structure enhancement:

F_{v}^{l + 1} = \mathrm{GSA}(σ(\mathrm{MinkUNet}(F_{v}^{l})) \oplus F_{v}^{l}, C_{v})

(8)

where C_v is the voxel center coordinate, ⊕ represents the residual connection, and σ is the nonlinear activation function (ReLU). The specific GSA architecture is shown in Figure 4.

CSRA. The core advantage of inserting the CSRA module into the point cloud branch is that it significantly improves the modeling capability of the network on the local structure of building point cloud by dynamically combining the geometric context information of sampling point coordinates, assuming that neighboring points provide reliable geometric context. Traditional point cloud processing (such as simple MLP or KNN) tends to ignore the spatial dependence between points, while CSRA explicitly associates sampling points with their neighbors’ geometric distribution through an attention mechanism and adaptively weights feature aggregation. While effective in dense and moderately structured areas, CSRA may have reduced effectiveness in extremely sparse or noisy regions where neighborhood information is insufficient. This design not only makes up for the loss of local information caused by farthest point sampling (FPS), but also realizes the understanding of building structure from local to global through hierarchical feature propagation (multi-level CSRA), and finally improves the boundary accuracy and semantic consistency of segmentation or reconstruction tasks. The specific improvements are as follows: Construct a hierarchical structure with FPS and MLP, and insert CSRA after each level of sampling:

F_{p}^{l} = \mathrm{CSRA}(\mathrm{MLP}(F_{p}^{l - 1}), P_{c}^{l})

(9)

where P_{c}^{l} is the coordinate of the current sampling point. The specific CSRA architecture is shown in Figure 5.

3.2.2. GSA-CSRA Fusion Mechanism

The GSA-CSRA collaborative attention mechanism addresses the geometric and semantic misalignment between point cloud and voxel features. Building point clouds require both precise geometric details (e.g., edges and surfaces) and high-level semantic understanding (e.g., wall and roof categories), which a single feature branch struggles to handle simultaneously. This mechanism explicitly correlates the spatial relationships between these representations through deformable attention: it maps voxel grid semantic features to point cloud geometric positions while enabling adaptive gating fusion to balance their contributions. In simple structural areas (e.g., flat walls), it prioritizes stable voxel semantic features, whereas in complex areas (e.g., decorative components), it enhances point cloud geometric details. This design significantly improves boundary accuracy and semantic coherence in building point cloud segmentation, demonstrating particular adaptability to intricate architectural structures.

Geometric-semantic feature alignment. Two branch features are associated through a deformable attention mechanism:

A_{geo} = \mathrm{Softmax}\left(\frac{Q_{v} K_{p}^{T}}{\sqrt{d}} + Δ P_{vp}\right)

(10)

where Δ P_{vp} represents a voxel-point offset-based position encoding used to correct spatial misalignment between voxels and the real point cloud; Q_{v} denotes the query vector derived from the linear projection of feature F_{v} in the voxel branch; K_{p} is the key vector obtained from the linear projection of feature F_{p} in the point cloud branch; d is the dimension of the feature vectors, so that \sqrt{d} acts as the scaling factor; and T stands for the transpose operation.
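The attention map in Equation (10) can be sketched in NumPy as follows. The projection matrices `W_q` and `W_k` and the helper `offset_bias` are hypothetical; in particular, using the negative Euclidean distance between voxel centers and points as the position term is only a stand-in, since the exact form of the offset encoding is not specified here.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def offset_bias(voxel_centers, points):
    """Stand-in for the voxel-point position term: negative pairwise distance."""
    diff = voxel_centers[:, None, :] - points[None, :, :]
    return -np.linalg.norm(diff, axis=-1)      # shape (M, N)

def geo_attention(F_v, F_p, W_q, W_k, pos_bias):
    """A_geo = Softmax(Q_v K_p^T / sqrt(d) + pos_bias), one row per voxel."""
    Q_v = F_v @ W_q                            # (M, d) queries from the voxel branch
    K_p = F_p @ W_k                            # (N, d) keys from the point branch
    d = Q_v.shape[-1]
    logits = Q_v @ K_p.T / np.sqrt(d) + pos_bias
    return softmax(logits, axis=-1)            # each row sums to 1 over the points
```

Each row of the result weights the N points for one voxel, so content similarity and spatial proximity jointly decide which point features a voxel attends to.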

Dual-path feature fusion. The adaptive weighted features are obtained by adopting a gating fusion strategy:

F_{fuse} = γ \cdot \mathrm{GSA\text{-}Norm}(F_{v}) + (1 - γ) \cdot \mathrm{GSA\text{-}Norm}(F_{p})

(11)

where the gating coefficient γ is dynamically generated by the regional complexity:

γ = σ(\mathrm{MLP}(\mathrm{Entropy}(F_{v} \oplus F_{p})))

(12)

3.3. Decoder Design with BEM

The introduction of the BEM in architectural point cloud segmentation primarily addresses the blurring issue of point cloud or voxel features at object boundaries. Edges in architectural point clouds, such as wall corners and door/window outlines, typically contain critical geometric structural information. However, traditional convolution operations often weaken boundary features due to local smoothing effects. The BEM explicitly extracts high-frequency gradient information from predicted feature maps using the 3D Sobel operator and integrates it into the final prediction through weighted fusion, particularly for architectural regions, assuming that building boundaries exhibit sufficient gradient information. This design significantly enhances boundary segmentation sharpness while avoiding noise introduction to non-architectural areas like vegetation, although false boundary responses may still occur in areas with dense vegetation, occlusions, or other non-building structures, which is mitigated by combining the responses with the predicted mask. The formulas are shown in Equations (13) and (14), introducing BEM into the final prediction layer:

P_{edge} = \mathrm{Sobel3D}(F_{d}^{1})

(13)

P_{final} = P_{pred} + λ \cdot P_{edge} \cdot \mathbb{I}_{building}

(14)

where P_{edge} is the boundary response map; λ is the learnable or preset weighting coefficient, set to 0.2 in this work to balance the contribution of the boundary response in the final feature fusion; \mathbb{I}_{building} is the binary mask of the building area, taken from the predicted mask output by the network at the current stage, which ensures that the Boundary Enhancement Module (BEM) extracts high-frequency boundary information from the predicted building regions; and P_{final} is the final prediction result.
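Equations (13) and (14) can be illustrated on a dense, single-channel volume with NumPy. This is a simplified sketch: the actual network operates on sparse voxel features, and the edge replication at the volume border, the 0.5 threshold used to form the building mask, and the function names are assumptions.

```python
import numpy as np

def _filter_axis(vol, kernel, axis):
    """Correlate vol with a 3-tap kernel along one axis (edge-replicated)."""
    pad = [(1, 1) if a == axis else (0, 0) for a in range(vol.ndim)]
    padded = np.pad(vol, pad, mode="edge")
    n = vol.shape[axis]
    taps = [np.take(padded, np.arange(i, i + n), axis=axis) for i in range(3)]
    return kernel[0] * taps[0] + kernel[1] * taps[1] + kernel[2] * taps[2]

def sobel3d(vol):
    """Gradient magnitude from separable 3D Sobel filters (Eq. 13)."""
    smooth, diff = (1.0, 2.0, 1.0), (-1.0, 0.0, 1.0)
    sq = np.zeros(vol.shape, dtype=float)
    for axis in range(3):
        g = vol.astype(float)
        for a in range(3):
            # Differentiate along `axis`, smooth along the other two axes.
            g = _filter_axis(g, diff if a == axis else smooth, a)
        sq += g ** 2
    return np.sqrt(sq)

def boundary_enhance(p_pred, feat, lam=0.2, threshold=0.5):
    """P_final = P_pred + lam * P_edge * 1_building (Eq. 14)."""
    mask = (p_pred > threshold).astype(float)  # building mask from the prediction
    p_edge = sobel3d(feat)
    return p_pred + lam * p_edge * mask
```

Multiplying by the mask means that edge responses outside predicted building voxels, for example in vegetation, leave the prediction unchanged.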

3.4. Comparison Attention Modules

To assess the effectiveness of the proposed collaborative attention mechanism, it is compared in the experiments against four classical attention modules: Squeeze-and-Excitation (SE) [21], Convolutional Block Attention Module (CBAM) [22], Local Feature Aggregation (LFA) [23], and Pyramid Transformer (PT) [24]. Each module is introduced below.

The SE module (Figure 6), a classic channel attention mechanism, enhances feature representation by explicitly modeling inter-channel dependencies. It first applies global average pooling to compress spatial dimensions and generate channel statistics. Two fully connected layers then learn nonlinear relationships between channels, ultimately producing channel weights through Sigmoid activation. These weights are multiplied with original features channel-by-channel to achieve feature reweighting. The lightweight design of the SE module allows flexible integration into various convolutional networks, though its limitation lies in focusing solely on channel dimensions while neglecting spatial relationships.

Building upon the SE architecture, CBAM introduces a dual attention mechanism that integrates channel and spatial components (Figure 7). The channel attention branch extends SE’s squeeze step with a dual-path (average and max) pooling structure, while the spatial attention branch captures contextual information through channel-wise pooling followed by a 7 × 7 convolution. This cascaded design enables CBAM to model feature relationships more comprehensively, with the large convolutional kernel expanding the receptive field and thereby enhancing spatial attention effectiveness. However, its global convolutional processing for spatial attention lacks the fine-grained positional awareness inherent in GSA’s row–column separated position embedding strategy.

The LFA module (Figure 8) is specifically designed for irregular data, enhancing local geometric perception through explicit neighborhood search and feature aggregation. It requires coordinate information to construct a K-nearest neighbor graph, then performs neighborhood feature fusion based on distance or learnable weights. This hard attention mechanism based on physical space distance contrasts sharply with the soft attention implemented by GSA through learnable embeddings. The former relies more on precise coordinate input, while the latter implicitly learns positional relationships through parameterized methods.

The PT module (as shown in Figure 9) extends the self-attention mechanism to a multi-scale feature pyramid, capturing rich contextual information through cross-level feature interactions. Its core lies in constructing a multi-scale feature map pyramid and establishing key-query interaction paths across different scales. While this design enhances global modeling capabilities, it significantly increases computational complexity. In contrast, GSA achieves precise location awareness at a lower cost through a single-scale design that synergizes content and positional attention mechanisms.

3.5. Accuracy Assessments

To verify the performance of the proposed method, three evaluation metrics are adopted: Accuracy (Acc), Intersection over Union (IoU), and Boundary F1-score. Acc measures the overall correctness of point-wise classification, IoU evaluates the overlap between predicted building regions and ground-truth annotations, and Boundary F1-score further assesses the quality of predicted building contours.

The formulas for IoU and Acc are given in Equations (15) and (16), respectively:

\mathrm{IoU} = \frac{TP}{TP + FP + FN}

(15)

\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}

(16)

The definitions of TP, FP, FN and TN are given in Table 1. In this work, Acc and IoU are computed specifically for the binary building vs. non-building case, rather than as an average over multiple semantic classes.

In addition, to better evaluate the quality of predicted building contours, the Boundary F1-score is introduced as a complementary boundary-sensitive metric:

\text{Boundary F1-score} = 2 \cdot \frac{\mathrm{Precision}_{b} \cdot \mathrm{Recall}_{b}}{\mathrm{Precision}_{b} + \mathrm{Recall}_{b}}

(17)

where

\mathrm{Precision}_{b} = \frac{\mathrm{TP}_{b}}{\mathrm{TP}_{b} + \mathrm{FP}_{b}}

(18)

\mathrm{Recall}_{b} = \frac{\mathrm{TP}_{b}}{\mathrm{TP}_{b} + \mathrm{FN}_{b}}

(19)

Here, TP_b, FP_b, and FN_b represent true positive, false positive, and false negative points located on the boundary region of the building. The boundary region is defined as a narrow band around the true building edges, with a width of one voxel.
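A minimal sketch of the boundary-restricted metric is given below. It assumes that a boolean mask marking the points inside the one-voxel boundary band is already available; how that band is derived from the ground truth is not shown, and the name `boundary_f1` is hypothetical.

```python
def boundary_f1(pred, truth, on_boundary):
    """Boundary F1-score over points flagged as lying in the boundary band.

    pred, truth: binary labels (1 = building, 0 = non-building).
    on_boundary: booleans, True for points inside the assumed boundary band.
    """
    tp = fp = fn = 0
    for p, t, b in zip(pred, truth, on_boundary):
        if not b:
            continue  # points outside the band do not contribute
        if p == 1 and t == 1:
            tp += 1
        elif p == 1 and t == 0:
            fp += 1
        elif p == 0 and t == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

pred = [1, 1, 0, 1, 0, 1]
truth = [1, 0, 1, 1, 0, 1]
band = [True, True, True, True, False, False]
score = boundary_f1(pred, truth, band)
```

Restricting the counts to the band means that errors in building interiors leave this score unchanged, which is why it complements Acc and IoU.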

4. Experiment

4.1. Experimental Environment and Parameter Settings

The experiments were implemented using Python 3.8, PyTorch 2.4.1, and CUDA 11.8. The hardware platform consisted of an Intel Core i5-14600KF CPU (base frequency 3.5 GHz), an NVIDIA GeForce RTX 3090 GPU (24 GB), and 32 GB of system memory. The specific hyperparameter settings for the training process are listed in Table 2. We employed the AdamW optimizer with an initial learning rate of 0.008 and a weight decay coefficient of 0.01. The model was trained for a maximum of 250 epochs with a batch size of 4. A dynamic learning rate adjustment strategy was utilized to ensure rapid convergence in the early phase and precise parameter refinement in later stages. The final model performance was evaluated based on the best results achieved throughout the entire training process.
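For orientation, the reported settings can be collected in a small configuration together with an example learning-rate schedule. The schedule itself is an assumption: the text only states that a dynamic adjustment strategy was used, so cosine annealing is shown here purely as one plausible choice, and the names `CONFIG` and `cosine_lr` are hypothetical.

```python
import math

# Values taken from Table 2; the dictionary layout itself is illustrative.
CONFIG = {
    "optimizer": "AdamW",
    "initial_lr": 0.008,
    "weight_decay": 0.01,
    "epochs": 250,
    "batch_size": 4,
}

def cosine_lr(epoch, base_lr=CONFIG["initial_lr"], total=CONFIG["epochs"], min_lr=0.0):
    """Assumed schedule: cosine annealing from base_lr down to min_lr."""
    progress = min(max(epoch, 0), total) / total
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Learning rate at the start, midpoint, and end of training under this assumption.
schedule_samples = [cosine_lr(e) for e in (0, 125, 250)]
```

A schedule of this shape keeps the learning rate high during the early epochs and reduces it smoothly afterwards, which matches the stated goal of fast early convergence followed by fine refinement.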

4.2. Comparison of Deep Learning Methods

To comprehensively evaluate the effectiveness of our proposed method, we conduct comparative experiments with several mainstream deep learning approaches for building extraction from airborne LiDAR point clouds. The loss curves of different networks are shown in Figure 10.

Figure 10 presents the training loss curves of all compared networks over 250 epochs. As observed, all methods exhibit consistent and stable convergence behavior throughout the training process. The loss values decrease rapidly during the initial 100 epochs and gradually stabilize thereafter, indicating effective learning and optimization. No obvious signs of abnormal optimization behavior are observed in the presented curves, such as sharply diverging validation trends or persistent high-loss plateaus. The smooth downward trends across all methods suggest that the experimental settings, including the learning rate schedule and regularization strategy, are generally appropriate. Among all approaches, our method achieves the lowest final loss value, indicating stronger optimization performance on the presented training and validation setting.

To further validate the choice of network architecture, we also compared our method with current mainstream point cloud semantic segmentation networks; the results are shown in Table 3.

In comparative experiments with mainstream point cloud deep learning architectures, our proposed method demonstrates substantial advantages over existing approaches. As shown in Table 3, classic models such as Cylinder3D (Acc: 0.4189, IoU: 0.6348) and PointNet++ (Acc: 0.3562, IoU: 0.6021) exhibited relatively limited performance, primarily due to their simplified architectures or constrained receptive fields. RegNet achieved moderate results with an accuracy of 0.4027 and IoU of 0.6919. While dynamic convolution-based PAConv (Acc: 0.4873, IoU: 0.7116) and sparse voxel network-based MinkResNet (Acc: 0.5328, IoU: 0.7424) showed notable improvements, they still suffered from insufficient feature extraction capabilities for complex building structures [25,26,27,28]. In contrast, our method achieves remarkable performance with 94.16% accuracy and 96.56% IoU, representing absolute improvements of 40.88 percentage points in accuracy and 22.32 percentage points in IoU compared to the second-best method (MinkResNet). This corresponds to relative improvements of 76.7% in accuracy and 30.1% in IoU. These substantial gains validate the effectiveness of our innovative geometric self-attention and cross-space residual attention mechanisms in capturing discriminative features and modeling complex geometric relationships in airborne LiDAR point clouds.
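The distinction between absolute gains (percentage points) and relative gains (percent of the reference value) can be checked with a short calculation on the Table 3 values. The helper name `gains` is hypothetical.

```python
def gains(ours, other):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_pp = (ours - other) * 100.0
    relative_pct = (ours - other) / other * 100.0
    return round(absolute_pp, 2), round(relative_pct, 1)

# Proposed method vs. MinkResNet (second-best in Table 3).
acc_gain = gains(0.9416, 0.5328)
iou_gain = gains(0.9656, 0.7424)
```

The absolute gain is the plain difference of the two scores, whereas the relative gain divides that difference by the reference score, which is why the relative figures are larger.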

Table 4 summarizes the training time, inference time, and VRAM usage of the six deep learning networks compared on the experimental dataset. The proposed method achieves the shortest training time among all tested networks (3.2 h), lower than MinkResNet (3.5 h) and the other baselines, indicating an efficient optimization process. Its inference time (7 min per scene) is second only to that of MinkResNet (6 min), demonstrating practical applicability for large-scale point clouds. The VRAM consumption of our method (23 GB) is comparable to that of MinkResNet (23.2 GB) and higher than that of the remaining networks (20–22 GB); this overhead is justified by the substantial gains in segmentation accuracy and boundary fidelity (as shown in Table 3). Overall, the results indicate that the proposed approach maintains a favorable balance between computational efficiency and high-quality extraction performance, addressing concerns regarding resource costs. To further examine the algorithm’s performance, Figure 11 displays detailed comparison results for each method.

As shown in Figure 11, other deep learning networks exhibit significant limitations compared to our method. (a) demonstrates that Cylinder3D, despite employing irregular voxel primitives, proves inadequate for architectural extraction from airborne point clouds. (b) reveals that PointNet++’s outdated algorithm fails to handle large-scale urban point cloud tasks. (c) indicates that PAConv, while improving point feature aggregation, still misidentifies non-architectural points (rectangularly arranged forests) as building points. Similar issues are observed in (d) and (e).

4.3. Effect of Different Attention Mechanisms

We use SPVCNN as the backbone network and combine it with the attention modules introduced in Section 3.4 for comparative experiments. The results of these experiments and of our proposed method are shown in Table 5.

As shown in Table 5, the proposed method achieves substantial improvements over the baseline SPVCNN (Acc: 0.8212, IoU: 0.8660, Boundary F1-score: 0.7825), with Acc increasing by 12.04 percentage points to 0.9416, IoU by 9.96 percentage points to 0.9656, and Boundary F1-score by 12.99 percentage points (a relative gain of 16.60%) to 0.9124. The proposed method also outperforms conventional attention variants, including SE and CBAM, by a clear margin. These results demonstrate that the integration of learnable positional encoding and content-space attention mechanisms is more effective in capturing both geometric structures and semantic information in point clouds. In particular, the significant gain in Boundary F1-score further confirms its superiority in preserving fine architectural contours and improving boundary delineation, while avoiding excessive computational overhead. Therefore, the proposed design better addresses the limitations of traditional attention mechanisms in 3D point cloud processing.

Table 6 reports the computational cost of SPVCNN augmented with different attention modules, including training time, inference time, and VRAM usage. As observed, lightweight modules such as SE and CBAM introduce minor overhead, while more complex modules like LFA and PT increase training time and memory consumption noticeably. Notably, the proposed GSA-CSRA module achieves the highest segmentation performance (Acc: 0.9416, IoU: 0.9656, Boundary F1-score: 0.9124) while maintaining moderate computational requirements: its training time (3.2 h) is lower than that of PT (4.3 h) and LFA (3.8 h), inference time (7 min) is among the lowest, and VRAM usage (23 GB) remains reasonable compared with other modules. These results indicate that the GSA-CSRA mechanism effectively improves both global feature representation and boundary delineation without introducing excessive computational cost, supporting its practical applicability in large-scale LiDAR building extraction. Complementing the numerical results, Figure 12 provides qualitative comparisons of building extraction maps generated by SPVCNN with different attention modules.

As demonstrated in Figure 12, the proposed GSA-CSRA method shows clear advantages in building point cloud extraction. Compared with other attention modules, such as SE and CBAM, it produces fewer false negatives and fewer false positives in visually inspected complex structural regions. More importantly, it yields more complete and sharper segmentation results near building edges. These visual observations are consistent with the quantitative results in Table 5, particularly the improvements in Acc, IoU, and Boundary F1-score, further supporting the effectiveness of the collaborative GSA-CSRA mechanism for improving both overall extraction accuracy and boundary delineation quality.

5. Discussion

5.1. Interpretation of the Main Results

Extracting buildings from large-scale airborne LiDAR data presents inherent challenges, such as irregular point distribution and intricate architectural details. Rather than restating these challenges, we focus here on how the proposed GSA-CSRA framework addresses them. By combining geometric-aware voxel feature encoding with cross-dimensional attention in the point branch, the network effectively captures both global semantics and fine-grained boundary details. This design leads to significant performance gains: the baseline SPVCNN achieves an Acc of 0.8212, IoU of 0.8660, and Boundary F1-score of 0.7825, whereas our method reaches 0.9416, 0.9656, and 0.9124, respectively, reflecting substantial improvements in building/non-building discrimination, region overlap, and boundary fidelity.

The qualitative comparisons further support this interpretation. Visual results indicate that the proposed method substantially reduces false negatives and keeps false positives relatively limited to a small number of structurally complex junctions, while also improving boundary fidelity. This observation is consistent with the quantitative gains in Acc, IoU, and Boundary F1-score, suggesting that the framework improves not only global segmentation performance but also boundary delineation quality in challenging urban LiDAR scenes.

5.2. Mechanistic Discussion

5.2.1. GSA Improves Topology-Preserving Feature Encoding

The Geometric Self-Attention (GSA) module strengthens voxel feature encoding by incorporating voxel-centered geometric priors. This allows the network to distinguish between flat regions and high-curvature areas such as corners, window frames, and roof ridges. Analysis of intermediate feature maps shows that GSA effectively preserves topological structures, resulting in improved segmentation consistency along building edges.

Analysis of the ablation results indicates that the GSA module significantly enhances topology-preserving feature encoding. In particular, intermediate feature maps show improved delineation of roof ridges, façade edges, and window/door frames, which translates into higher boundary fidelity and reduced local misclassifications. Compared with the baseline without GSA, Acc and IoU gains of 2–3% demonstrate the module’s effectiveness in capturing global structural semantics and geometry-sensitive details.

5.2.2. CSRA Strengthens Local Structural Modeling in the Point Branch

The Cross-Space Residual Attention (CSRA) module in the point branch explicitly models the relationships between sampled points and their local neighborhoods. By weighting feature aggregation based on geometric context, CSRA compensates for information loss caused by farthest point sampling (FPS) and enhances local-to-global feature propagation. This contributes to a reduction in missed detections and an increase in semantic coherence, particularly in cluttered or narrow structural regions.

The CSRA module in the point branch further strengthens local structural modeling. By associating sampled points with neighboring geometric distributions through attention, CSRA improves feature propagation from local to global levels. The ablation study shows that CSRA reduces false negatives in narrow or complex architectural regions and improves semantic coherence, contributing an additional 2–3% increase in Acc and IoU beyond the GSA-enhanced network.

5.2.3. GSA-CSRA Collaborative Fusion Resolves Voxel–Point Misalignment

The combined GSA-CSRA mechanism addresses misalignment between voxel and point features. Experimental analysis shows that deformable attention alignment and adaptive gating successfully reconcile discrepancies between coarse voxel semantics and fine-grained point geometry. In practice, this leads to more accurate building boundaries in planar and complex areas alike, outperforming conventional single-branch attention modules.

5.2.4. BEM Improves Edge Accuracy

The BEM leverages 3D Sobel gradients to enhance boundary features before fusion into the final prediction. Evaluation of false positives and false negatives indicates a clear reduction in edge errors, with high-frequency boundary details preserved even in densely overlapping building clusters. This module is particularly effective in mitigating boundary smoothing inherent to standard convolutional decoding, improving the precision of architectural edge extraction.
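To make the role of the 3D Sobel operator concrete, the sketch below computes a gradient magnitude on a small dense grid. It is an illustration under simplifying assumptions: the BEM operates on learned features of a sparse voxel network, whereas this example uses a dense scalar grid stored as nested lists, and the function name `sobel3d_magnitude` is hypothetical.

```python
import math

SMOOTH = (1, 2, 1)   # smoothing taps applied along the two non-derivative axes
DERIV = (-1, 0, 1)   # central-difference taps applied along the derivative axis

def sobel3d_magnitude(grid):
    """Sobel gradient magnitude of a dense 3D grid; border voxels are left at 0."""
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    mag = [[[0.0] * nz for _ in range(ny)] for _ in range(nx)]
    for x in range(1, nx - 1):
        for y in range(1, ny - 1):
            for z in range(1, nz - 1):
                gx = gy = gz = 0.0
                # Accumulate the three separable 3x3x3 Sobel responses.
                for i in range(3):
                    for j in range(3):
                        for k in range(3):
                            v = grid[x + i - 1][y + j - 1][z + k - 1]
                            gx += DERIV[i] * SMOOTH[j] * SMOOTH[k] * v
                            gy += SMOOTH[i] * DERIV[j] * SMOOTH[k] * v
                            gz += SMOOTH[i] * SMOOTH[j] * DERIV[k] * v
                mag[x][y][z] = math.sqrt(gx * gx + gy * gy + gz * gz)
    return mag

# Toy occupancy grid: empty for x < 3, occupied for x >= 3 (a planar "wall").
grid = [[[1.0 if x >= 3 else 0.0 for _ in range(5)] for _ in range(5)] for x in range(5)]
edges = sobel3d_magnitude(grid)
```

The response is nonzero only where occupancy changes between neighbouring voxels, which is the property that allows a gradient-based module to emphasize boundary features before fusion.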

5.2.5. LaserMix++ Augmentation Enhances Cross-Scene Robustness

LaserMix++ cross-scene augmentation introduces block replacement, constrained rotation, and non-uniform scaling to enrich training data diversity. Experiments show that this augmentation improves generalization across different neighborhoods and flight lines, especially in regions with variable point density. Performance gains in Acc and IoU confirm that the method enhances network robustness without compromising geometric detail preservation.
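Two of the three augmentation components can be sketched as follows. The example assumes that the ground normal coincides with the z-axis, so that the constrained rotation reduces to a rotation about the vertical; the function name and the scale factors are hypothetical, and cross-scene block replacement is not shown.

```python
import math

def rotate_about_z_and_scale(points, angle_rad, scale=(1.0, 1.0, 1.0)):
    """Rotate points about the vertical axis, then scale each axis independently.

    Assumes the ground normal is (0, 0, 1); heights are changed only by the z scale.
    """
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    sx, sy, sz = scale
    out = []
    for x, y, z in points:
        xr = c * x - s * y   # rotation in the horizontal plane
        yr = s * x + c * y
        out.append((sx * xr, sy * yr, sz * z))
    return out

# A roof corner rotated by 90 degrees and stretched more in x than in z.
augmented = rotate_about_z_and_scale([(1.0, 0.0, 2.0)], math.pi / 2, scale=(2.0, 1.0, 0.5))
```

Constraining the rotation to the vertical axis keeps walls vertical and roofs above the ground, so the augmented samples remain physically plausible buildings.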

5.3. Comparison with Prior Studies in a Broader Context

Previous airborne LiDAR building extraction research has spanned (i) geometry/feature-based segmentation, (ii) deep learning frameworks, (iii) multimodal fusion, and (iv) graph-structured modeling. As summarized in the Introduction, classical approaches can be effective for regular planes but often suffer from strong parameter dependence and poor adaptability to complex scenes and uneven density, leading to contour fractures or detail loss.

In contrast to many deep learning baselines, the proposed framework achieves a large improvement over mainstream point cloud segmentation networks on the benchmark used here. For example, Cylinder3D and MinkResNet remain far below the proposed approach in both Acc and IoU. A reasonable interpretation is that these gains do not come from ‘more capacity’ alone, but from introducing task-aligned inductive bias. These include: (a) topology-preserving geometric attention, (b) cross-dimensional feature interaction, and (c) explicit boundary enhancement—three elements tightly connected to the known failure modes in urban LiDAR building extraction.

5.4. Practical Implications

The dataset covers a large urban area (177 km² Washington, D.C.) with moderate density (~6.16 pts/m²) and multiple semantic categories. Achieving high IoU and Acc at this scale implies the method is promising for operational applications such as citywide building footprint/3D building stock monitoring, urban planning, and disaster assessment, where boundary quality and missed detections are often more consequential than small gains in per-point accuracy.

6. Conclusions

To address the core challenges of semantic ambiguity, uneven point density, and limited adaptability to complex structures in large-scale urban LiDAR point cloud building extraction, this study proposes a novel framework that integrates geometric topology perception with a cross-dimensional attention mechanism.

First, we present an enhanced LaserMix++ multi-scale hybrid augmentation algorithm that combines cross-scene point cloud block replacement with probability-driven sampling, ground normal–constrained rotation matrices, and non-uniform scaling strategies. This approach significantly enriches data diversity while preserving the topological integrity of building structures.

Second, we embed—for the first time—a collaborative mechanism of GSA and CSRA into the SPVCNN dual-branch architecture. Topology-preserving encoding of geometric features is achieved through dynamic voxel granularity adjustment and the GSA module, while the channel–spatial dual-path CSRA module enables effective cross-dimensional interaction across multi-scale features.

Third, we introduce a BEM to resolve segmentation ambiguities in densely overlapping structures and mitigate boundary blurring.

Evaluated on 177 km² of airborne LiDAR data from Washington, D.C., the proposed GSA-CSRA framework achieves an accuracy of 94.16% (+12.04 percentage points) and an IoU of 96.56% (+9.96 percentage points) compared to the baseline SPVCNN (Acc = 0.8212, IoU = 0.8660). It substantially outperforms conventional attention variants (e.g., SE, CBAM) and surpasses mainstream point cloud networks by more than 40 percentage points in accuracy, for example by 52.27 points over Cylinder3D (Acc = 0.4189) and by 40.88 points over MinkResNet (Acc = 0.5328), demonstrating the effectiveness of synergistically combining geometric perception with adaptive attention for building extraction.

To address current limitations, future work will focus on four key directions:

(1) Extending the framework to full-scene semantic segmentation of all urban elements;

(2) Integrating semi-supervised learning to reduce reliance on manual annotation;

(3) Exploring unsupervised model construction following semi-supervised pretraining to ultimately eliminate the need for labeled data;

(4) Validating and extending the method across diverse geographic regions, building types, and urban environments to ensure broader applicability and robustness, as the current study is based on a single urban area (Washington, D.C.).

Author Contributions

Conceptualization, B.X. and Y.S.; methodology, B.X.; software, B.X.; validation, P.A.; formal analysis, H.L.; investigation, S.L.; resources, L.G.; data curation, S.L.; writing—original draft preparation, B.X.; writing—review and editing, Y.S.; visualization, B.X.; supervision, Y.S.; project administration, Y.S.; funding acquisition, Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the following grants: Guizhou Provincial Basic Research Program (Natural Science) (Qian Ke He Basic-No.MS [2025] 239); High-Level Innovative Talents in Guizhou Province (BKRCH [2024] No. 7); Bijie Scientist Workstation Project (BKHPT [2025] No. 2); Karst Plateau Resources and Environment Remote Sensing Talent Team (BWRLT [2023] No. 14); Intelligent Geospatial Information Application Engineering Center (BKLH [2023] No. 8).

Data Availability Statement

The data supporting the findings of this study are openly available in the District of Columbia-Classified Point Cloud LiDAR dataset, hosted on the Open Data on AWS registry. The dataset can be accessed at: https://registry.opendata.aws/dc-lidar (accessed on 25 December 2025). No new data were generated in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Wu, C.; Chen, X.; Jin, T.; Hua, X.; Liu, W.; Liu, J.; Cao, Y.; Zhao, B.; Jiang, Y.; Hong, Q. UAV building point cloud contour extraction based on the feature recognition of adjacent points distribution. Measurement 2024, 230, 114519.
Park, Y.; Guldmann, J.M. Creating 3D city models with building footprints and LIDAR point cloud classification: A machine learning approach. Comput. Environ. Urban Syst. 2019, 75, 76–89.
Cao, R.; Zhang, Y.; Liu, X.; Zhao, Z. Roof plane extraction from airborne lidar point clouds. Int. J. Remote Sens. 2017, 38, 3684–3703.
Gilani, S.A.N.; Awrangjeb, M.; Lu, G. Segmentation of Airborne Point Cloud Data for Automatic Building Roof Extraction. GIScience Remote Sens. 2017, 55, 63–89.
Rottensteiner, F.; Briese, C. A new method for building extraction in urban areas from high-resolution LiDAR data. Int. Soc. Photogramm. Remote Sens. 2017, 34, 82.
Karsli, B.; Yilmazturk, F.; Bahadir, M.; Karsli, F.; Ozdemir, E. Automatic building footprint extraction from photogrammetric and LiDAR point clouds using a novel improved-Octree approach. J. Build. Eng. 2024, 82, 108281.
Liu, M.; Shao, Y.; Li, R.; Wang, Y.; Sun, X.; Wang, J.; You, Y. Method for extraction of airborne LiDAR point cloud buildings based on segmentation. PLoS ONE 2020, 15, e0232778.
Guo, L.; Deng, X.; Liu, Y.; He, H.; Lin, H.; Qiu, G.; Yang, W. Extraction of dense urban buildings from photogrammetric and LiDAR point clouds. IEEE Access 2021, 9, 111823–111832.
Hu, H.; Tan, Q.; Kang, R.; Wu, Y.; Liu, H.; Wang, B. Building extraction from oblique photogrammetry point clouds based on PointNet++ with attention mechanism. Photogramm. Rec. 2024, 39, 141–156.
Yang, H.; Xu, S.; Xu, S. A self-supervised pretraining framework for context-aware building edge extraction from 3-D point clouds. In IEEE Geoscience and Remote Sensing Letters; IEEE: New York, NY, USA, 2025; Volume 22, pp. 1–5.
Kong, G.; Fan, H.; Lobaccaro, G. Automatic building outline extraction from ALS point cloud data using generative adversarial network. Geocarto Int. 2022, 37, 15964–15981.
Tian, Z.; Fang, Y.; Fang, X.; Ma, Y.; Li, H. A Large-Scale Building Unsupervised Extraction Method Leveraging Airborne LiDAR Point Clouds and Remote Sensing Images Based on a Dual P-Snake Model. Sensors 2024, 24, 7503.
Wang, Y.; Ma, Y.; Zhu, A.-X.; Zhao, H.; Liao, L. Accurate facade feature extraction method for buildings from three-dimensional point cloud data considering structural information. ISPRS J. Photogramm. Remote Sens. 2018, 139, 146–153.
Jiang, T.; Wang, Y.; Zhang, Z.; Liu, S.; Dai, L.; Yang, Y.; Jin, X.; Zeng, W. Extracting 3-D structural lines of building from ALS point clouds using graph neural network embedded with corner information. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–28.
Kong, L.; Xu, X.; Ren, J.; Zhang, W.; Pan, L.; Chen, K.; Ooi, W.T.; Liu, Z. Multi-modal data-efficient 3D scene understanding for autonomous driving. Comput. Vis. Pattern Recognit. 2025, 47, 3748–3765.
Tang, H.; Liu, Z.; Zhao, S.; Lin, Y.; Lin, J.; Wang, H.; Han, S. Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution. In Proceedings of the Computer Vision and Pattern Recognition; Springer International Publishing: Cham, Switzerland, 2020; Volume 2007, p. 16100v2.
Shen, Z.; Bello, I.; Vemulapalli, R.; Jia, X.; Chen, C.-H. Global self-attention networks for image recognition. arXiv 2020, arXiv:2010.03019.
Zhu, K.; Wu, J. Residual attention: A simple but effective method for multi-label recognition. arXiv 2021, arXiv:2108.02456.
Zhang, L.; Sun, X.; Li, Z.; Kong, D.; Liu, J.; Ni, P. Boundary enhancement-driven accurate semantic segmentation networks for unmanned surface vessels in complex marine environments. IEEE Sens. J. 2024, 24, 24972–24987.
District of Columbia—Classified Point Cloud LiDAR. Available online: https://registry.opendata.aws/dc-lidar (accessed on 11 July 2024).
Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. arXiv 2018, arXiv:1709.01507.
Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. Proc. Eur. Conf. Comput. Vis. 2018, 11211, 3–19.
Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. RandLA-Net: Efficient semantic segmentation of large-scale point clouds. Comput. Vis. Pattern Recognit. 2020, 1911, 11108–11117.
Zhao, H.; Jiang, L.; Jia, J.; Torr, P.H.; Koltun, V. Point transformer. Comput. Vis. Pattern Recognit. 2021, 2012, 16259–16268.
Zhu, X.; Zhou, H.; Wang, T.; Hong, F.; Ma, Y.; Li, W.; Li, H.; Lin, D. Cylindrical and asymmetrical 3D convolution networks for LiDAR segmentation. arXiv 2020, arXiv:2011.10033.
Xu, M.; Ding, R.; Zhao, H.; Qi, X. PAConv: Position adaptive convolution with dynamic kernel assembling on point clouds. Comput. Vis. Pattern Recognit. 2021, 00318, 3173–3182.
Choy, C.; Gwak, J.Y.; Savarese, S. 4D spatio-temporal ConvNets: Minkowski convolutional neural networks. Proc. Comput. Vis. Pattern Recognit. 2019, 1904, 3075–3084.
Radosavovic, I.; Kosaraju, R.P.; Girshick, R.; He, K.; Dollár, P. Designing Network Design Spaces. arXiv 2020, arXiv:2003.13678.

Figure 1. Overview of the study area and the airborne LiDAR dataset.

Figure 2. Point cloud preprocessing pipeline.

Figure 3. Improved SPVCNN.

Figure 4. GSA module.

Figure 5. CSRA module.

Figure 6. SE module: (a) normal framework and (b) +KNN framework.

Figure 7. CBAM: (a) normal framework and (b) +KNN framework.

Figure 8. LFA module.

Figure 9. PT module.

Figure 10. Training loss curves of the compared networks.

Figure 11. Test results of different deep learning networks. (a) Cylinder3D, (b) PointNet++, (c) PAConv, (d) MinkResNet, (e) RegNet, (f) Ours.

Figure 12. Test results of SPVCNN combined with different attention modules: (a) test dataset labels, (b) +GSA-CSRA (ours), (c) +PT, (d) SPVCNN, (e) +SE, (f) +SE(+KNN), (g) +CBAM, (h) +CBAM(+KNN), and (i) +LFA.

Table 1. Parameter interpretation in accuracy evaluation.

Points Label	Actual Building Points	Actual Non-Building Points
Predicted as building points	TP	FP
Predicted as non-building points	FN	TN

Table 2. Hyperparameter settings for the experiments.

Parameter	Value
Batch Size	4
Epochs	250
Optimizer	AdamW
Initial Learning Rate	0.008
Weight Decay	0.01
Learning Rate Scheduler	Dynamic Adjustment

Table 3. Evaluation of other deep learning networks in experimental data benchmark.

Method	Cylinder3D	PointNet++	PAConv	MinkResNet	RegNet	Ours
Acc	0.4189	0.3562	0.4873	0.5328	0.4027	0.9416
IoU	0.6348	0.6021	0.7116	0.7424	0.6919	0.9656
Boundary F1-score	0.6057	0.5783	0.6921	0.7224	0.6685	0.9124

Table 4. Comparison of training and testing times and VRAM usage for different methods.

	Cylinder3D	PointNet++	PAConv	MinkResNet	RegNet	Ours
Training time (h)	5.2	4.8	6.0	3.5	4.2	3.2
Test time (min)	10	9	11	6	8	7
VRAM usage (GB)	21	20	22	23.2	21.5	23

Table 5. Comparison of different modules in the experimental data benchmark.

Method	SPVCNN	+SE	+SE(+KNN)	+CBAM	+CBAM(+KNN)	+LFA	+PT	+GSA-CSRA (Ours)
Acc	0.8212	0.8204	0.8254	0.8237	0.8180	0.8166	0.8254	0.9416
IoU	0.8660	0.8659	0.8671	0.8667	0.8654	0.8650	0.8697	0.9656
Boundary F1-score	0.7825	0.8012	0.8128	0.8043	0.7937	0.7871	0.8115	0.9124

Table 6. Comparison of training and testing times and VRAM usage for different modules.

Method	SPVCNN	+SE	+SE(+KNN)	+CBAM	+CBAM(+KNN)	+LFA	+PT	+GSA-CSRA (Ours)
Training time (h)	3.1	3.3	3.5	3.4	3.7	3.8	4.3	3.2
Test time (min)	7	7.5	8	7.7	8.2	7.5	8	7
VRAM usage (GB)	23	23.5	24	24	24	24	24	23

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Xue, B.; Song, Y.; Ai, P.; Li, H.; Liu, S.; Guo, L. Large-Scale Airborne LiDAR Point Cloud Building Extraction Based on Improved Voxelized Deep Learning Network. Buildings 2026, 16, 1450. https://doi.org/10.3390/buildings16071450

