1. Introduction
With accelerating urbanization in the United States, annual new residential and commercial construction is projected to reach 3.25–3.76 billion square feet by 2026, characterized by a rise in compact single-family homes and mixed-use complexes [1]. This rapid expansion underscores the increasing complexity of 3D urban spatial structures and poses unprecedented challenges for dynamic building stock monitoring. High-precision 3D building spatial data serve as a foundational element for smart city development, critical for decision-making in urban planning, disaster management, and emerging applications such as digital twins and carbon neutrality assessments [2]. However, traditional methods relying on satellite or UAV imagery face inherent limitations, including illumination constraints, shadow interference, and vegetation occlusion, which hinder rapid, high-precision 3D structure acquisition at large scales [3]. In contrast, airborne LiDAR technology has emerged as a transformative solution, leveraging its unique capacity for vegetation penetration and efficient large-area 3D point cloud acquisition [4]. Despite its advantages, airborne LiDAR continues to encounter significant technical challenges in real-world applications, particularly semantic ambiguity in complex urban environments and irregular point cloud density distribution [5]. Consequently, developing efficient and precise methods for urban LiDAR point cloud building extraction remains a critical research priority.
Current building extraction methods using airborne LiDAR data can be categorized into four major approaches: geometry and feature-based analysis methods, deep learning-based methods, multi-source data fusion methods, and graph-structured methods.
Geometry and feature-based analysis methods primarily leverage intrinsic point cloud features (such as normal vectors, curvature, and density) combined with traditional techniques (e.g., region growing, morphological filtering, and Delaunay triangulation) to achieve building segmentation. For instance, regarding neighborhood distribution and projection optimization, Wu et al. [1] proposed an entropy model to determine optimal neighborhood sizes, constructing initial projection planes via Principal Component Analysis (PCA) to extract boundary points. Similarly, Park et al. [2] employed multi-scale supervoxel segmentation integrated with azimuthal distribution models to enhance contour accuracy using neighborhood geometric features. In terms of plane segmentation and regularization, Cao et al. [3] and Gilani et al. [4] utilized normal vector clustering combined with Delaunay triangulation and region growing to extract and regularize roof planes effectively. Alternatively, morphological and elevation-based methods focus on elevation differences; Rottensteiner et al. [5] and Karsli et al. [6] demonstrated this by generating building masks using DSM-DTM differences and octree structures to separate ground points and regularize contours. Furthermore, histogram- and threshold-based methods emphasize feature statistics for category separation. Liu et al. [7] and Guo et al. [8] applied normal vector histograms and multi-scale differential analysis (e.g., Difference of Normals) to distinguish buildings from vegetation. Despite their robustness in regular scenarios, these methods rely heavily on manually preset parameters (e.g., neighborhood radius, curvature thresholds) and exhibit limited adaptability to complex geometries, such as curved roofs or overlapping structures. Consequently, they are prone to mis-segmentation and contour fracture under conditions of uneven density or noise. Therefore, overcoming these parameter limitations to develop more adaptive approaches for complex urban scenes remains a critical challenge.
Deep learning-based methods primarily leverage existing point cloud semantic segmentation frameworks to extract building instances. For instance, Hu et al. [9] enhanced PointNet++ by integrating an attention mechanism to better capture discriminative building features. To mitigate data scarcity, Yang et al. [10] proposed a self-supervised learning (SSL) framework to improve edge point recognition without relying on extensive labeled data. Similarly, Kong et al. [11] employed a generative adversarial network (GAN) to refine building contours derived from binary segmentation masks. Despite their success in certain scenarios, these approaches often suffer from three key limitations: strong dependence on high-quality annotated data, high computational complexity, and degraded performance on small-scale structures or regions with low point density.
Multi-source data fusion methods integrate LiDAR point clouds with optical imagery to leverage complementary geometric and spectral information. For instance, Tian et al. [12] proposed a double P-Snake model that fuses LiDAR-derived elevation cues with image texture to refine building contours. While such approaches achieve high accuracy in conventional urban scenes, they often suffer from significant missed detections in regions with sparse point cloud coverage. To address this, Wang et al. [13] mapped 2D image-derived edge lines into 3D space, integrating them with RANSAC-based optimization to suppress false segmentation. However, these methods remain susceptible to interference from high-density urban elements, such as billboards and vegetation, which introduce spurious edges and degrade contour fidelity.
Graph-structured methods analyze building geometry through topological modeling. For example, Jiang et al. [14] proposed a cascaded framework in which corner points are first detected and their interconnections subsequently modeled using a graph neural network (GNN). While this approach achieves high precision for regular buildings, it exhibits significant errors in modeling curved roofs, as initial corner localization inaccuracies propagate through the reconstruction pipeline. Although the method demonstrates robustness in complex scenarios, such as handling facade openings, its computational complexity grows exponentially with point cloud size. This renders it impractical for large-scale urban scenes or structures with irregular geometries.
In this work, we propose a novel framework that integrates geometric topology perception with cross-dimensional attention to address key challenges in large-scale LiDAR building extraction. Our main contributions are:
(1) A multi-scale hybrid augmentation strategy (LaserMix++) that enriches training data diversity while preserving building geometry [15];
(2) A dual-branch SPVCNN architecture embedding a collaborative Geometric Self-Attention (GSA) and Cross-Space Residual Attention (CSRA) mechanism for topology-preserving feature encoding and cross-dimensional interaction [16,17,18];
(3) A Boundary Enhancement Module (BEM) to improve boundary delineation in overlapping or complex building structures [19].
This paper is organized into six sections. Section 1 introduces the research background, reviews the state of the art, and states the contributions of this work. Section 2 describes the experimental data in detail, including dataset composition and collection sources. Section 3 presents the core methodology, detailing the preprocessing pipeline, the proposed model, its theoretical basis, architecture design, and implementation specifics. Section 4 focuses on experimental verification, comprehensively evaluating the algorithm’s performance through comparative experiments. Section 5 provides an in-depth discussion of the results, analyzing implications and limitations. Finally, Section 6 summarizes the key findings and outlines potential directions for future improvement.
3. Method
The proposed framework employs SPVCNN as its core backbone, achieving precise building point cloud extraction through a three-stage design. First, we propose an enhanced LaserMix++ multi-scale hybrid augmentation strategy, which employs cross-scene point cloud block replacement with probability-driven sampling, coupled with ground normal–constrained rotation matrices and non-uniform scaling. This strategy significantly enhances data diversity while preserving the topological relationships of building structures. Second, a collaborative mechanism of GSA and CSRA is embedded within the SPVCNN dual-branch architecture. Topological preservation encoding of building geometric features is achieved via dynamic voxel granularity adjustment and the GSA module, while the channel–space dual-path CSRA module is designed to establish cross-dimensional interaction of multi-scale features. Finally, we introduce a BEM to effectively resolve segmentation challenges in highly overlapping structures and mitigate boundary ambiguity issues. The specific implementation details and comparative experiments validating the method are presented in the following sections.
3.1. Point Cloud Preprocessing
The original point cloud exhibits complete spatial coverage, consistent intensity values, and minimal outlier noise. To further enhance the diversity and representativeness of the training data, preprocessing includes cross-scene semantic fusion and spatial distribution optimization. An improved LaserMix++ data augmentation strategy is applied, as illustrated in
Figure 2. LaserMix++ performs cross-scene block replacement, constrained rotations, and non-uniform scaling, while largely preserving local geometric structures of buildings. This approach increases the variability of training samples and supports more effective feature learning. Although extremely sparse regions or atypical building types may still pose challenges for fully capturing structural variations, subsequent stages of the network are designed to compensate for such cases, ensuring reliable extraction performance.
3.1.1. Cross-Scene Semantic Fusion
First, the original point cloud, in which each point carries an intensity value I in addition to its coordinates, is partitioned into cubic blocks of edge length s. Then, an original scene and a target scene are randomly selected, and their blocks are exchanged by probabilistic replacement, where a is the mixing ratio, r_k ∼ U(0, 1), and s_k is the randomly selected spatial scale. In this work, the mixing ratio a is set to 0.5, and the augmentation is applied to each batch with a probability of 0.5. The augmentation is performed on the raw point clouds before voxelization, ensuring that the structural and geometric relationships are preserved during the subsequent voxelization stage.
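The block replacement described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names are hypothetical, a single edge length `s` is used instead of a randomly selected scale, and a cube is taken from the target scene whenever its draw `r_k` falls below the mixing ratio `a`, which is one plausible reading of the replacement rule.

```python
import numpy as np

def block_ids(points, s):
    """Integer cube index (ix, iy, iz) of every point for edge length s."""
    return np.floor(points[:, :3] / s).astype(np.int64)

def mix_scenes(src, tgt, s, a=0.5, rng=None):
    """Mix two (N, 4) clouds [x, y, z, I] cube by cube.

    Assumption: a cube is taken from `tgt` when r_k < a, else from `src`.
    """
    rng = np.random.default_rng() if rng is None else rng
    src_ids, tgt_ids = block_ids(src, s), block_ids(tgt, s)
    # Every cube occupied in either scene gets one draw r_k ~ U(0, 1).
    cubes = np.unique(np.vstack([src_ids, tgt_ids]), axis=0)
    swap = cubes[rng.random(len(cubes)) < a]

    def in_swap(ids):
        # True where a point's cube belongs to the swapped set.
        return (ids[:, None, :] == swap[None, :, :]).all(-1).any(-1)

    # Keep source points outside swapped cubes, add target points inside them.
    return np.vstack([src[~in_swap(src_ids)], tgt[in_swap(tgt_ids)]])
```

Because the mixing operates on raw points, the result can be passed to voxelization unchanged. The pairwise cube comparison is quadratic in memory and would need a hash-based lookup for full scenes.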
3.1.2. Spatial Distribution Optimization
First, a rotation matrix R is applied to the point cloud obtained in the previous section, constrained by the ground normal vector. Here, ϕ ∼ U(0, 2π), and R_xy aligns the ground normal vector with the vertical direction while preserving the random orientation of the ground in the xoy plane. Then, a non-uniform scaling strategy is adopted to maintain the building structure and increase the robustness of training, using independent scaling coefficients in the X and Y directions.
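A minimal sketch of this step, assuming the rotation is composed of a Rodrigues rotation that brings the ground normal onto the z axis followed by a random yaw ϕ about z. The function names and the scaling range `(0.9, 1.1)` are hypothetical choices for illustration and are not specified in the text.

```python
import numpy as np

def align_to_z(n):
    """Rodrigues rotation that maps the ground normal n onto (0, 0, 1)."""
    n = np.asarray(n, dtype=float)
    n = n / np.linalg.norm(n)
    z = np.array([0.0, 0.0, 1.0])
    v, c = np.cross(n, z), float(n @ z)
    if np.isclose(c, -1.0):
        # Normal points straight down: rotate by pi about the x axis.
        return np.diag([1.0, -1.0, -1.0])
    K = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])
    return np.eye(3) + K + K @ K / (1.0 + c)

def rot_z(phi):
    """Rotation by angle phi about the vertical axis."""
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def augment(xyz, ground_normal, scale_range=(0.9, 1.1), rng=None):
    """Level the ground, apply a random yaw, then scale X and Y independently."""
    rng = np.random.default_rng() if rng is None else rng
    phi = rng.uniform(0.0, 2.0 * np.pi)            # phi ~ U(0, 2*pi)
    R = rot_z(phi) @ align_to_z(ground_normal)
    sx, sy = rng.uniform(*scale_range, size=2)     # assumed scaling range
    return (xyz @ R.T) * np.array([sx, sy, 1.0])
```

Leaving the vertical axis unscaled keeps building heights intact, which is one way to preserve structure while still varying footprints.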
3.2. Improved SPVCNN
This study proposes an improved SPVCNN model, whose core innovation lies in organically integrating GSA and CSRA into a dual-branch feature learning framework. As shown in
Figure 3, the overall architecture includes a five-stage progressive feature encoding and decoding process, with specific improvement steps as follows:
3.2.1. Dual Branch Feature Encoder
Voxelization is a fundamental technique that discretizes 3D space into uniform volumetric units (voxels). Originating from medical imaging in the 1980s (e.g., CT/MRI 3D reconstruction), this concept was adapted for organ tissue modeling via voxel grids. With advancements in computer vision, voxelization was introduced to 3D point cloud processing in the early 21st century, transforming disordered points into structured formats to accommodate traditional convolutional operations (e.g., Voxel Grid filtering). After 2015, its adoption surged with the rise of deep learning architectures such as VoxNet and SPVCNN. By balancing geometric precision with computational efficiency, voxelization has significantly enhanced performance in 3D object detection and segmentation tasks.
However, as LiDAR scanning density increases, higher voxel resolutions are required to maintain precision, leading to expanded convolution sizes and excessive GPU memory consumption. For instance, in PVCNN, a voxel-based network structure running on a single GPU (24 GB memory) supports a limited voxel capacity (e.g., approximately 1.6 million voxels for a 100 m × 100 m × 10 m scene at 0.8 m × 0.8 m × 0.1 m resolution). In such scenarios, smaller structures occupy fewer voxel grids, making it difficult to extract discriminative features and resulting in reduced performance. Furthermore, the computational time for standard voxel convolution via sliding windows often exceeds practical limits (e.g., 35 s per scene).
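The voxel capacity quoted above follows directly from the scene extent and the voxel resolution. The short check below reproduces the arithmetic; the variable names are illustrative only.

```python
import math

scene_m = (100.0, 100.0, 10.0)   # scene extent in metres (x, y, z)
voxel_m = (0.8, 0.8, 0.1)        # voxel resolution in metres (x, y, z)

# Number of voxels along each axis, then the dense grid size.
counts = [round(extent / res) for extent, res in zip(scene_m, voxel_m)]
total = math.prod(counts)

print(counts, total)  # [125, 125, 100] 1562500
```

The dense grid therefore holds about 1.56 million cells, which matches the figure of approximately 1.6 million voxels.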
To address these challenges, researchers have proposed sparse convolution. Sparse convolution matrices are optimized for sparse data (such as 3D point clouds with low non-zero voxel ratios), computing only within effective activation regions. By utilizing hash tables or coordinate mappings to skip zero-value areas, this approach significantly improves computational efficiency. TorchSparse, a PyTorch extension library designed for sparse 3D data, enables efficient sparse convolution and pooling operations while supporting dynamic sparse tensor storage and GPU acceleration. It has been widely applied in point cloud segmentation and detection tasks. In this study, we adopt the tensor format of TorchSparse to represent voxels, which not only ensures model training with lower memory requirements but also maintains high resolution, allowing the system to capture detailed information under limited computational conditions.
The point cloud of each scene preprocessed in the previous section is taken as input for multi-scale voxelization, in which a resolution parameter controls the voxel size. To avoid the redundant computation or information loss caused by a fixed resolution, we adopt a dynamic voxel granularity that adjusts voxel size based on local point cloud density and geometric complexity. This approach reduces computational load in flat areas (e.g., walls) by using coarse-grained voxels while preserving structural details in intricate regions (e.g., window frames), balancing computational efficiency and accuracy.
The voxel size is adjusted according to the local entropy of the point cloud, weighted by a hyperparameter β that controls the influence of local entropy on voxel size; in our experiments, β is set to 0.5. For the set of points in a local voxel neighborhood, the local entropy is computed as the Shannon entropy
$$\mathrm{Entropy} = -\sum_{i} p_i \log p_i$$
where p_i is the normalized probability of the i-th outcome of a local feature distribution (e.g., density of points or voxel occupancy). This metric captures the geometric complexity of the local region: higher entropy indicates more irregular or complex local structures. In general, the resolution ranges from approximately 0.25 to 2 times the average point spacing; for the experimental dataset used in this work, the voxel resolutions vary between 0.1 m and 0.8 m. Resolutions that are too coarse can cause small or narrow structures to be missed and result in a loss of geometric detail, while resolutions that are too fine can lead to excessive computational cost and memory consumption without significant improvement in feature extraction.
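The idea of entropy-driven voxel sizing can be illustrated as follows. The Shannon entropy of a histogram is standard, but the mapping from entropy to voxel size shown here (exponential decay with β, clipped to the 0.1 m to 0.8 m range) is only an assumed form for illustration; the function names and the bin count are likewise hypothetical.

```python
import numpy as np

def local_entropy(values, bins=8):
    """Shannon entropy (in nats) of a histogram of a local feature."""
    hist, _ = np.histogram(values, bins=bins)
    p = hist[hist > 0] / hist.sum()
    return float(abs((p * np.log(p)).sum()))

def voxel_size(values, base=0.8, beta=0.5, lo=0.1, hi=0.8, bins=8):
    """Assumed rule: shrink the voxel as local entropy grows, then clip."""
    h = local_entropy(values, bins) / np.log(bins)   # normalized to [0, 1]
    return float(np.clip(base * np.exp(-beta * h), lo, hi))
```

With this rule a flat wall (one occupied histogram bin, zero entropy) keeps the coarsest voxel, while a region whose feature values spread evenly over all bins receives a finer one.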
GSA. Building point clouds usually have regular structural features (such as flat walls, sharp edges) and complex geometric distribution (such as door and window details). Traditional voxel convolution (such as MinkUNet) tends to lose these geometric details due to regular grid processing. The GSA module employs voxel-centered coordinates as geometric priors to dynamically adjust feature aggregation weights, assuming that local voxel blocks exhibit a regular geometric structure that can guide attention weighting. This mechanism enables smooth processing of flat wall areas and sharpening of edge/corner regions, thereby preserving and enhancing geometry-sensitive feature representations in residual connections. Dynamic voxelization adjusts voxel sizes based on local point density and structural complexity to balance computation and detail preservation. This design significantly improves the network’s recognition accuracy for architectural structures (such as facade continuity and roof contours), particularly mitigating resolution loss and geometric blurring caused by voxelization in large-scale scenarios, though it may be less effective for highly irregular structures or extremely low-density voxel regions. The specific improvements include inserting the GSA module after the original MinkUNet convolutional layer to achieve ensemble structure enhancement:
where C_v is the voxel center coordinate, ⊕ represents the residual connection, and σ is the nonlinear activation function (ReLU). The specific GSA architecture is shown in Figure 4.
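One way to read the GSA update is as residual self-attention in which voxel-center distances bias the attention weights. The sketch below illustrates only that reading: the function name, the squared-distance bias, and the temperature `tau` are hypothetical and are not taken from the architecture in Figure 4.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def geometric_self_attention(F, C, tau=1.0):
    """F: (N, d) voxel features; C: (N, 3) voxel-center coordinates."""
    d = F.shape[1]
    logits = F @ F.T / np.sqrt(d)                        # content similarity
    dist2 = ((C[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    logits = logits - dist2 / tau                        # assumed geometric prior
    A = softmax(logits, axis=-1)                         # rows sum to 1
    return F + np.maximum(A @ F, 0.0)                    # residual plus ReLU branch
```

Nearby voxels receive larger weights, so flat regions are averaged smoothly, while the residual path keeps the original feature available at edges and corners.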
CSRA. The core advantage of inserting the CSRA module into the point cloud branch is that it significantly improves the modeling capability of the network on the local structure of building point cloud by dynamically combining the geometric context information of sampling point coordinates, assuming that neighboring points provide reliable geometric context. Traditional point cloud processing (such as simple MLP or KNN) tends to ignore the spatial dependence between points, while CSRA explicitly associates sampling points with their neighbors’ geometric distribution through an attention mechanism and adaptively weights feature aggregation. While effective in dense and moderately structured areas, CSRA may have reduced effectiveness in extremely sparse or noisy regions where neighborhood information is insufficient. This design not only makes up for the loss of local information caused by farthest point sampling (FPS), but also realizes the understanding of building structure from local to global through hierarchical feature propagation (multi-level CSRA), and finally improves the boundary accuracy and semantic consistency of segmentation or reconstruction tasks. The specific improvements are as follows: Construct a hierarchical structure with FPS and MLP, and insert CSRA after each level of sampling:
Here, the coordinate of the current sampling point provides the geometric input to the attention. The specific CSRA architecture is shown in Figure 5.
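Farthest point sampling, on which the hierarchy above relies, is a standard greedy procedure and can be sketched as follows. The function name and the `start` argument are illustrative; this is not the implementation used in the experiments.

```python
import numpy as np

def farthest_point_sampling(xyz, m, start=0):
    """Return indices of m points chosen greedily to be mutually far apart."""
    chosen = np.empty(m, dtype=np.int64)
    chosen[0] = start
    # Squared distance from every point to its nearest already-chosen point.
    min_d2 = ((xyz - xyz[start]) ** 2).sum(axis=1)
    for i in range(1, m):
        chosen[i] = int(np.argmax(min_d2))           # farthest remaining point
        d2 = ((xyz - xyz[chosen[i]]) ** 2).sum(axis=1)
        min_d2 = np.minimum(min_d2, d2)
    return chosen
```

Because each step keeps only the point farthest from the current sample set, dense clusters are thinned aggressively, which is the local information loss that neighborhood attention is meant to offset.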
3.2.2. GSA-CSRA Fusion Mechanism
The GSA-CSRA collaborative attention mechanism addresses the geometric and semantic misalignment between point cloud and voxel features. Building point clouds require both precise geometric details (e.g., edges and surfaces) and high-level semantic understanding (e.g., wall and roof categories), which a single feature branch struggles to handle simultaneously. This mechanism explicitly correlates the spatial relationships between these representations through deformable attention: it maps voxel grid semantic features to point cloud geometric positions while enabling adaptive gating fusion to balance their contributions. In simple structural areas (e.g., flat walls), it prioritizes stable voxel semantic features, whereas in complex areas (e.g., decorative components), it enhances point cloud geometric details. This design significantly improves boundary accuracy and semantic coherence in building point cloud segmentation, demonstrating particular adaptability to intricate architectural structures.
Geometric-semantic feature alignment. The features of the two branches are associated through a deformable attention mechanism. In this mechanism, a position encoding based on voxel-point offsets corrects the spatial misalignment between voxels and the real point cloud; the query vector is derived from a linear projection of the voxel-branch feature; the key vector is obtained from a linear projection of the point-branch feature; the scaling factor is determined by the dimension of the feature vector; and T denotes the transpose operation.
Dual-path feature fusion. The adaptively weighted features are obtained through a gating fusion strategy, in which the gating coefficient γ is dynamically generated from the regional complexity.
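The alignment and gating steps can be illustrated with scaled dot-product attention followed by a convex combination of the two branches. Everything specific in this sketch is an assumption: the position term is added to the attention logits, the weights are normalized over voxels for each point, and the gate is a sigmoid of a scalar complexity score with hypothetical parameters `k` and `c0`.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def align_voxels_to_points(Fv, Fp, Wq, Wk, pos):
    """Fv: (M, d) voxel features; Fp: (N, d) point features; pos: (M, N) bias."""
    Q, K = Fv @ Wq, Fp @ Wk                        # queries from voxels, keys from points
    logits = Q @ K.T / np.sqrt(Q.shape[1]) + pos   # scaled dot product plus position term
    A = softmax(logits, axis=0)                    # assumed: normalize over voxels per point
    return A.T @ Fv                                # (N, d) voxel semantics at point positions

def gate(complexity, k=4.0, c0=0.5):
    """Hypothetical gate: sigmoid of regional complexity, values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-k * (np.asarray(complexity, dtype=float) - c0)))

def fuse(Fp, Fv_aligned, gamma):
    """Weight point features by gamma and aligned voxel features by 1 - gamma."""
    g = np.asarray(gamma, dtype=float)[:, None]
    return g * Fp + (1.0 - g) * Fv_aligned
```

Under this convention a high complexity score pushes γ toward 1 and favors point geometry, while simple regions fall back on the voxel semantics.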
3.3. Decoder Design with BEM
The introduction of the BEM in architectural point cloud segmentation primarily addresses the blurring issue of point cloud or voxel features at object boundaries. Edges in architectural point clouds, such as wall corners and door/window outlines, typically contain critical geometric structural information. However, traditional convolution operations often weaken boundary features due to local smoothing effects. The BEM explicitly extracts high-frequency gradient information from predicted feature maps using the 3D Sobel operator and integrates it into the final prediction through weighted fusion, particularly for architectural regions, assuming that building boundaries exhibit sufficient gradient information. This design significantly enhances boundary segmentation sharpness while avoiding noise introduction to non-architectural areas like vegetation, although false boundary responses may still occur in areas with dense vegetation, occlusions, or other non-building structures, which is mitigated by combining the responses with the predicted mask. The formulas are shown in Equations (13) and (14), introducing BEM into the final prediction layer:
In these equations, the boundary response map is weighted by a coefficient λ, which may be learnable or preset; in this work, λ is set to 0.2 to balance the contribution of the boundary response in the final feature fusion. The binary mask of the building area is the predicted mask obtained from the network output at the current stage, which ensures that the BEM extracts high-frequency boundary information from the predicted building regions. The fused output is the final prediction result.
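A minimal sketch of the boundary enhancement step on a dense prediction volume is given below. The separable 3D Sobel filter is standard, but the fusion form (prediction plus λ times the masked gradient magnitude), the 0.5 mask threshold, and the function names are assumptions made for illustration.

```python
import numpy as np

def _filter1d(vol, kernel, axis):
    """Correlate vol with a length-3 kernel along one axis (edge padding)."""
    pad = [(0, 0)] * vol.ndim
    pad[axis] = (1, 1)
    p = np.pad(vol, pad, mode="edge")
    n = vol.shape[axis]
    parts = [np.take(p, np.arange(k, k + n), axis=axis) for k in range(3)]
    return kernel[0] * parts[0] + kernel[1] * parts[1] + kernel[2] * parts[2]

def sobel3d_magnitude(vol):
    """Gradient magnitude from the three separable 3D Sobel filters."""
    smooth, deriv = (1.0, 2.0, 1.0), (-1.0, 0.0, 1.0)
    mag2 = np.zeros(vol.shape, dtype=float)
    for ax in range(3):
        g = vol.astype(float)
        for a in range(3):
            # Differentiate along `ax`, smooth along the other two axes.
            g = _filter1d(g, deriv if a == ax else smooth, a)
        mag2 += g ** 2
    return np.sqrt(mag2)

def boundary_enhance(pred, lam=0.2, thresh=0.5):
    """Assumed fusion: pred + lam * (Sobel magnitude masked by predicted buildings)."""
    mask = (pred > thresh).astype(float)   # predicted building mask
    return pred + lam * sobel3d_magnitude(pred) * mask
```

Masking with the predicted building region keeps gradient responses from vegetation or other non-building voxels out of the fused output.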
3.4. Comparison Attention Modules
To demonstrate the effectiveness of our proposed collaborative attention mechanism, we compare our method with classical attention modules, including Squeeze-and-Excitation (SE) [21], Convolutional Block Attention Module (CBAM) [22], Local Feature Aggregation (LFA) [23], and Pyramid Transformer (PT) [24]. Each attention mechanism is introduced below.
The SE module (as shown in
Figure 6), a classic channel attention mechanism, enhances feature representation by explicitly modeling inter-channel dependencies. It first applies global average pooling to compress spatial dimensions and generate channel statistics. Two fully connected layers then learn nonlinear relationships between channels, ultimately producing channel weights through Sigmoid activation. These weights are multiplied with original features channel-by-channel to achieve feature reweighting. The lightweight design of the SE module allows flexible integration into various convolutional networks, though its limitation lies in focusing solely on channel dimensions while neglecting spatial relationships.
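The squeeze, excitation, and reweighting steps described above reduce to a few array operations. The sketch below uses plain NumPy with weight matrices passed in explicitly; the function name and shapes are illustrative, and biases are omitted for brevity.

```python
import numpy as np

def squeeze_excite(x, W1, W2):
    """x: (C, H, W) feature map; W1: (C // r, C); W2: (C, C // r)."""
    z = x.mean(axis=(1, 2))                  # squeeze: global average pooling
    h = np.maximum(W1 @ z, 0.0)              # first fully connected layer + ReLU
    w = 1.0 / (1.0 + np.exp(-(W2 @ h)))      # second layer + sigmoid channel weights
    return x * w[:, None, None]              # channel-wise reweighting
```

Every spatial location in a channel is scaled by the same weight, which is why the module captures inter-channel dependencies but not spatial relationships.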
Building upon the Squeeze-and-Excitation (SE) architecture, CBAM introduces a dual attention mechanism that integrates channel and spatial components (Figure 7). The channel attention branch extends SE’s squeeze step with parallel average-pooling and max-pooling paths, while the spatial attention branch captures contextual information through channel-wise pooling followed by a 7 × 7 convolution. This cascaded design enables CBAM to model feature relationships more comprehensively, with the large convolutional kernel expanding the receptive field and thereby enhancing spatial attention effectiveness. However, its global convolutional processing for spatial attention lacks the fine-grained positional awareness inherent in GSA’s row–column separated position embedding strategy.
The LFA module (as shown in
Figure 8) is specifically designed for irregular data, enhancing local geometric perception through explicit neighborhood search and feature aggregation. It requires coordinate information to construct a K-nearest neighbor graph, then performs neighborhood feature fusion based on distance or learnable weights. This hard attention mechanism based on physical space distance contrasts sharply with the soft attention implemented by GSA through learnable embeddings. The former relies more on precise coordinate input, while the latter implicitly learns positional relationships through parameterized methods.
The PT module (as shown in
Figure 9) extends the self-attention mechanism to a multi-scale feature pyramid, capturing rich contextual information through cross-level feature interactions. Its core lies in constructing a multi-scale feature map pyramid and establishing key-query interaction paths across different scales. While this design enhances global modeling capabilities, it significantly increases computational complexity. In contrast, GSA achieves precise location awareness at a lower cost through a single-scale design that synergizes content and positional attention mechanisms.
3.5. Accuracy Assessments
To verify the performance of the proposed method, three evaluation metrics are adopted: Accuracy (Acc), Intersection over Union (IoU), and Boundary F1-score. Acc measures the overall correctness of point-wise classification, IoU evaluates the overlap between predicted building regions and ground-truth annotations, and Boundary F1-score further assesses the quality of predicted building contours.
The formulas for Acc and IoU are given in Equations (15) and (16), respectively:
$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN} \tag{15}$$
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN} \tag{16}$$
The definitions of TP, FP, FN, and TN are given in Table 1. In this work, Acc and IoU are computed specifically for the binary building vs. non-building case, rather than as an average over multiple semantic classes.
In addition, to better evaluate the quality of predicted building contours, the Boundary F1-score is introduced as a complementary boundary-sensitive metric:
$$F1_b = \frac{2 P_b R_b}{P_b + R_b}$$
where
$$P_b = \frac{TP_b}{TP_b + FP_b}, \qquad R_b = \frac{TP_b}{TP_b + FN_b}$$
Here, TP_b, FP_b, and FN_b represent true positive, false positive, and false negative points located on the boundary region of the building. The boundary region is defined as a narrow band around the true building edges, with a width of one voxel.
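The three metrics can be computed from boolean per-point labels as sketched below, using the standard binary definitions. The function names are illustrative, the boundary band is assumed to be supplied as a precomputed boolean mask, and no guard against empty classes is included.

```python
import numpy as np

def confusion(pred, gt):
    """Return (TP, FP, FN, TN) for boolean building / non-building labels."""
    pred, gt = np.asarray(pred, dtype=bool), np.asarray(gt, dtype=bool)
    tp = int((pred & gt).sum())
    fp = int((pred & ~gt).sum())
    fn = int((~pred & gt).sum())
    tn = int((~pred & ~gt).sum())
    return tp, fp, fn, tn

def acc_iou(pred, gt):
    """Overall accuracy and building IoU."""
    tp, fp, fn, tn = confusion(pred, gt)
    return (tp + tn) / (tp + fp + fn + tn), tp / (tp + fp + fn)

def boundary_f1(pred, gt, boundary):
    """F1-score restricted to points inside the boundary band."""
    b = np.asarray(boundary, dtype=bool)
    tp, fp, fn, _ = confusion(np.asarray(pred, dtype=bool)[b],
                              np.asarray(gt, dtype=bool)[b])
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, with `pred = [1, 1, 0, 0, 1, 0]` and `gt = [1, 0, 0, 1, 1, 0]`, the counts are TP = 2, FP = 1, FN = 1, TN = 2, giving Acc = 4/6 and IoU = 0.5.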
5. Discussion
5.1. Interpretation of the Main Results
Extracting buildings from large-scale airborne LiDAR data presents inherent challenges, such as irregular point distribution and intricate architectural details. Rather than restating these challenges, we focus here on how the proposed GSA-CSRA framework addresses them. By combining geometric-aware voxel feature encoding with cross-dimensional attention in the point branch, the network effectively captures both global semantics and fine-grained boundary details. This design leads to significant performance gains: the baseline SPVCNN achieves an Acc of 0.8212, IoU of 0.8660, and Boundary F1-score of 0.7825, whereas our method reaches 0.9416, 0.9656, and 0.9124, respectively, reflecting substantial improvements in building/non-building discrimination, region overlap, and boundary fidelity.
The qualitative comparisons further support this interpretation. Visual results indicate that the proposed method substantially reduces false negatives and confines false positives to a small number of structurally complex junctions, while also improving boundary fidelity. This observation is consistent with the quantitative gains in Acc, IoU, and Boundary F1-score, suggesting that the framework improves not only global segmentation performance but also boundary delineation quality in challenging urban LiDAR scenes.
5.2. Mechanistic Discussion
5.2.1. GSA Improves Topology-Preserving Feature Encoding
The Geometric Self-Attention (GSA) module strengthens voxel feature encoding by incorporating voxel-centered geometric priors. This allows the network to distinguish between flat regions and high-curvature areas such as corners, window frames, and roof ridges. Analysis of intermediate feature maps shows that GSA effectively preserves topological structures, resulting in improved segmentation consistency along building edges.
Analysis of the ablation results indicates that the GSA module significantly enhances topology-preserving feature encoding. In particular, intermediate feature maps show improved delineation of roof ridges, façade edges, and window/door frames, which translates into higher boundary fidelity and reduced local misclassifications. Compared with the baseline without GSA, Acc and IoU gains of 2–3% demonstrate the module’s effectiveness in capturing global structural semantics and geometry-sensitive details.
5.2.2. CSRA Strengthens Local Structural Modeling in the Point Branch
The Cross-Space Residual Attention (CSRA) module in the point branch explicitly models the relationships between sampled points and their local neighborhoods. By weighting feature aggregation based on geometric context, CSRA compensates for information loss caused by farthest point sampling (FPS) and enhances local-to-global feature propagation. This contributes to a reduction in missed detections and an increase in semantic coherence, particularly in cluttered or narrow structural regions.
The CSRA module in the point branch further strengthens local structural modeling. By associating sampled points with neighboring geometric distributions through attention, CSRA improves feature propagation from local to global levels. The ablation study shows that CSRA reduces false negatives in narrow or complex architectural regions and improves semantic coherence, contributing an additional 2–3% increase in Acc and IoU beyond the GSA-enhanced network.
5.2.3. GSA-CSRA Collaborative Fusion Resolves Voxel–Point Misalignment
The combined GSA-CSRA mechanism addresses misalignment between voxel and point features. Experimental analysis shows that deformable attention alignment and adaptive gating successfully reconcile discrepancies between coarse voxel semantics and fine-grained point geometry. In practice, this leads to more accurate building boundaries in planar and complex areas alike, outperforming conventional single-branch attention modules.
5.2.4. BEM Improves Edge Accuracy
The BEM leverages 3D Sobel gradients to enhance boundary features before fusion into the final prediction. Evaluation of false positives and false negatives indicates a clear reduction in edge errors, with high-frequency boundary details preserved even in densely overlapping building clusters. This module is particularly effective in mitigating boundary smoothing inherent to standard convolutional decoding, improving the precision of architectural edge extraction.
5.2.5. LaserMix++ Augmentation Enhances Cross-Scene Robustness
LaserMix++ cross-scene augmentation introduces block replacement, constrained rotation, and non-uniform scaling to enrich training data diversity. Experiments show that this augmentation improves generalization across different neighborhoods and flight lines, especially in regions with variable point density. Performance gains in Acc and IoU confirm that the method enhances network robustness without compromising geometric detail preservation.
5.3. Comparison with Prior Studies in a Broader Context
Previous airborne LiDAR building extraction research has spanned (i) geometry/feature-based segmentation, (ii) deep learning frameworks, (iii) multimodal fusion, and (iv) graph-structured modeling. As summarized in Section 1, classical approaches can be effective for regular planes but often suffer from strong parameter dependence and poor adaptability to complex scenes and uneven density, leading to contour fractures or detail loss.
In contrast to many deep learning baselines, the proposed framework achieves a large improvement over mainstream point cloud segmentation networks on the benchmark used here. For example, Cylinder3D and MinkResNet remain far below the proposed approach in both Acc and IoU. A reasonable interpretation is that these gains do not come from ‘more capacity’ alone, but from introducing task-aligned inductive bias. These include: (a) topology-preserving geometric attention, (b) cross-dimensional feature interaction, and (c) explicit boundary enhancement—three elements tightly connected to the known failure modes in urban LiDAR building extraction.
5.4. Practical Implications
The dataset covers a large urban area (177 km² of Washington, D.C.) with moderate density (~6.16 pts/m²) and multiple semantic categories. Achieving high IoU and Acc at this scale implies the method is promising for operational applications such as citywide building footprint/3D building stock monitoring, urban planning, and disaster assessment, where boundary quality and missed detections are often more consequential than small gains in per-point accuracy.
6. Conclusions
To address the core challenges of semantic ambiguity, uneven point density, and limited adaptability to complex structures in large-scale urban LiDAR point cloud building extraction, this study proposes a novel framework that integrates geometric topology perception with a cross-dimensional attention mechanism.
First, we present an enhanced LaserMix++ multi-scale hybrid augmentation algorithm that combines cross-scene point cloud block replacement with probability-driven sampling, ground normal–constrained rotation matrices, and non-uniform scaling strategies. This approach significantly enriches data diversity while preserving the topological integrity of building structures.
Second, we embed—for the first time—a collaborative mechanism of GSA and CSRA into the SPVCNN dual-branch architecture. Topology-preserving encoding of geometric features is achieved through dynamic voxel granularity adjustment and the GSA module, while the channel–spatial dual-path CSRA module enables effective cross-dimensional interaction across multi-scale features.
Third, we introduce a BEM to resolve segmentation ambiguities in densely overlapping structures and mitigate boundary blurring.
Evaluated on 177 km² of airborne LiDAR data from Washington, D.C., the proposed GSA-CSRA framework achieves an accuracy of 94.16% (+12.04 percentage points) and an IoU of 96.56% (+9.96 percentage points) compared to the baseline SPVCNN (Acc = 0.8212, IoU = 0.8660). It substantially outperforms conventional attention variants (e.g., SE, CBAM) and surpasses mainstream point cloud networks by roughly 41 to 52 percentage points in accuracy, e.g., MinkResNet (Acc = 0.5328) and Cylinder3D (Acc = 0.4189), demonstrating the effectiveness of synergistically combining geometric perception with adaptive attention for building extraction.
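The percentage-point differences follow from the reported scores by simple subtraction, as the short check below shows. It uses only the values quoted in this paper; the variable names are illustrative.

```python
ours = {"Acc": 0.9416, "IoU": 0.9656}
baseline = {"Acc": 0.8212, "IoU": 0.8660}      # SPVCNN
others_acc = {"Cylinder3D": 0.4189, "MinkResNet": 0.5328}

# Gains over the baseline, in percentage points.
gains = {k: round((ours[k] - baseline[k]) * 100, 2) for k in ours}
# Accuracy margins over the other networks, in percentage points.
margins = {k: round((ours["Acc"] - v) * 100, 2) for k, v in others_acc.items()}

print(gains)    # {'Acc': 12.04, 'IoU': 9.96}
print(margins)  # {'Cylinder3D': 52.27, 'MinkResNet': 40.88}
```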
To address current limitations, future work will focus on four key directions:
(1) Extending the framework to full-scene semantic segmentation of all urban elements;
(2) Integrating semi-supervised learning to reduce reliance on manual annotation;
(3) Exploring unsupervised model construction following semi-supervised pretraining to ultimately eliminate the need for labeled data;
(4) Validating and extending the method across diverse geographic regions, building types, and urban environments to ensure broader applicability and robustness, as the current study is based on a single urban area (Washington, D.C.).