Article

Large-Scale Point Cloud Semantic Segmentation with Density-Based Grid Decimation

1 School of Resources and Environmental Engineering, Wuhan University of Technology, Wuhan 430070, China
2 The State Key Laboratory of Space-Ground Integrated Information Technology, Space Star Technology Co., Ltd., Beijing 100095, China
3 Changjiang Schinta Software Technology Co., Ltd., Wuhan 430010, China
4 The National Engineering Research Center of Geographic Information System, China University of Geosciences, Wuhan 430074, China
* Author to whom correspondence should be addressed.
ISPRS Int. J. Geo-Inf. 2025, 14(7), 279; https://doi.org/10.3390/ijgi14070279
Submission received: 6 May 2025 / Revised: 6 July 2025 / Accepted: 15 July 2025 / Published: 17 July 2025

Abstract

Accurate segmentation of point clouds into categories such as roads, buildings, and trees is critical for applications in 3D reconstruction and autonomous driving. However, large-scale point cloud segmentation encounters challenges such as uneven density distribution, inefficient sampling, and limited feature extraction capabilities. To address these issues, this paper proposes RT-Net, a novel framework that incorporates a density-based grid decimation algorithm for efficient preprocessing of outdoor point clouds. The proposed framework helps alleviate the problem of uneven density distribution and improves computational efficiency. RT-Net also introduces two modules: Local Attention Aggregation, which extracts local detailed features of points using an attention mechanism, enhancing the model’s recognition ability for small-sized objects; and Attention Residual, which integrates local details of point clouds with global features by an attention mechanism to improve the model’s generalization ability. Experimental results on the Toronto3D, Semantic3D, and SemanticKITTI datasets demonstrate the superiority of RT-Net for small-sized object segmentation, achieving state-of-the-art mean Intersection over Union (mIoU) scores of 86.79% on Toronto3D and 79.88% on Semantic3D.

1. Introduction

Point clouds, serving as a fundamental geospatial data structure [1], play an important role in 3D perception and understanding. Point cloud semantic segmentation refers to the process of assigning a semantic label to each point in a 3D point cloud. It is also called point cloud classification in photogrammetry and remote sensing [2]. Accurate semantic segmentation of point clouds into distinct entities like roads, buildings, and trees is vital for applications such as 3D reconstruction or autonomous driving. Traditional techniques for point cloud semantic segmentation include region growth-based segmentation [3], model fitting-based segmentation [4], graph optimization-based segmentation [5], and edge information-based segmentation [6]. While these methods are reliable and widely used in commercial applications, they often encounter challenges when processing extensive point cloud data.
The past several years have witnessed significant advancements in semantic segmentation algorithms rooted in deep learning, offering promising solutions for handling extensive Earth observation data. Notable developments include PointNet [1], PointNet++ [7], PointCNN [8], DGCNN [9], PointRNN [10], and Point Transformer [11]. However, the scope of many such methods is confined to smaller point cloud datasets, a limitation partly attributed to their dependence on time-consuming or inefficient point sampling methods. For example, using the farthest point sampling (FPS) algorithm to sample 10% of the points from one million points takes over 200 s [7]. In contrast, RandLA-Net, introduced by [12], incorporates the Random Sampling (RS) algorithm, which exhibits remarkable efficiency, completing the same sampling task in just 0.004 s. Subsequently, many works [13,14,15] have adopted the RS algorithm in the down-sampling stages of their networks. This substantial decrease in sampling time highlights the potential of the RS algorithm in optimizing semantic segmentation networks for large-scale outdoor point cloud data.
Large-scale outdoor point cloud datasets often contain millions or even billions of points, making it computationally infeasible to process them in their entirety. Therefore, preprocessing before network training is crucial to reduce the computational burden by selecting representative subsets of points for analysis. While grid-based decimation is a commonly employed preprocessing technique, it fails to address the uneven distribution of object categories within large-scale point clouds, due to its use of a fixed grid size. This limitation often leads to unsatisfactory segmentation results, especially for small-sized objects. To address variations in point cloud density across different outdoor scenes, we propose a density-based grid decimation algorithm that dynamically adjusts grid size based on the density of input point clouds.
Accurate semantic segmentation of point clouds also depends on robust feature extraction from their complex structures. This involves developing architectures that can effectively down-sample large-scale point clouds while preserving important spatial structures and semantic information. Recent studies, such as those by [11,16,17], have shown that Transformer-based models excel in point cloud semantic segmentation, highlighting their remarkable feature recognition and extraction capabilities. However, these Transformer-based methods can result in significant computational overhead and require considerable memory resources, especially when processing large point batches. Thus, developing a network architecture that balances feature extraction capabilities with GPU constraints remains a key focus.
To address the challenges of uneven point density, imbalanced sampling, and limited feature extraction capabilities in large-scale outdoor point clouds, we propose a novel framework, RT-Net, that leverages a density-based grid decimation algorithm as its foundation. This decimation algorithm dynamically adjusts grid sizes based on point cloud density, ensuring balanced representation of object categories and improving computational efficiency. Further, RT-Net incorporates attention-based modules to enhance feature extraction from both local and global structures. These innovations enable RT-Net to achieve superior segmentation performance, particularly for small-sized object categories, while maintaining computational feasibility on modern GPUs. The main contributions of this paper are as follows:
  • Density-Based Grid Decimation Algorithm: A novel preprocessing method that dynamically adjusts grid sizes based on point density, addressing imbalanced sampling compared to traditional grid-based approaches.
  • Attention-Based Modules: Two new modules—Local Attention Aggregation (LAA) and Attention Residual (AR)—are designed to efficiently capture both local and global features, reducing memory consumption and computational overhead.

2. Related Works

2.1. Deep Learning Methods for Extracting Point Cloud Features

Deep learning-based methods for extracting features from point clouds are mainly categorized into projection-based, voxel-based, and point-based strategies.
Projection-based methods convert 3D point clouds into multiple 2D views to leverage established 2D convolutional neural networks. To capture the contextual 3D spatial relationships from 2D views, Landrieu et al. proposed a deep learning framework based on a superpoint graph and graph convolutional network [15]. To capture the geometric information of the original point cloud, Boulch et al. proposed a method combining depth view fusion with 3D back-projection for point cloud labeling [18]; however, this approach is relatively slow. Lang et al. proposed PointPillars [19], which partially accelerates inference. Yang et al. proposed PIXOR, a real-time 3D object detection method for autonomous driving that represents the scene as a bird’s-eye view and uses a proposal-free, single-stage detector [20]. To minimize point overlaps, Lyu et al. revealed local geometric features by projecting the point cloud onto an ellipsoidal space instead of a planar space [21]. Despite these attempts, such methods inevitably lose geometric details.
Voxel-based approaches involve converting point clouds into 3D voxel grids, allowing 3D convolutions to be applied for feature extraction. To achieve end-to-end semantic segmentation, Tchapmi et al. proposed SEGCloud, which integrates neural networks, trilinear interpolation, and fully connected Conditional Random Fields (FC-CRF) [22]. Zhou and Tuzel proposed another end-to-end 3D detection network named VoxelNet, which unifies feature extraction and bounding box prediction [23]. To handle sparsely distributed points, Meng et al. proposed a point cloud segmentation algorithm that converts unstructured point clouds into regular voxel grids [24]; however, this incurs a significant performance overhead. Liu et al. [25] and Chen et al. [26] proposed two 3D deep learning models that reduce memory consumption and improve computational efficiency to some extent by combining point cloud representation with voxel convolution. Nevertheless, such hybrid representations still introduce data redundancy and additional computational demands, making them less suitable for processing large-scale outdoor point cloud datasets.
In contrast, point-based approaches have become increasingly popular for their adaptability to point cloud structures and flexibility for adjustment, as evidenced by PointNet [1], PointNet++ [7], and PointCNN [8]. To improve the fusion of geometric features and semantic information across different receptive fields, Li et al. proposed a multi-scale voxel-point adaptive fusion network (MVPNet) for semantic segmentation of point clouds in urban scenes [27]. To address the challenge of object recognition in complex geometric structures, Liu et al. proposed DG-Net, which models long-range dependencies in point clouds to some extent [13]. Zeng et al. proposed LACV-Net, which addresses the problems of local perceptual ambiguity and global feature capture in large-scale point cloud semantic segmentation [14]. Park et al. proposed PCSCNet to partially improve the semantic segmentation performance of LiDAR point clouds at both high and low resolutions [28]. To accommodate hardware with limited computational resources, RandLA-Net [29], a lightweight neural network architecture, was proposed for semantic segmentation of large-scale 3D point clouds. Fan and Yang proposed PointRNN and its variants, PointGRU and PointLSTM, for predicting moving point cloud sequences [30]. Xu et al. proposed NeiEA-NET, a simple and effective point cloud semantic segmentation network that optimizes local neighborhoods in 3D Euclidean space and fully exploits the high-dimensional feature space [31].
The success of Transformer models in both natural language processing and computer vision has spurred their increasing adoption in point cloud semantic segmentation. Point Transformer, introduced by Zhao et al. [11], is a pioneering study exploring the Transformer architecture for point cloud semantic segmentation. However, as the model goes deeper, its feature extraction tends to overfit to the majority point categories. Recognizing these limitations, subsequent works [16,17] improved the network by optimizing its feature extraction modules with various techniques. PReFormer [32] further refined the self-attention computation in the Transformer, improving memory efficiency and accuracy in segmenting point clouds. These innovations underscore the potential of attention mechanisms to balance feature extraction capabilities with computational feasibility, making them promising solutions for the semantic segmentation of extensive point cloud datasets.

2.2. Preprocessing of Point Clouds Before Training

Point cloud preprocessing involves a range of processing steps aimed at preparing point clouds for integration into training networks. Given current hardware constraints, direct training on all points within large-scale point clouds is impractical. Consequently, thinning processing is typically employed as a necessary step. Many preprocessing techniques for vast point cloud datasets avoid heuristic or complex mathematical procedures due to their high memory and time consumption. Instead, spatial region filtering [33] and grid-based decimation [12,15,34,35] are preferred alternatives for preprocessing point clouds. Since spatial region filtering relies heavily on empirical knowledge, grid-based decimation is gaining popularity. However, when utilized in large-scale, complex point cloud scenes, grid-based decimation often results in uneven point cloud distribution. KPFCNN [36] leverages density-based kernels for convolution processing in point cloud segmentation, but it still faces the challenge of introducing redundant data. To address this, we propose using density-based grid decimation to mitigate the uneven distribution of object categories.

3. Methodology

3.1. Network Architecture

Our network’s design follows an encoder-decoder pattern inspired by U-Net (see Figure 1), but with significant modifications tailored for 3D point cloud processing. While traditional U-Net employs a symmetric encoder–decoder design for 2D image segmentation, our approach integrates LAA (see Figure 2) modules and random sampling in the encoder phase to better handle 3D point cloud sparsity and combines up-sampling with AR (see Figure 3) modules in the decoder phase to dynamically fuse multi-scale features. These modifications address key limitations of U-Net in processing irregular 3D data, as further demonstrated by our quantitative results in Section 4.
To be specific, a fully connected (FC) layer is used to process the input point clouds, allowing the extraction of features for each point. During the encoding phase, every encoding layer incorporates an LAA module along with an RS module. The network uses four encoding layers to progressively down-sample the point count in a sequential manner (from N to N/4, then N/16, N/64, and finally N/256), with N denoting the initial point count. Concurrently, the feature dimensions for each point are expanded following the sequence (8 → 32 → 128 → 256 → 512). Here, feature dimension refers to the number of learned attributes per point. This expansion of feature dimensions helps to mitigate the information loss caused by down-sampling, allowing the network to retain and capture more complex patterns and details during the encoding process. The architecture bridges the encoder and decoder with a shared MLP (Multi-Layer Perceptron) featuring a 1 × 1 convolutional kernel, an activation function, and a normalization layer to encapsulate the contextual information of the point clouds. In the decoding phase, each decoder layer consists of an AR module and an interpolation up-sampling (US) module. The shape transformations of the feature channels are reversed compared to the encoder. Finally, a pair of FC layers is employed to produce the semantic label predictions.
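The point-count and feature-dimension schedule described above can be summarized with a minimal Python sketch (illustrative only, not the authors' code; the per-batch point count of 40,960 is borrowed from Section 4.6):

```python
# Shape schedule of RT-Net's encoder (sketch only; the LAA and RS modules themselves are omitted).
N = 40960                                                   # example input point count per batch
point_schedule = [N, N // 4, N // 16, N // 64, N // 256]    # points kept after each random-sampling step
feature_schedule = [8, 32, 128, 256, 512]                   # per-point feature dimensions per encoder layer

for layer, (n_pts, n_feat) in enumerate(zip(point_schedule, feature_schedule)):
    print(f"encoder layer {layer}: {n_pts} points x {n_feat} features")
# The decoder mirrors this schedule in reverse, interpolating features back up to N points.
```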

3.2. Local Attention Aggregation

Encoding local characteristics through nearest-neighbor features is fundamental for local pattern recognition in 3D point cloud deep learning. However, this process can be computationally expensive. To address this issue, we propose the LAA module, which extracts local detailed features of points using an attention mechanism. This module helps improve the model’s ability to recognize small-sized objects and efficiently reduces the resources needed for capturing local details of point clouds.
Figure 2 details the LAA module. It consists of two units: Local Spatial Encoding and Attention Aggregation. The former unit integrates each point’s positional context in relation to its neighbors, which enables the latter unit to consider spatial relationships among points during self-attention operations, rather than relying solely on feature similarity to calculate attention coefficients. The latter unit employs a self-attention mechanism to extract internal features from each block of K-nearest-neighbor points. Unlike the LFA (Local Feature Aggregation) module introduced by Hu et al. [29], we have replaced its Attentive Pooling unit with our Attention Aggregation unit, as shown on the right side of Figure 2. The equations used in this module are as follows:
$p_{rel,k} = p_{coor,k} - p_{coor,i},$
$dist_k = \lVert p_{coor,k} - p_{coor,i} \rVert,$
$r_k = p_{coor,k} \oplus p_{coor,i} \oplus p_{rel,k} \oplus dist_k,$
$f_{pos,k} = \mathrm{Conv}(r_k).$
Here, $\lVert \cdot \rVert$ computes the Euclidean distance, $\oplus$ represents the concatenation operation, and $\mathrm{Conv}$ signifies the convolution operation. In the above equations, $p_{rel,k}$ is the relative position of the k-th neighboring point, with coordinates $p_{coor,k}$, with respect to the center point, with coordinates $p_{coor,i}$, and $dist_k$ is the corresponding absolute distance. The vector $r_k$ concatenates these values and is passed through a convolutional layer to obtain the local spatial encoding $f_{pos,k}$.
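A minimal PyTorch sketch of the Local Spatial Encoding unit is given below. It is not the authors' implementation; the (B, N, K, 3) tensor layout, the batch normalization and ReLU after the 1 × 1 convolution, and the output dimension are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class LocalSpatialEncoding(nn.Module):
    """Sketch of the Local Spatial Encoding unit: builds r_k from neighbor/center coordinates,
    relative positions, and distances, then applies a 1x1 convolution to obtain f_pos,k."""
    def __init__(self, out_dim: int = 8):
        super().__init__()
        # 1x1 convolution over the concatenated 10-d geometric vector (3 + 3 + 3 + 1)
        self.conv = nn.Sequential(nn.Conv2d(10, out_dim, kernel_size=1),
                                  nn.BatchNorm2d(out_dim), nn.ReLU())

    def forward(self, center_xyz, neighbor_xyz):
        # center_xyz:   (B, N, 3)    coordinates of each center point p_i
        # neighbor_xyz: (B, N, K, 3) coordinates of its K nearest neighbors p_k
        center = center_xyz.unsqueeze(2).expand_as(neighbor_xyz)      # broadcast p_i to its K neighbors
        p_rel = neighbor_xyz - center                                  # relative position p_rel,k
        dist = torch.norm(p_rel, dim=-1, keepdim=True)                 # Euclidean distance dist_k
        r_k = torch.cat([neighbor_xyz, center, p_rel, dist], dim=-1)   # concatenation (the ⊕ operation)
        r_k = r_k.permute(0, 3, 1, 2)                                  # (B, 10, N, K) layout for Conv2d
        return self.conv(r_k)                                          # f_pos,k: (B, out_dim, N, K)
```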
The Attention Aggregation unit merges the collection of neighboring features to derive aggregated features for each center point p i . It utilizes a self-attention block to expand the receptive field of each point, capturing long-range contexts for better generalization. First, the collection of neighboring features undergoes standard self-attention operations, producing attention features of p i . These attention features are subsequently merged with the neighboring point features of p i . This operation facilitates a comprehensive capture of local features within point clouds by supplementing them with self-attention features, thus enhancing the overall representation. Finally, the concatenated features are summed and passed through an MLP layer with shared parameters to derive the aggregated features of p i . Through Local Spatial Encoding and Attention Aggregation units, the input point clouds transform into their corresponding aggregated features that effectively encode local contextual information.
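The Attention Aggregation unit can be sketched in the same style. Single-head attention, the (B·N, K, d) grouping of neighbor feature blocks, and the layer sizes below are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AttentionAggregation(nn.Module):
    """Sketch: self-attention over the K neighbor features of each center point, concatenation
    with the original neighbor features, summation over K, then a shared MLP."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=d_in, num_heads=1, batch_first=True)
        self.shared_mlp = nn.Sequential(nn.Linear(2 * d_in, d_out), nn.ReLU())

    def forward(self, neighbor_feats):
        # neighbor_feats: (B*N, K, d_in), one block of K neighbor features per center point
        attn_feats, _ = self.attn(neighbor_feats, neighbor_feats, neighbor_feats)  # standard self-attention
        fused = torch.cat([neighbor_feats, attn_feats], dim=-1)  # supplement local features with attention features
        return self.shared_mlp(fused.sum(dim=1))                  # sum over K, shared MLP -> (B*N, d_out)
```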

3.3. Attention Residual

The core concept of traditional residual networks is the incorporation of skip connections between selected layers, enabling the input to bypass certain layers and directly contribute to the output of subsequent layers. This architectural design helps mitigate issues such as gradient vanishing and explosion, making it easier to train deep networks. However, despite these benefits, such a structure can limit the learning capacity of the model. Specifically, when training data is insufficient, the network becomes prone to overfitting, hindering its ability to generalize. To address this challenge and improve model robustness, we have developed an AR module. This module integrates local details from point clouds with global features using attention mechanisms, thereby enhancing the model’s generalization capabilities.
As depicted in Figure 1 and Figure 3, the AR module receives inputs from both the preceding layer’s output features and the aggregated features generated by the LAA module. First, it conducts up-sampling on the previous layer’s output. The up-sampled features are then fed into the Attention block as the Query input. The Key and Value inputs for the Attention block are extracted from the LAA module’s aggregated features using a 1 × 1 convolutional layer. Subsequently, attention features are computed in the Attention block, and their feature dimensions are restored through concatenation with the up-sampled point features. This procedure is devised to accelerate computations and enhance the precision of point cloud feature extraction. Further elaboration on this topic is provided in the Ablation of AR in Section 4. Finally, the concatenated result is transformed by a shared MLP into attention residual features, which serves as the output of the AR module. The equations used in this module are represented as follows:
$AR = \mathrm{Concatenate}\left( res.,\ \mathrm{Softmax}\!\left( \frac{QK^{T}}{\sqrt{d}} \right) V \right),$
$Q, K, V = XW_q^{d/2},\ XW_k^{d/2},\ XW_v^{d/2}.$
Here, $res.$ denotes the features of the residual layer. $Q$ stands for the query matrix, $K$ for the key matrix, and $V$ for the value matrix. $X$ represents the point set matrix, while $W_q$, $W_k$, $W_v$ are the linear transformation matrices.
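A minimal PyTorch sketch of the AR module is shown below; it is not the authors' code. The tensor layouts, the use of full point-to-point attention (rather than any restricted variant), and the size of the final shared MLP are assumptions made for illustration.

```python
import math
import torch
import torch.nn as nn

class AttentionResidual(nn.Module):
    """Sketch of the AR module: up-sampled decoder features act as the Query, the LAA aggregated
    features supply Key and Value, and the attention output is concatenated with the residual
    (up-sampled) features before a shared MLP."""
    def __init__(self, d: int):
        super().__init__()
        self.w_q = nn.Linear(d, d // 2, bias=False)          # projection of the Query to d/2
        self.w_k = nn.Conv1d(d, d // 2, kernel_size=1)       # 1x1 conv on the LAA features (Key)
        self.w_v = nn.Conv1d(d, d // 2, kernel_size=1)       # 1x1 conv on the LAA features (Value)
        self.shared_mlp = nn.Sequential(nn.Conv1d(d + d // 2, d, kernel_size=1), nn.ReLU())

    def forward(self, up_feats, laa_feats):
        # up_feats:  (B, N, d) up-sampled output of the previous decoder layer (residual branch)
        # laa_feats: (B, d, N) aggregated features from the matching LAA encoder layer
        q = self.w_q(up_feats)                                # (B, N, d/2)
        k = self.w_k(laa_feats).transpose(1, 2)               # (B, N, d/2)
        v = self.w_v(laa_feats).transpose(1, 2)               # (B, N, d/2)
        attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(q.shape[-1]), dim=-1)  # Softmax(QK^T / sqrt(d))
        out = torch.cat([up_feats, attn @ v], dim=-1)         # Concatenate(res., attention features)
        return self.shared_mlp(out.transpose(1, 2))           # (B, d, N) attention residual features
```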

3.4. Density-Based Grid Decimation

Given the extensive volume of points within raw point clouds, directly using the original data for model training is impractical, so data preprocessing becomes a necessary step before training. In the original point cloud set, each point is described by its three-dimensional coordinates. The sequence of steps involved in density-based grid decimation for point clouds is as follows:
  • Compute the upper and lower bounds of the 3D coordinates in the original point cloud set P.
  • Specify the grid size r and calculate the grid count per dimension.
  • Determine the three-dimensional grid index of every point p_i in the point set P from its coordinates.
  • Classify points based on the indices calculated in the previous step. Points sharing the same index are grouped into a grid. Unlike traditional grid-based decimation, where the grid size is fixed, our approach considers the density of points and dynamically adjusts the grid size accordingly. If the point count in a grid exceeds a preset threshold, the grid is subdivided into eight equal-sized sub-grids. This process repeats until the number of points in all grids falls below the preset threshold. Subsequently, each grid randomly retains one point while discarding the rest, completing the density-based grid decimation process and resulting in a sparse point cloud that preserves density features (a code sketch of this procedure follows the list).
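The following Python sketch illustrates this procedure under our reading of the steps above; it is not the authors' implementation, and details such as the octant assignment, boundary handling, and recursion cutoff are assumptions.

```python
import numpy as np

def density_grid_decimation(points: np.ndarray, grid_size: float = 10.0,
                            threshold: int = 16, rng=None) -> np.ndarray:
    """Density-based grid decimation sketch. points: (N, 3) xyz coordinates. Grids holding more
    than `threshold` points are split into 8 octants recursively; each final grid keeps one
    randomly chosen point."""
    rng = np.random.default_rng() if rng is None else rng
    lower = points.min(axis=0)                      # lower bound of the bounding box
    kept = []

    def recurse(idx: np.ndarray, origin: np.ndarray, size: float):
        if idx.size == 0:
            return
        if idx.size <= threshold or size <= 1e-6:   # sparse enough: keep one random point
            kept.append(rng.choice(idx))
            return
        half = size / 2.0                           # subdivide into 8 equal-sized sub-grids
        octant = ((points[idx] - origin) // half).clip(0, 1).astype(int)
        codes = octant[:, 0] * 4 + octant[:, 1] * 2 + octant[:, 2]
        for c in range(8):
            offset = np.array([c // 4, (c // 2) % 2, c % 2]) * half
            recurse(idx[codes == c], origin + offset, half)

    cell = ((points - lower) // grid_size).astype(int)        # initial grid index of every point
    for key in np.unique(cell, axis=0):
        idx = np.where((cell == key).all(axis=1))[0]          # points sharing the same grid index
        recurse(idx, lower + key * grid_size, grid_size)
    return points[np.array(kept)]
```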
The sparse point cloud $subP = \{subp_1, subp_2, subp_3, \ldots, subp_m\}$ (where m, the number of retained points, is determined automatically by the grid decimation) serves as the basis for model training. A probability-based sampling method ensures that the data used for training differs across epochs, which helps the model comprehend more intricate scenes. In the experiments section, the data preprocessing applies density-based grid decimation and probability sampling to both the training and testing datasets. Additionally, for every point in the original point clouds, the model identifies its closest neighbor in the thinned point clouds; this mapping allows predictions on the thinned point clouds to be projected back to the original points for precise performance assessment. Figure 4 shows that, under identical sampling settings, the points sampled using our method are more concentrated and primarily distributed in high-density regions, such as those near roads and buildings. In contrast, conventional grid-based decimation results in more dispersed sampling points.
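The nearest-neighbor mapping back to the full-resolution cloud can be implemented with a KD-tree, as in the toy example below; the array names and sizes are synthetic placeholders rather than values from the paper.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
sub_points = rng.random((1_000, 3))           # thinned point coordinates (M, 3), placeholder
sub_predictions = rng.integers(0, 8, 1_000)   # labels predicted on the thinned cloud, placeholder
original_points = rng.random((50_000, 3))     # original full-resolution coordinates (N, 3), placeholder

tree = KDTree(sub_points)                             # KD-tree over the thinned cloud
_, nearest = tree.query(original_points, k=1)         # closest thinned point for every original point
full_predictions = sub_predictions[nearest.ravel()]   # labels mapped back to the original points
```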

4. Experiments and Results

4.1. Experiment Details

In the following experiments, we used three benchmark datasets: Toronto3D [37], Semantic3D [38], and SemanticKITTI [39], applying the same network architecture to all of them. The initial grid size in the preprocessing phase was set to 10, while the threshold for grid subdivision was set to 16. An Adam optimizer was employed for training with its default parameter settings. To reduce the influence of class imbalance, we opted for a weighted cross-entropy loss, with class weights calculated as the inverse of each class's frequency in the training samples. The model was trained for 100 epochs. We initiated the learning rate at 0.01, with each subsequent epoch’s rate set to 95% of its predecessor. For the KNN algorithm, we utilized the KD-Tree module from the Scikit-learn package, set to find the 16 closest neighbors. All algorithms in this section were implemented using PyTorch 1.13.1 and CUDA 12.4 on Ubuntu 22.04. Our experiments were performed on an NVIDIA GeForce RTX 3090 24G GPU.
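A minimal sketch of this training configuration is given below (weighted cross-entropy, Adam, and the per-epoch learning-rate decay); the tiny placeholder model and the synthetic labels stand in for RT-Net and the real training data.

```python
import numpy as np
import torch
import torch.nn as nn

num_classes = 8
train_labels = np.random.randint(0, num_classes, 100_000)            # placeholder training labels

class_counts = np.bincount(train_labels, minlength=num_classes)      # per-class point frequency
class_weights = torch.tensor(1.0 / (class_counts + 1e-6), dtype=torch.float32)
criterion = nn.CrossEntropyLoss(weight=class_weights)                # weighted cross-entropy loss

model = nn.Linear(8, num_classes)                                    # placeholder network, not RT-Net
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)            # Adam with default betas
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)  # lr *= 0.95 each epoch

for epoch in range(100):
    # stand-in for one training pass over the decimated point clouds
    loss = criterion(model(torch.randn(4, 8)), torch.randint(0, num_classes, (4,)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```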

4.2. Benchmark Assessment of Semantic Segmentation Models

In this section, we assess the performance of our RT-Net architecture on large-scale point clouds and compare its results with those of state-of-the-art semantic segmentation algorithms. For the Toronto3D dataset, following the practice described by Tan et al. [37], we designated the L002 region as the test subset, with the remaining regions forming the training subset. Performance was assessed using the mean Intersection over Union (mIoU) over eight categories, along with the per-category IoU values. For the Semantic3D dataset, following the guidelines set by Thomas et al. [36], we assigned two regions as the test subset, with the rest of the regions allocated for training and validation, and applied the same evaluation metrics. For the SemanticKITTI dataset, sequences 00 to 10 were used as the training set, sequence 08 as the validation set, and sequences 11 to 21 as the test set. During the preprocessing stage, infrequent categories were discarded, resulting in 19 categories being used for training and evaluation.
Table 1 summarizes the quantitative results for the Toronto3D dataset, presenting mIoU and IoU values for each category. The data from the table indicates that RT-Net, integrated with a density-based grid decimation approach for semantic segmentation on the Toronto3D dataset, surpasses current state-of-the-art models in mIoU for small-sized object categories. Among these categories, fences demonstrate the most significant improvement, with a 20.87% increase from 49.42% to 70.29%. Road markings, poles, and utility lines also see substantial improvements, with increases of 17.5%, 10.11%, and 5.67%, respectively. It should be noted that while slight IoU decreases occur in certain categories, the performance drop remains tolerable, particularly for large-sized object categories: roads see a decrease from 94.69% to 92.28%, natural areas from 96.62% to 95.03%, utility lines from 88.06% to 86.97%, and cars from 93.37% to 88.06%, respectively.
It should be noted that our rigorous replication of RandLA-Net’s experiment yielded a lower mIoU value (76.64%) than the authors’ reported figure (81.77%). Despite ensuring consistency in parameter settings with the original study, discrepancies likely arose from variations in experimental conditions. When we applied the density-based grid decimation method to our replication of RandLA-Net, we also observed improved results, with performance increasing from 76.64% to 81.55%. As shown in Figure 5, L001 demonstrates the strong segmentation performance of our method on edge points, while L002 highlights its effectiveness in segmenting pole point clouds. Additionally, L003 and L004 illustrate the method’s ability to accurately identify ground marks.
Under identical experimental conditions, we conducted semantic segmentation experiments on the Semantic3D dataset. The proposed RT-Net outperforms other semantic segmentation models, as shown in Table 2, reaching a state-of-the-art mIoU score of 79.88%. Specifically, our model achieves remarkable results in the segmentation of high vegetation, hardscape, and cars categories, with IoU scores of 90.35%, 60.48%, and 67.55%, respectively. These results highlight our model’s robust segmentation capabilities, particularly in challenging terrains and diverse landscape categories, and indicate its strong generalization potential.
Similar results can be observed in our experiments on SemanticKITTI, as shown in Table 3 and Figure 6. Our model achieves significant breakthroughs in small-sized object categories compared to other models. Specifically, the IoU improves by 9.8% in the “Other-Gro.” category (50.8%), 8.7% in “Fence” (80.8%), and 12.7% in “Traffic sign” (74.1%), while the performance in other categories is maintained.
Our comparative analysis across different datasets indicates that the results can be attributed to two main factors: the robust feature extraction capabilities of the attention modules and the density-based grid decimation method’s proficiency in evenly distributing point cloud categories and preserving fine details.

4.3. Efficiency of Density-Based Grid Decimation

In this section, the efficiency of our density-based grid decimation is assessed against the traditional grid-based method. While larger initial grid sizes preserve adequate point density per grid, excessively large values degrade computational efficiency. We conducted a series of four experiments to evaluate their impact on semantic segmentation performance. All the preprocessing approaches were followed by the full RT-Net semantic segmentation network.
Table 4 demonstrates the advantage of our density-based grid decimation method, which achieves higher mIoU scores than the grid-based method. It is noteworthy that initializing the grid size at 1.0 yielded the best outcome, with an mIoU score of 86.91%. However, we ultimately opted for density-based grid decimation starting at a grid size of 10. This choice was driven by the high computational cost of the former setting, which required roughly 120 times longer processing (19.152 s versus 0.161 s) for only a marginal enhancement in mIoU.
It is worth mentioning that our preprocessing strategy is not limited to RT-Net; it is versatile and can be applied to other networks. This adaptability is evident from successful experiments conducted with RandLA-Net, as indicated in Table 1. The integration of density-based grid decimation into the RandLA-Net methodology also resulted in notable enhancements, particularly for small-sized object categories such as road markings, utility lines, poles, cars, and fences. These enhancements are attributed to the ability of density-based grid decimation to effectively recognize and preserve intricate details within the scene. Consequently, it emerges as an effective preprocessing technique for large-scale point clouds, offering efficiency gains across different semantic segmentation networks.
To validate the efficacy of the modules in RT-Net, we carried out a set of ablation studies concentrating on the LAA and AR modules. These studies were all conducted on the Toronto3D dataset applying density-based grid decimation, with evaluations focused on the L002 region.

4.4. Ablation of RT-Net Framework

The core components of the RT-Net framework are the LAA and AR modules, while the remainder of the framework adheres to the design put forward by Hu et al. [12]. To showcase the effectiveness of each component, we designed two key ablation experiments:
  • Removal of Self-Attention Pooling: This structure facilitates the aggregation of features from neighboring points in the point clouds. Upon its removal, we replaced it with standard max/mean/sum pooling for the local feature encoding.
  • Removal of Attention Residual: This structure enhances the effectiveness of the residual connections by emphasizing informative feature values through attention mechanisms. Upon its removal, we utilized an original residual connection.
Table 5 presents the mIoU results derived from our network ablation studies. The results underscore the crucial function of the self-attention pooling module in enhancing the network’s overall performance. Its absence leads to a notable 21.13% drop in mIoU, stemming from the substantial loss of detailed features during the random sampling phase. On the other hand, the AR module is also crucial for the network’s overall performance. Its removal is associated with a significant drop in mIoU, amounting to an 11.98% decrease. This indicates that while its impact may be less dramatic than the self-attention pooling module, the AR module is nonetheless a key contributor to the network’s effectiveness.

4.5. Ablation of AR

As described in Section 3, we formulated attention residuals and subsequently calculated attention residual features. To assess the impact of different residual block structures, we conducted additional ablation experiments, summarized in Table 6:
  • RT-Net with Residual: Utilizes a standard residual connections module.
  • RT-Net with Attention Residual: Uses an attention residual connections module with addition.
  • RT-Net with Attention Residual (Concatenation): Employs our complete attention residual module with concatenation.
Table 6 displays the impact of different residual configurations on RT-Net’s performance. Implementing the attention residual module in RT-Net led to a 5.15% improvement in mIoU scores over the use of the standard residual structure. Furthermore, utilizing the attention residual module with concatenation in RT-Net yielded an additional 3.04% increase in mIoU.

4.6. Efficiency of RT-Net Framework

To evaluate the advantages of our model in terms of inference time and model complexity, we conducted a series of control experiments comparing it with state-of-the-art models, with a focus on model parameter size and per-batch inference time. As shown in Table 7, our model has a relatively small number of parameters (5.5 M), only 0.9 M more than PointNet++ (4.6 M), which has the smallest parameter count. Thus, the model remains compact while maintaining competitive performance.
We extracted a point cloud block containing 40,960 points from the L002 scene of the Toronto3D as a batch to evaluate the inference speed of each model. As shown in Table 7, our model achieves an inference speed of 71 milliseconds per batch, outperforming the other models in terms of speed.

5. Conclusions

This paper presents RT-Net, a semantic segmentation framework for large-scale point clouds, harnessing the power of random sampling, attention mechanisms, and density-based grid decimation. RT-Net features innovative local attention aggregation and attention residual modules designed to capture a comprehensive range of features within point clouds, including both local and global characteristics. The integration of a density-based grid decimation algorithm for preprocessing large-scale point clouds addresses the issue of imbalanced sampling categories encountered with traditional preprocessing methods. These innovations enable RT-Net to outperform existing methods for the segmentation of small-sized object categories, such as road markings, utility lines, poles, and fences. Our approach has been rigorously evaluated on the Toronto3D, Semantic3D, and SemanticKITTI datasets, achieving state-of-the-art mIoU scores of 86.79% on Toronto3D and 79.88% on Semantic3D, while demonstrating significant improvements in small-sized object categories across all three benchmark datasets. These results highlight RT-Net’s robustness and adaptability to various datasets and scene types.
While RT-Net has demonstrated outstanding performance, there are several promising areas for future work. One area of exploration involves performing a sensitivity analysis on the parameters of density-based decimation. Such an analysis can help optimize density-based grid thinning techniques. Considering the labor-intensive process of annotating point cloud data, exploring domain-adaptive transfer learning is a crucial next step. This approach, by leveraging existing labeled datasets, could significantly enhance segmentation performance and streamline the learning process across various scenes. Additionally, achieving real-time capabilities in semantic segmentation of point clouds is vital for applications that require rapid analytical and decision-making processes. Further research in this area will not only reinforce RT-Net’s practicality in real-world applications but also ensure the network’s ongoing adaptability and responsiveness to new data.

Author Contributions

Conceptualization, Liangcun Jiang, Jiacheng Ma, and Boyi Shangguan; Methodology, Jiacheng Ma and Liangcun Jiang; Validation, Jiacheng Ma, Han Zhou, and Boyi Shangguan; Formal analysis, Jiacheng Ma and Boyi Shangguan; Investigation, Liangcun Jiang, Hongyu Xiao, and Zeqiang Chen; Writing—original draft, Liangcun Jiang, Jiacheng Ma, Boyi Shangguan, and Zeqiang Chen; Writing—review & editing, Liangcun Jiang, Jiacheng Ma, Han Zhou, and Boyi Shangguan; Visualization, Jiacheng Ma, Han Zhou, and Hongyu Xiao; Supervision, Liangcun Jiang and Han Zhou; Funding acquisition, Liangcun Jiang and Zeqiang Chen. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the National Key R&D Program of China (No. 2024YFB3909604) and the National Natural Science Foundation of China (No. 42471446); in part by the Open Fund of National Engineering Research Center of Geographic Information System, China University of Geosciences, Wuhan 430074, China (No. 2023KFJJ10).

Data Availability Statement

Data from Toronto3D can be accessed through https://github.com/WeikaiTan/Toronto-3D?tab=readme-ov-file#download (accessed on 26 April 2025). The Semantic3D data are available at https://www.semantic3d.net/ (accessed on 26 April 2025), and the SemanticKITTI dataset can be downloaded from https://semantic-kitti.org/ (accessed on 26 April 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
GPU: Graphics processing unit
LAA: Local attention aggregation
AR: Attention residual
FC: Fully connected
US: Up-sampling
MLP: Multi-layer perceptron
KNN: K-nearest neighbor
IoU: Intersection over union
mIoU: Mean intersection over union

References

  1. Charles, R.Q.; Su, H.; Kaichun, M.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 77–85. [Google Scholar] [CrossRef]
  2. Xie, Y.; Tian, J.; Zhu, X.X. Linking Points With Labels in 3D: A Review of Point Cloud Semantic Segmentation. IEEE Geosci. Remote Sens. Mag. 2020, 8, 38–59. [Google Scholar] [CrossRef]
  3. Weinmann, M.; Jutzi, B.; Hinz, S.; Mallet, C. Semantic Point Cloud Interpretation Based on Optimal Neighborhoods, Relevant Features and Efficient Classifiers. ISPRS J. Photogramm. Remote Sens. 2015, 105, 286–304. [Google Scholar] [CrossRef]
  4. Schnabel, R.; Wahl, R.; Klein, R. Efficient RANSAC for Point-Cloud Shape Detection. Comput. Graph. Forum 2007, 26, 214–226. [Google Scholar] [CrossRef]
  5. Strom, J.; Richardson, A.; Olson, E. Graph-Based Segmentation for Colored 3D Laser Point Clouds. In Proceedings of the 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, Taipei, Taiwan, 18–22 October 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 2131–2136. [Google Scholar] [CrossRef]
  6. Jiang, X.Y.; Meier, U.; Bunke, H. Fast Range Image Segmentation Using High-Level Segmentation Primitives. In Proceedings of the Third IEEE Workshop on Applications of Computer Vision WACV’96, Sarasota, FL, USA, 2–4 December 1996; IEEE Comput. Soc. Press: Sarasota, FL, USA, 1996; pp. 83–88. [Google Scholar] [CrossRef]
  7. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. Volume 30. [Google Scholar] [CrossRef]
  8. Li, Y.; Bu, R.; Sun, M.; Wu, W.; Di, X.; Chen, B. PointCNN: Convolution On X-Transformed Points. In Proceedings of the Advances in Neural Information Processing Systems 31 (NeurIPS 2018), Montréal, QC, Canada, 3–8 December 2018; Volume 31. [Google Scholar] [CrossRef]
  9. Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S.E.; Bronstein, M.M.; Solomon, J.M. Dynamic Graph CNN for Learning on Point Clouds. ACM Trans. Graph. 2019, 38, 5. [Google Scholar] [CrossRef]
  10. Fan, S.; Dong, Q.; Zhu, F.; Lv, Y.; Ye, P.; Wang, F.-Y. SCF-Net: Learning Spatial Contextual Features for Large-Scale Point Cloud Segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 14499–14508. [Google Scholar] [CrossRef]
  11. Zhao, H.; Jiang, L.; Jia, J.; Torr, P.; Koltun, V. Point Transformer. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 16239–16248. [Google Scholar] [CrossRef]
  12. Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. Learning Semantic Segmentation of Large-Scale Point Clouds with Random Sampling. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 8338–8354. [Google Scholar] [CrossRef]
  13. Liu, T.; Ma, T.; Du, P.; Li, D. Semantic Segmentation of Large-Scale Point Cloud Scenes via Dual Neighborhood Feature and Global Spatial-Aware. Int. J. Appl. Earth Obs. Geoinf. 2024, 129, 103862. [Google Scholar] [CrossRef]
  14. Zeng, Z.; Xu, Y.; Xie, Z.; Tang, W.; Wan, J.; Wu, W. Large-Scale Point Cloud Semantic Segmentation via Local Perception and Global Descriptor Vector. Expert Syst. Appl. 2024, 246, 123269. [Google Scholar] [CrossRef]
  15. Landrieu, L.; Simonovsky, M. Large-Scale Point Cloud Semantic Segmentation with Superpoint Graphs. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 4558–4567. [Google Scholar] [CrossRef]
  16. Wu, X.; Lao, Y.; Jiang, L.; Liu, X.; Zhao, H. Point Transformer V2: Grouped Vector Attention and Improved Sampling—Supplementary Material. In Proceedings of the Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022; Volume 35. pp. 33330–33342. [Google Scholar] [CrossRef]
  17. Wu, X.; Jiang, L.; Wang, P.-S.; Liu, Z.; Liu, X.; Qiao, Y.; Ouyang, W.; He, T.; Zhao, H. Point Transformer V3: Simpler, Faster, Stronger. arXiv 2024, arXiv:2312.10035. [Google Scholar] [CrossRef]
  18. Boulch, A.; Guerry, J.; Saux, B.L.; Audebert, N. SnapNet: 3D Point Cloud Semantic Labeling with 2D Deep Segmentation Networks. Comput. Graph. 2018, 71, 189–198. [Google Scholar] [CrossRef]
  19. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast Encoders for Object Detection From Point Clouds. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 12689–12697. [Google Scholar] [CrossRef]
  20. Yang, B.; Luo, W.; Urtasun, R. PIXOR: Real-Time 3D Object Detection from Point Clouds. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 7652–7660. [Google Scholar] [CrossRef]
  21. Lyu, Y.; Huang, X.; Zhang, Z. EllipsoidNet: Ellipsoid Representation for Point Cloud Classification and Segmentation. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 256–266. [Google Scholar] [CrossRef]
  22. Tchapmi, L.; Choy, C.; Armeni, I.; Gwak, J.; Savarese, S. SEGCloud: Semantic Segmentation of 3D Point Clouds. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 20 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 537–547. [Google Scholar] [CrossRef]
  23. Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 4490–4499. [Google Scholar] [CrossRef]
  24. Meng, H.-Y.; Gao, L.; Lai, Y.-K.; Manocha, D. VV-Net: Voxel VAE Net With Group Convolutions for Point Cloud Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 8499–8507. [Google Scholar] [CrossRef]
  25. Liu, Z.; Tang, H.; Lin, Y.; Han, S. Point-Voxel CNN for Efficient 3D Deep Learning. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019. Volume 32. [Google Scholar] [CrossRef]
  26. Chen, Y.; Liu, S.; Shen, X.; Jia, J. Fast Point R-CNN. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 9774–9783. [Google Scholar] [CrossRef]
  27. Li, H.; Guan, H.; Ma, L.; Lei, X.; Yu, Y.; Wang, H.; Delavar, M.R.; Li, J. MVPNet: A Multi-Scale Voxel-Point Adaptive Fusion Network for Point Cloud Semantic Segmentation in Urban Scenes. Int. J. Appl. Earth Obs. Geoinf. 2023, 122, 103391. [Google Scholar] [CrossRef]
  28. Park, J.; Kim, C.; Kim, S.; Jo, K. PCSCNet: Fast 3D Semantic Segmentation of LiDAR Point Cloud for Autonomous Car Using Point Convolution and Sparse Convolution Network. Expert Syst. Appl. 2023, 212, 118815. [Google Scholar] [CrossRef]
  29. Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 11105–11114. [Google Scholar] [CrossRef]
  30. Fan, H.; Yang, Y. PointRNN: Point Recurrent Neural Network for Moving Point Cloud Processing. arXiv 2019, arXiv:1910.08287. [Google Scholar] [CrossRef]
  31. Xu, Y.; Tang, W.; Zeng, Z.; Wu, W.; Wan, J.; Guo, H.; Xie, Z. NeiEA-NET: Semantic Segmentation of Large-Scale Point Cloud Scene via Neighbor Enhancement and Aggregation. Int. J. Appl. Earth Obs. Geoinf. 2023, 119, 103285. [Google Scholar] [CrossRef]
  32. Akwensi, P.H.; Wang, R.; Guo, B. PReFormer: A Memory-Efficient Transformer for Point Cloud Semantic Segmentation. Int. J. Appl. Earth Obs. Geoinf. 2024, 128, 103730. [Google Scholar] [CrossRef]
  33. Rethage, D.; Wald, J.; Sturm, J.; Navab, N.; Tombari, F. Fully-Convolutional Point Networks for Large-Scale Point Clouds. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Berlin/Heidelberg, Germany, 2018; Volume 11208; pp. 625–640. ISBN 978-3-030-01224-3. [Google Scholar] [CrossRef]
  34. Tatarchenko, M.; Park, J.; Koltun, V.; Zhou, Q.-Y. Tangent Convolutions for Dense Prediction in 3D. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 3887–3896. [Google Scholar] [CrossRef]
  35. Chen, S.; Niu, S.; Lan, T.; Liu, B. PCT: Large-Scale 3d Point Cloud Representations Via Graph Inception Networks with Applications to Autonomous Driving. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 4395–4399. [Google Scholar] [CrossRef]
  36. Thomas, H.; Qi, C.R.; Deschaud, J.-E.; Marcotegui, B.; Goulette, F.; Guibas, L. KPConv: Flexible and Deformable Convolution for Point Clouds. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 6410–6419. [Google Scholar] [CrossRef]
  37. Tan, W.; Qin, N.; Ma, L.; Li, Y.; Du, J.; Cai, G.; Yang, K.; Li, J. Toronto-3D: A Large-Scale Mobile LiDAR Dataset for Semantic Segmentation of Urban Roadways. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 797–806. [Google Scholar] [CrossRef]
  38. Hackel, T.; Savinov, N.; Ladicky, L.; Wegner, J.D.; Schindler, K.; Pollefeys, M. SEMANTIC3D.NET: A NEW LARGE-SCALE POINT CLOUD CLASSIFICATION BENCHMARK. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2017, IV-1/W1, 91–98. [Google Scholar] [CrossRef]
  39. Behley, J.; Garbade, M.; Milioto, A.; Quenzel, J.; Behnke, S.; Stachniss, C.; Gall, J. SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 9296–9306. [Google Scholar] [CrossRef]
  40. Li, Y.; Ma, L.; Zhong, Z.; Cao, D.; Li, J. TGNet: Geometric Graph CNN on 3-D Point Cloud Segmentation. IEEE Trans. Geosci. Remote Sens. 2020, 58, 3588–3600. [Google Scholar] [CrossRef]
  41. Wan, J.; Zeng, Z.; Qiu, Q.; Xie, Z.; Xu, Y. PointNest: Learning Deep Multiscale Nested Feature Propagation for Semantic Segmentation of 3-D Point Clouds. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 9051–9066. [Google Scholar] [CrossRef]
  42. Yoo, S.; Jeong, Y.; Jameela, M.; Sohn, G. Human Vision Based 3D Point Cloud Semantic Segmentation of Large-Scale Outdoor Scenes. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 6577–6586. [Google Scholar] [CrossRef]
  43. Boulch, A.; Saux, B.L.; Audebert, N. Unstructured Point Cloud Semantic Labeling Using Deep Segmentation Networks. 3dor@ Eurographics 2017, 3, 1–8. [Google Scholar] [CrossRef]
  44. Contreras, J.; Denzler, J. Edge-Convolution Point Net for Semantic Segmentation of Large-Scale Point Clouds. In Proceedings of the IGARSS 2019–2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 5236–5239. [Google Scholar] [CrossRef]
  45. Truong, G.; Gilani, S.Z.; Islam, S.M.S.; Suter, D. Fast Point Cloud Registration Using Semantic Segmentation. In Proceedings of the 2019 Digital Image Computing: Techniques and Applications (DICTA), Perth, Australia, 2–4 December 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–8. [Google Scholar] [CrossRef]
  46. Liu, C.; Zeng, D.; Akbar, A.; Wu, H.; Jia, S.; Xu, Z.; Yue, H. Context-Aware Network for Semantic Segmentation Toward Large-Scale Point Clouds in Urban Environments. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1. [Google Scholar] [CrossRef]
  47. Zeng, Z.; Xu, Y.; Xie, Z.; Tang, W.; Wan, J.; Wu, W. LEARD-Net: Semantic Segmentation for Large-Scale Point Cloud Scene. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102953. [Google Scholar] [CrossRef]
  48. Yin, F.; Huang, Z.; Chen, T.; Luo, G.; Yu, G.; Fu, B. DCNet: Large-Scale Point Cloud Semantic Segmentation With Discriminative and Efficient Feature Aggregation. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 4083–4095. [Google Scholar] [CrossRef]
  49. Luo, L.; Lu, J.; Chen, X.; Zhang, K.; Zhou, J. LSGRNet: Local Spatial Latent Geometric Relation Learning Network for 3D Point Cloud Semantic Segmentation. Comput. Graph. 2024, 124, 104053. [Google Scholar] [CrossRef]
  50. Xu, J.; Zhang, R.; Dou, J.; Zhu, Y.; Sun, J.; Pu, S. RPVNet: A Deep and Efficient Range-Point-Voxel Fusion Network for LiDAR Point Cloud Segmentation. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 16004–16013. [Google Scholar] [CrossRef]
  51. Hou, Y.; Zhu, X.; Ma, Y.; Loy, C.C.; Li, Y. Point-to-Voxel Knowledge Distillation for LiDAR Semantic Segmentation. arXiv 2022, arXiv:2206.02099. [Google Scholar] [CrossRef]
Figure 1. Network architecture of RT-Net. The notation signifies the count of points and feature dimensions, respectively.
Figure 2. The proposed Local Attention Aggregation module. It accepts point features as input, extracts their local detail information through an attention transformation, and fuses them into a channel output containing the local features of the points, thereby ensuring the network’s ability to recognize small-sized local objects.
Figure 3. The proposed Attention Residual module. It accepts aggregated features and network features as inputs and outputs attention residual features.
Figure 4. Grid-based decimation (1,496,761 points, dilution ratio 6.94%) and density-based grid decimation (1,125,806 points, dilution ratio 5.22%) on the L001 region of Toronto3D.
Figure 5. Semantic segmentation results on the Toronto3D dataset, compared with RandLA-Net.
Figure 6. Semantic segmentation results on SemanticKITTI. The results of the baseline experiment are shown in the middle, and our experimental results are shown on the right.
Table 1. Quantitative results on Toronto3D. Bold numbers mark the highest column values.
Methods | mIoU | Road | Rd Mrk. | Natural | Building | Util. Line | Pole | Car | Fence
PointNet++ [7] | 41.81 | 89.27 | 0.00 | 69.06 | 54.16 | 43.78 | 23.30 | 52.00 | 2.95
DGCNN [9] | 61.79 | 93.88 | 0.00 | 91.25 | 80.39 | 62.40 | 62.32 | 88.26 | 15.81
KPFCNN [36] | 69.11 | 94.62 | 0.06 | 96.07 | 91.51 | 87.68 | 81.56 | 85.66 | 15.81
TGNet [40] | 61.34 | 93.54 | 0.00 | 90.93 | 81.57 | 65.26 | 62.98 | 88.73 | 7.85
RandLA-Net [12] | 81.77 | 96.69 | 64.21 | 96.62 | 94.24 | 88.06 | 77.84 | 93.37 | 42.86
PointNest [41] | 74.7 | 91.0 | 27.9 | 96.2 | 89.5 | 88.3 | 78.6 | 91.1 | 35.1
MVPNet [27] | 84.14 | 98.00 | 76.36 | 97.34 | 94.77 | 87.69 | 84.61 | 94.63 | 39.74
EyeNet [42] | 81.13 | 96.98 | 65.02 | 97.83 | 93.51 | 86.77 | 84.86 | 94.02 | 30.01
PReFormer [32] | 75.8 | 96.8 | 65.4 | 92.4 | 84.6 | 82.0 | 68.3 | 85.5 | 31.2
DG-Net [13] | 82.1 | 97.1 | 65.3 | 97.2 | 92.6 | 88.1 | 84.2 | 93.6 | 38.7
LACV-Net [14] | 82.7 | 97.1 | 66.9 | 97.3 | 93.0 | 87.3 | 83.4 | 93.4 | 43.1
RandLA-Net (Ours rep.) | 76.64 | 93.10 | 55.23 | 94.45 | 93.35 | 76.21 | 73.37 | 80.24 | 47.18
RandLA-Net (Ours w/density-grid) | 81.55 | 94.77 | 60.85 | 96.25 | 95.31 | 80.66 | 79.28 | 86.99 | 54.64
Ours (w/RGB w/o density-grid) | 80.85 | 94.95 | 63.72 | 96.00 | 95.01 | 81.30 | 80.34 | 86.06 | 49.42
Ours (w/RGB and density-grid) | 86.79 | 92.28 | 81.22 | 95.03 | 89.96 | 86.97 | 90.45 | 88.06 | 70.29
Table 2. Quantitative results on Semantic3D (Semantic-8). Bold numbers mark the highest column values.
Methods | mIoU | Man-made | Natural | High Veg. | Low Veg. | Buildings | Hard Scape | Scanning Art. | Cars
PointNet++ [7] | 63.1 | 81.9 | 78.1 | 64.3 | 51.7 | 75.9 | 36.4 | 43.7 | 72.6
SPGraph [15] | 76.2 | 91.5 | 75.6 | 78.3 | 71.7 | 94.4 | 56.8 | 52.9 | 88.4
ConvPoint [43] | 76.5 | 92.1 | 80.6 | 76.0 | 71.9 | 95.6 | 47.3 | 61.1 | 87.7
EdgeConv [44] | 64.4 | 91.1 | 69.5 | 65.0 | 56.0 | 89.7 | 30.0 | 43.8 | 69.7
RGNet [45] | 72.0 | 86.4 | 70.3 | 69.5 | 68.0 | 96.9 | 43.4 | 52.3 | 89.5
RandLA-Net [12] | 77.8 | 97.4 | 93.0 | 70.2 | 65.2 | 94.4 | 49.0 | 44.7 | 92.7
SCF-Net [10] | 77.6 | 97.1 | 91.8 | 86.3 | 51.2 | 95.3 | 50.5 | 67.9 | 80.7
CAN [46] | 74.7 | 97.9 | 94.1 | 70.8 | 64.3 | 94.0 | 48.5 | 38.8 | 89.2
LEARD-Net [47] | 74.5 | 97.5 | 92.7 | 74.6 | 61.0 | 93.2 | 40.2 | 44.2 | 92.2
DCNet [48] | 74.1 | 97.9 | 86.5 | 72.9 | 64.6 | 96.2 | 48.7 | 35.3 | 90.4
LSGRNet [49] | 77.5 | 97.2 | 91.2 | 84.4 | 52.2 | 94.8 | 51.6 | 70.1 | 78.5
RandLA-Net (Ours rep.) | 71.80 | 91.71 | 86.81 | 87.51 | 55.07 | 91.93 | 31.26 | 54.07 | 76.03
RandLA-Net (Ours w/density-grid) | 76.22 | 90.13 | 87.65 | 87.29 | 57.96 | 93.52 | 55.75 | 62.84 | 74.59
Ours (w/RGB w/o density-grid) | 74.58 | 90.20 | 86.67 | 83.34 | 61.70 | 92.44 | 52.76 | 59.30 | 70.28
Ours (w/RGB and density-grid) | 79.88 | 92.41 | 86.91 | 90.35 | 63.32 | 95.04 | 60.48 | 67.55 | 82.96
Table 3. Quantitative results on SemanticKITTI. Bold numbers mark the highest column values.
Methods | mIoU | Road | Sidewalk | Parking | Other-Gro. | Building | Car | Truck | Bicycle | Motorcycle | Other-Veh. | Vegetation | Trunk | Terrain | Person | Bicyclist | Motorcyclist | Fence | Pole | Traffic Sign
PointNet++ [7] | 20.1 | 72.0 | 41.8 | 18.7 | 5.6 | 62.3 | 53.7 | 0.9 | 1.9 | 0.2 | 0.2 | 46.5 | 13.8 | 30.0 | 0.9 | 1.0 | 0.0 | 16.9 | 6.0 | 8.9
KPConv [36] | 58.8 | 90.3 | 72.7 | 61.3 | 31.5 | 90.5 | 95.0 | 33.4 | 30.2 | 42.5 | 44.3 | 84.8 | 69.2 | 69.1 | 61.5 | 61.6 | 11.8 | 64.2 | 56.4 | 47.4
RandLA-Net [12] | 55.9 | 90.5 | 74.0 | 61.8 | 24.5 | 89.7 | 94.2 | 43.9 | 29.8 | 32.2 | 39.1 | 83.8 | 63.6 | 68.6 | 48.4 | 47.4 | 9.4 | 60.4 | 51.0 | 50.7
RPVNet [50] | 70.3 | 93.4 | 80.7 | 70.3 | 33.3 | 93.5 | 97.6 | 44.2 | 68.4 | 68.7 | 61.1 | 86.5 | 75.1 | 71.7 | 75.9 | 74.4 | 43.4 | 72.1 | 64.8 | 61.4
PVKD [51] | 71.2 | 91.8 | 70.9 | 77.5 | 41.0 | 92.4 | 97.0 | 67.9 | 69.3 | 53.5 | 60.2 | 86.5 | 73.8 | 71.9 | 75.1 | 73.5 | 50.5 | 69.4 | 64.9 | 65.8
RandLA-Net (Ours rep.) | 51.2 | 88.7 | 72.4 | 62.1 | 22.1 | 85.1 | 89.7 | 38.9 | 27.6 | 33.0 | 33.0 | 81.1 | 63.2 | 66.8 | 44.6 | 42.1 | 8.3 | 54.8 | 47.5 | 45.2
RandLA-Net (Ours w/density-grid) | 53.6 | 84.3 | 80.2 | 63.3 | 37.6 | 91.3 | 91.7 | 41.8 | 45.6 | 59.2 | 32.8 | 84.9 | 68.5 | 70.5 | 64.5 | 49.9 | 20.8 | 68.2 | 60.4 | 59.3
Ours (w/RGB w/o density-grid) | 70.2 | 89.9 | 76.8 | 59.3 | 40.1 | 91.6 | 96.8 | 57.9 | 43.5 | 53.5 | 58.8 | 80.2 | 72.8 | 70.9 | 60.5 | 66.4 | 47.3 | 69.8 | 61.5 | 59.8
Ours (w/RGB and density-grid) | 69.9 | 88.9 | 86.9 | 67.4 | 50.8 | 93.4 | 97.4 | 57.6 | 72.6 | 70.2 | 60.4 | 83.4 | 71.9 | 60.4 | 52.4 | 60.8 | 46.9 | 80.8 | 72.8 | 74.1
Table 4. The mIoU results of RT-Net with different grid sizes.
Model | mIoU (%)
Full RT-Net with 0.01 grid-based decimation | 76.66
Full RT-Net with 0.06 grid-based decimation | 80.85
Full RT-Net with 1.0 initial grid density-based grid decimation | 86.91
Full RT-Net with 10 initial grid density-based grid decimation | 86.79
Table 5. The mIoU results of network ablation experiments.
Model | mIoU (%)
Removing self-attention pooling | 65.66
Removing attention residual | 74.81
The full framework (RT-Net) | 86.79
Table 6. The mIoU results of ablation experiments with different residual configurations.
Model | mIoU (%)
RT-Net with residual | 74.81
RT-Net with attention residual | 79.96
RT-Net with attention residual (concatenation) | 83.00
Table 7. Comparison of several models in terms of parameter size and per-batch inference time. Bold values indicate the best results across all compared methods.
Model | Parameter Size | Per-Batch Time
PointNet++ | 4.6 M | 189 ms
DGCNN | 10 M | 248 ms
MVPNet | 14 M | 255 ms
DGNet | 10 M | 228 ms
RandLA-Net | 4.7 M | 88 ms
Point Transformer | 30 M | 357 ms
Ours | 5.5 M | 71 ms
