Article

SeqConv-Net: A Deep Learning Segmentation Framework for Airborne LiDAR Point Clouds Based on Spatially Ordered Sequences

by Bin Guo, Chunjing Yao *, Hongchao Ma, Jie Wang and Junhao Xu

School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430072, China

* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(11), 1927; https://doi.org/10.3390/rs17111927
Submission received: 16 April 2025 / Revised: 30 May 2025 / Accepted: 30 May 2025 / Published: 1 June 2025

Abstract:
Point cloud data provide three-dimensional (3D) information about objects in the real world, containing rich semantic features. Therefore, the task of semantic segmentation of point clouds has been widely applied in fields such as robotics and autonomous driving. Although existing research has made unprecedented progress, achieving real-time semantic segmentation of point clouds on airborne devices still faces challenges due to excessive computational and memory requirements. To address this issue, we propose a novel sequence convolution semantic segmentation architecture that integrates Convolutional Neural Networks (CNN) with a sequence-to-sequence (seq2seq) structure, termed SeqConv-Net. This architecture views point cloud semantic segmentation as a sequence generation task. Based on our unique perspective of spatially ordered sequences, we use Recurrent Neural Networks (RNN) to encode elevation information, then input the structured hidden states into a CNN for planar feature extraction. The results are combined with the RNN’s encoded outputs via residual connections and are fed into a decoder for sequence prediction in a seq2seq manner. Experiments show that the SeqConv-Net architecture achieves 75.5% mean Intersection Over Union (mIOU) accuracy on the DALES dataset, with the total processing speed from data preprocessing to prediction being several to tens of times faster than existing methods. Additionally, SeqConv-Net can balance accuracy and speed by adjusting the hyperparameters and using different RNNs and CNNs, providing a new solution for real-time point cloud semantic segmentation in airborne environments.

1. Introduction

Light Detection and Ranging (LiDAR) devices can directly generate three-dimensional (3D) coordinate information of spatial points, known as point clouds. Due to the accurate representation of 3D shapes of objects, point cloud semantic segmentation is widely used in environmental perception and terrain analysis in remote sensing and mapping. In recent years, with the rapid development of deep learning, research on processing 3D remote sensing data has also achieved unprecedented breakthroughs. Airborne point clouds, as an important category of 3D remote sensing data, can capture more comprehensive and richer information compared to ground-based point clouds, thanks to the high-altitude advantage of drones.
Given the importance and widespread application of point cloud semantic segmentation, many studies have attempted to approach point cloud data from different perspectives for segmentation. The technology of point cloud semantic segmentation has gone through several developmental stages. Traditional methods include Support Vector Machines [1] and Random Forests [2]. Later, researchers proposed methods based on Multi-Layer Perceptrons (MLP) [3], graph-based methods [4], point convolution methods [5], and transformer-based methods [6]. MLP-based methods apply point-wise MLP operations to point clouds and use algorithms like k-nearest neighbors (KNN) for information fusion between points. Graph-based methods model points as graph structures for learning. Point convolution methods, inspired by image convolution, apply 3D point convolution for information extraction and fusion. Recently, due to the success of transformers in natural language processing (NLP) tasks, their structure has also been applied to point cloud semantic segmentation, achieving good results.

1.1. Related Work

1.1.1. Traditional Segmentation Methods

Traditional point cloud segmentation techniques include multi-scale analysis, region growing, clustering, graph cuts, model fitting, and supervoxel methods. Multi-scale analysis methods analyze point cloud features at different resolutions to achieve finer segmentation. Pauly et al. [7] proposed a multi-scale analysis method for point cloud segmentation by analyzing geometric features at different scales. Region growing algorithms start from seed points and merge neighboring points based on local geometric features (e.g., normal vectors and curvature), suitable for planar region segmentation. T. Rabbani et al. [8] proposed a region growing algorithm with smoothness constraints for segmenting planar regions in point clouds. Clustering methods (e.g., K-means and DBSCAN) divide point clouds into clusters for segmentation, with DBSCAN being widely used due to its robustness to noisy data. Graph cut methods build adjacency graphs of point clouds and use minimum cut algorithms to divide the graph into subgraphs, suitable for complex scene segmentation. Golovinskiy et al. [9] proposed a graph cut-based point cloud segmentation method. Model fitting methods extract target structures from noisy data by fitting geometric models (e.g., planes and cylinders). Schnabel et al. [10] proposed a RANSAC-based point cloud segmentation method for extracting geometric shapes like planes and cylinders. Additionally, supervoxel methods divide point clouds into small regions with similar properties, combining geometric and color information for segmentation, providing an effective means for complex scene processing. Papon et al. [11] proposed a supervoxel-based point cloud segmentation method.
After machine learning gained traction, methods relying on manually designed features and traditional machine learning became mainstream. These methods extract geometric features, shape descriptors, or histogram features from point clouds to build classifiers for segmentation. Specifically, geometric features include point cloud structure, normal vectors, and local geometric properties, allowing objects to be modeled as point, line, or surface structures. Rusu et al. [12] proposed a point cloud segmentation method based on geometric features. They computed surface normal, curvature, and point density to build feature vectors and used Support Vector Machines (SVM) for classification, achieving good results in indoor scene segmentation. Shape descriptors model point clouds based on local structural features. Guo et al. [13] proposed a point cloud segmentation method using “Spin Image” shape descriptors to capture local shape information, combined with Random Forest classifiers, showing excellent performance in 3D object recognition and scene segmentation. Histogram features are also widely used. Lai et al. [14] proposed a point cloud segmentation method based on histogram features, using histograms of local geometric properties (e.g., normal direction and curvature) to train classifiers, performing well in outdoor scene segmentation.

1.1.2. Deep Learning-Based Methods

The rapid development of deep learning has significantly advanced point cloud semantic segmentation. In recent years, many excellent point cloud segmentation methods have emerged, which can be broadly categorized into the following types:
Point-based methods: These networks treat point clouds as unordered point sets and directly apply computations to the points without preprocessing, resulting in higher accuracy. PointNet aggregates global features through max pooling, while PointNet++ [15] uses farthest point sampling for downsampling. RandLA-Net [16] introduces random sampling for layer-wise downsampling, and PointCNN [17] proposes an innovative X-transformation that learns to spatially align local points into a canonical order, enabling effective convolution operations. ConvPoint [18] introduces continuous convolutions using MLP-based weight functions, enabling adaptive feature aggregation across irregular point distributions. Point convolution networks like KPConv and PointConv [19] apply convolution operations directly on points for learning, making them a current research focus. Point-based methods, while often achieving higher accuracy, can be highly time-consuming when processing dense point clouds or large-scale scenes due to their computational speed being heavily dependent on the number of points.
Projection and multi-view methods: In these networks, researchers project 3D point clouds onto one or more two-dimensional (2D) images using specific projection methods, then process the projected 2D images with CNNs to obtain results. SqueezeSeg [20] projects point clouds onto a spherical surface and inputs them into a CNN, while RangeNet++ [21] uses range images as intermediate representations. Multi-View CNN [22] projects point clouds from multiple views and concatenates them for segmentation. Despite their efficiency, projection-based methods suffer from several inherent limitations. The projection process inevitably causes information loss, particularly in occluded regions and areas with complex geometries. These methods also struggle to fully exploit the 3D spatial relationships present in raw point clouds, limiting their performance in fine-grained segmentation tasks.
Voxel-based methods: These networks convert point clouds into fixed-resolution grids based on distance or point count and apply 3D convolution or other operations. During voxelization, PointGrid [23] ensures a constant number of points per voxel to retain the point cloud density features. VV-Net [24] subdivides each voxel into sub-voxels and uses smooth radial basis functions to reconstruct density. SSCNs [25] introduce a convolution operation to alleviate the sparsity issue in voxelization. However, the inherent sparsity of point clouds results in inefficient computations, as most voxels remain empty. Techniques like sparse convolutions can mitigate but not fully eliminate these issues, and the performance still heavily depends on the chosen voxel size, creating a trade-off between accuracy and efficiency.
Other methods: Other approaches include hybrid segmentation networks, graph-based networks, and transformer-based methods. Hybrid segmentation networks combine the advantages of point and voxel methods. Point–Voxel CNN [26] uses points to capture high-resolution geometric features and voxel convolution to extract low-resolution features. Graph-based networks use Graph Convolutional Neural Networks (GCNs) for processing, but GCNs are complex and are still in the research phase. Recently, transformer-based networks ([27,28,29]) have presented a new direction. With their excellent long-range feature-capture capabilities, they often perform better in point cloud semantic segmentation tasks. Self-attention mechanisms can capture global dependencies in point clouds and combine local feature extraction modules to achieve semantic segmentation, showing good performance on multiple datasets. However, the high computational overhead of the self-attention module is still one of the problems to be solved. To address this issue, a novel network based on the Mamba architecture has recently been proposed ([30,31]), aiming to resolve the high computational demands of attention mechanisms. Many researchers are actively exploring the potential of the Mamba architecture.

1.2. Motivations

Despite the availability of the widely used point-based models (PointNet++, RandLA-Net), voxel-based models (PointGrid, VV-Net), and transformer-based models (Point Cloud Transformer) for point cloud semantic segmentation, these methods often fail to meet real-time processing requirements on low-power, computationally constrained devices, such as drones. Effectively modeling point cloud data while reducing dimensionality and extracting spatial information from clouds containing millions or even billions of points remains a challenge, particularly on devices with strict computational and memory limitations. This paper proposes a novel point cloud modeling concept—spatially ordered sequences—and develops an efficient point cloud semantic segmentation architecture, SeqConv-Net, to achieve the real-time processing of resource-constrained airborne point clouds.

1.3. Contributions

The proposed concept of spatially ordered sequences and the SeqConv-Net architecture address the critical challenge of real-time processing in point cloud analysis. Unlike existing methods, SeqConv-Net achieves an order-of-magnitude speed improvement (up to ten times faster) without sacrificing accuracy compared to mainstream models. Its highly modular design ensures compatibility with diverse technologies—from traditional RNNs and CNNs to modern architectures like Transformers and Mamba [32]—enabling seamless integration with existing frameworks. This flexibility allows users to adapt the model for specific applications by swapping components (e.g., encoders/decoders) to balance computational efficiency and performance.
In our experiments, we designed the first point cloud semantic segmentation network based on the SeqConv-Net framework, with a 2-layer Gated Recurrent Unit (GRU) as the encoder and decoder and a UNet as the CNN. Experiments show that our network can complete inference in just a few seconds for point clouds with tens of millions of points, even on resource-constrained devices. For point clouds with millions of points, it can perform real-time inference while maintaining good segmentation accuracy. The SeqConv-Net architecture, designed from the perspective of spatially ordered sequences, provides a new approach to point cloud data processing and semantic segmentation tasks. In summary, the contributions of this paper are as follows:
(1).
Spatially Ordered Sequence Perspective: We innovatively propose the idea of spatially ordered sequences, where points at different elevations at the same planar position are viewed as a sequence from low to high, with the sequence values carrying elevation information. This recasts point cloud semantic segmentation as the generation of a sequence of the same length, providing a new way to process point cloud data.
(2).
SeqConv-Net Point Cloud Semantic Segmentation Architecture: Based on the spatially ordered sequence perspective, we design an RNN+CNN point cloud semantic segmentation architecture called SeqConv-Net and innovatively use RNN hidden states as CNN inputs to fuse planar spatial information.
(3).
Construction and Validation of SeqConv-Net: We design the first network based on the SeqConv-Net architecture and validate its feasibility. Experiments show that our SeqConv-Net design is not only efficient and reliable but also interpretable. Compared to previous methods, it significantly improves the speed of point cloud semantic segmentation in large scenes while maintaining accuracy.

2. Point Cloud Semantic Segmentation Framework

2.1. Architecture Overview

To better balance the trade-off between segmentation accuracy and speed, aiming for efficient real-time segmentation on airborne devices, we propose the SeqConv-Net architecture. This architecture models point clouds from the unique perspective of spatially ordered sequences, treating point cloud segmentation as a sequence-to-sequence generation task. Compared with the existing methods, this enables our model to achieve a 10 to 100 times speed improvement while maintaining high accuracy.
The overall framework is illustrated in Figure 1. Given a range of point clouds, they are first spatially serialized, then an RNN is used to encode elevation information, extracting valid information and mapping it to structured hidden states. The CNN takes this hidden state matrix as input, further fusing information between sequences. Both networks work together to extract information from the XY plane and Z elevation directions, with residual connections. Finally, the decoder generates predictions for each position in sequence.

2.2. Spatially Ordered Sequences

2.2.1. The Concept of Spatially Ordered Sequences

In the field of point cloud processing, voxelization methods typically downsample point clouds at a fixed spatial resolution, converting them into regular 3D voxel grids. Although this method can transform unstructured point cloud data into a structured representation for 3D convolution operations, its limitations are evident. First, the voxelization process inevitably introduces a large number of empty values (voxels not containing point clouds), which not only fail to provide effective geometric information but also cause a large number of invalid inputs and redundant computations during CNN training and inference. Second, since 3D convolution operations are computationally expensive, the model’s efficiency is further reduced.
However, the advantages of voxelization are also significant. It provides an efficient and structured way to process data. By downsampling point clouds in a regularized manner, it reduces data volume while retaining key geometric features of the point cloud, which is particularly important for processing large-scale point cloud data. Additionally, voxelization converts point clouds into regular 3D grids, where the absolute and relative positions of each voxel in space are explicit and fixed. This inherent regularity eliminates the need for additional KNN algorithms to extract spatial relationships between points, simplifying the feature extraction process. Moreover, the structured point cloud after voxelization can be processed using more mature convolution algorithms, making it highly compatible with the existing deep learning frameworks and facilitating efficient parallel computing.
To address the issue of empty values in traditional voxelization methods while leveraging the advantages of voxelization in downsampling and structured representation, we propose a novel perspective on point cloud voxel representation—spatially ordered sequences. This idea is inspired by the way humans perceive 3D space: when observing a 3D scene, if there are gaps between objects in the elevation direction, the brain typically ignores these gaps and naturally expresses the effective points in space as a sequence from low to high. This cognitive approach provides important insights: during voxelization, invalid voxels caused by gaps between objects at different elevations do not need to be explicitly stored or processed. Instead, only the valid voxels are organized into sequences of varying lengths to effectively represent vertical spatial changes.
Specifically, let the vector
$$V_{x,y} = [\mathbf{v}_0, \mathbf{v}_1, 0, 0, \mathbf{v}_2, 0, \mathbf{v}_3, 0, \mathbf{v}_4]$$
represent the voxel distribution along the vertical direction at a planar position $(x, y)$ after voxelization (the bold entries denote voxels, distinguishing them from the number 0), where $\mathbf{v}_i$ denotes the $(i+1)$-th valid voxel from bottom to top and 0 denotes an invalid voxel caused by spatial discontinuities. The generated spatial sequence is as follows:
$$S_{x,y} = [\mathbf{v}_0, \mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3, \mathbf{v}_4, \mathbf{end}, 0, 0, 0, 0]$$
where $\mathbf{end}$ is a special voxel artificially appended after the last valid voxel in each spatial sequence, marking the end of the valid voxels (the fast generation algorithm for spatial sequences is detailed in the next section).
Spatially ordered sequences arrange valid voxels at each planar position in elevation order, forming a dynamically sized sequence. This approach effectively down-samples point clouds while retaining spatial structural information, improving data compactness and computational efficiency. During serialization, all valid positions are pushed to the front, allowing the RNN to avoid processing intermediate padding positions. The special end voxel also ensures that the RNN correctly learns the valid parts of the sequence, excluding the influence of padding voxels at the end.
In traditional voxelization, the average elevation of the points in a voxel is often used to represent the voxel's position. However, to enable the use of embedding techniques from NLP tasks, the elevation representation in spatially ordered sequences is entirely different. Specifically, we use the voxel's index as the elevation representation. For the spatially ordered sequence $S_{x,y}$, the elevation representation using voxel indices is as follows:
$$S_{x,y} = [1, 2, 5, 7, 9, 10, 0, 0, 0, 0]$$
where 0 is used as a padding value to fill the sequence to the required length, and each number is the index of the corresponding valid voxel in the original vector plus one. Figure 2 also illustrates the serialization process for a voxel matrix of shape (4,3,4) (following the deep learning convention, the last two dimensions correspond to the matrix rows and columns).
This index-based elevation representation has four significant advantages:
(1).
Avoids issues with absolute elevation: Traditional methods using elevation coordinates may face inconsistencies or computational complexity due to variations in elevation range or noise. Using indices as elevation representations converts elevation information into relative positional relationships, avoiding instability caused by absolute elevation values and enhancing model robustness.
(2).
Fast generation of spatially related sequences: The index-based elevation representation allows for accelerated sequence generation using sorting algorithms. By sorting the sequence and using the indices of valid positions as input, spatial sequences can be quickly generated. This serialized representation facilitates efficient computer processing and significantly reduces preprocessing time compared to methods like KNN, especially for large-scale point cloud data.
(3).
Utilizes NLP embedding methods: By using integer indices as elevation representations, we can leverage embedding techniques from NLP, mapping indices to high-dimensional vector spaces for computation. This embedding representation captures relationships between elevations and provides richer feature representations for deep learning models, enhancing their expressive power.
(4).
Efficient voxel-to-point cloud recovery: During prediction, the network can sequentially output predictions for each valid position and use the indices to restore the correspondence between predictions and original voxels. This recovery process is computationally efficient and does not introduce additional losses.

2.2.2. Generation Algorithm for Spatially Ordered Sequences

Spatially ordered sequences not only facilitate RNN encoding but also have a simple and fast generation algorithm.
Let the vector
$$V_{x,y} = [\mathbf{v}_0, \mathbf{v}_1, 0, 0, \mathbf{v}_2, 0, \mathbf{v}_3, 0, \mathbf{v}_4]$$
represent the voxel distribution at a planar position $(x, y)$ after voxelization. By setting the valid positions to 1, a mask vector
$$Mask_{x,y} = [1, 1, 0, 0, 1, 0, 1, 0, 1]$$
is generated. Adding an end marker at the end of the mask vector results in the padded mask vector:
$$Mask'_{x,y} = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]$$
To push the valid positions to the front, we use a stable (order-preserving) sorting algorithm to sort $Mask'_{x,y}$ in descending order, obtaining the sorted result and the corresponding indices:
$$Result_{x,y},\ Index_{x,y} = \mathrm{Sort}(Mask'_{x,y})$$
where
$$Result_{x,y} = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]$$
$$Index_{x,y} = [0, 1, 4, 6, 8, 9, 2, 3, 5, 7]$$
The generated spatial sequence is
$$SpatialSequence_{x,y} = Result_{x,y} \odot (Index_{x,y} + 1) = [1, 2, 5, 7, 9, 10, 0, 0, 0, 0]$$
where $\odot$ denotes element-wise multiplication. The purpose of the +1 is to reserve index 0 as the padding value.
With the spatially ordered sequence, a corresponding label sequence is needed for training. For the label sequence, let the vector
$$C_{x,y} = [C_0, C_1, 0, 0, C_2, 0, C_3, 0, C_4]$$
denote the voxel labels at the same planar position, where $C_i$ is the mode of the labels of all points in the $(i+1)$-th valid voxel. Similarly, a 0 is appended at the end to match the length of $Mask'_{x,y}$. Thus,
$$C'_{x,y} = [C_0, C_1, 0, 0, C_2, 0, C_3, 0, C_4, 0]$$
To generate the sequence-form label $SequenceLabel$ and the teacher-forcing input $TeacherForcing$, the sorting indices $Index_{x,y}$ from the spatially ordered sequence generation process are used:
$$SequenceLabel = \mathrm{TakeAlongWith}(C'_{x,y}, Index_{x,y}) = [C_0, C_1, C_2, C_3, C_4, 0, 0, 0, 0, 0]$$
$$TeacherForcing = [Start, C_0, C_1, C_2, C_3, C_4, 0, 0, 0, 0]$$
Because it relies only on sorting, the generation of spatially ordered sequences can be accelerated with hardware-friendly operations, leveraging sorting algorithms with a time complexity of $\Theta(n \log n)$.
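To make the procedure concrete, the following is a minimal NumPy sketch of the serialization of a single voxel column, written per column for readability (a practical implementation would vectorize over all planar positions at once; the names `occupancy`, `labels`, and `max_len` are illustrative, not taken from the paper's code):

```python
import numpy as np

def serialize_column(occupancy, labels, max_len):
    """occupancy: (Z,) boolean mask of valid voxels in one (x, y) column.
    labels:    (Z,) per-voxel labels (0 where the voxel is empty).
    Returns the spatial sequence, its label sequence, and the sorting indices."""
    mask = np.concatenate([occupancy.astype(np.int64), [1]])   # Mask' (end marker appended)
    lab = np.concatenate([labels.astype(np.int64), [0]])       # C'
    index = np.argsort(-mask, kind="stable")                   # stable descending sort -> Index
    result = mask[index]                                        # Result
    seq = result * (index + 1)                                  # +1 reserves 0 for padding
    seq_label = lab[index]                                      # SequenceLabel
    seq = np.pad(seq, (0, max(0, max_len - seq.size)))[:max_len]
    seq_label = np.pad(seq_label, (0, max(0, max_len - seq_label.size)))[:max_len]
    return seq, seq_label, index

# The running example: valid voxels at heights 0, 1, 4, 6, 8 of a 9-voxel column.
occ = np.array([1, 1, 0, 0, 1, 0, 1, 0, 1], dtype=bool)
lab = np.array([3, 3, 0, 0, 5, 0, 5, 0, 2])
seq, seq_label, index = serialize_column(occ, lab, max_len=10)
print(seq)        # [ 1  2  5  7  9 10  0  0  0  0]
print(seq_label)  # [3 3 5 5 2 0 0 0 0 0]
```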

2.2.3. Differences Between Spatially Ordered Sequences and NLP Sequences

Although voxelated point clouds share similarities with sequences in NLP, there are significant differences due to the inherent characteristics of point cloud data.
First, in NLP, tokens can appear multiple times at different positions, and each token has its own semantic meaning, making NLP sequence processing more flexible and diverse. In spatially ordered sequences, by contrast, the values represent elevation information and follow a strict order: within each sequence the values are strictly increasing in elevation. This strictness plays an important role. It simplifies the model’s learning task, as the model does not need to handle complex contextual repetitions or long-range dependencies as in NLP. It also makes the generation and processing of spatial sequences more efficient, since the elevation order allows sequences to be generated rapidly through sorting and indexing operations.
Second, in NLP, the effective length of sequences is usually long, and in some large-scale tasks, the sequence length can reach thousands or even tens of thousands. This long-sequence characteristic requires NLP models to have strong context capture and long-range dependency modeling capabilities. In contrast, in real-world scenarios, the effective length of point cloud sequences at the same planar position is usually short, typically not exceeding a few dozen to a few hundred. This short-sequence characteristic makes the learning task for point cloud sequence data relatively simpler, allowing the model to more easily learn elevation and spatial distribution patterns while reducing computational complexity.

2.3. Spatial Information Processing

2.3.1. Elevation Information Extraction Using RNNs

Point cloud data differs fundamentally from common 2D image data, as it not only contains 2D planar coordinate information but also elevation information. This 3D characteristic makes point cloud data richer and more complex in expressing spatial structures while also posing significant challenges for data processing. How to effectively process and fuse information from these three dimensions is a core issue in point cloud data processing. Current popular methods rely on KNN, 3D CNNs, projection, point convolution, transformers, etc., which attempt to simultaneously extract and fuse 3D information. However, these methods often struggle to balance computational efficiency and feature extraction accuracy.
To address this issue, we adopt a divide-and-conquer strategy for 3D information: first, we fuse features in the elevation direction, and then we fuse the already fused features in the planar direction. This divide-and-conquer strategy not only reduces computational complexity but also better captures unique information in each dimension.
Recurrent Neural Networks [33], with their unique structural design, have shown significant advantages in processing variable-length sequence data. Through their recurrent mechanism, RNNs can process each element in the sequence step by step, passing information from previous steps to the current step, effectively capturing contextual information in the sequence.
In natural language processing, RNNs have been proven successful in various tasks such as language modeling, machine translation, and text generation. In these tasks, text sequences of different lengths are mapped to fixed-length hidden states after being processed by RNNs. These hidden states not only represent local features of the sequence but also capture global contextual information, comprehensively characterizing the semantic and structural features of the entire sequence.
In NLP, for an input $X_t \in \mathbb{R}^{n \times d}$ at time step $t$ with batch size $n$ and embedding dimension $d$, the RNN encoder updates the current hidden state as follows:
$$H_t = \phi(H_{t-1}, X_t)$$
$$O_t = H_t W_{hq} + b_q$$
where $\phi$ denotes the RNN, $H_t$ is the current hidden state, $H_{t-1}$ is the previous hidden state, $O_t$ is the output at the current time step, and $W_{hq}$ and $b_q$ are the weights and biases of the output layer. This means that RNNs have the ability to integrate sequence information, and the final hidden state can characterize the entire sequence.
Inspired by the successful application of RNNs in NLP tasks, we can model and process point clouds from the perspective of spatially ordered sequences. By embedding each valid position’s elevation index into a dense vector representation and sequentially inputting these embedded vectors into the RNN, the recurrent structure can process each valid voxel’s position information step by step, passing feature information from previous positions to subsequent processing steps. Finally, the RNN outputs a fixed-length hidden state that captures both local geometric features of each valid position in the sequence and the global spatial structure information of the entire sequence (Figure 3). In this way, unstructured spatially ordered sequences can be encoded into structured hidden states for further processing.
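As an illustration of this encoding step, the following is a small PyTorch sketch; the dimensions and class names are illustrative rather than the paper's exact configuration, and for simplicity it uses a trainable embedding for the elevation indices, whereas Section 3.2.2 argues for a fixed sinusoidal encoding:

```python
import torch
import torch.nn as nn

class ElevationEncoder(nn.Module):
    def __init__(self, max_index=101, hidden=32, layers=2):
        super().__init__()
        # Index 0 is reserved for padding; 1..max_index-1 are elevation indices.
        self.embed = nn.Embedding(max_index, hidden, padding_idx=0)
        self.gru = nn.GRU(hidden, hidden, num_layers=layers, batch_first=True)

    def forward(self, seq):
        # seq: (B, L) integer spatial sequences, zero-padded at the end.
        x = self.embed(seq)            # (B, L, hidden)
        outputs, h_n = self.gru(x)     # h_n: (layers, B, hidden)
        return outputs, h_n            # h_n summarizes each column's elevation profile

encoder = ElevationEncoder()
seq = torch.tensor([[1, 2, 5, 7, 9, 10, 0, 0, 0, 0]])   # the example spatial sequence
_, hidden = encoder(seq)
print(hidden.shape)                                      # torch.Size([2, 1, 32])
```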

2.3.2. Planar Information Fusion and Extraction Using CNNs

While RNNs encode hidden states, they only consider the sequential relationships between valid voxels in the elevation dimension and ignore the spatial relationships between adjacent voxels in the same planar space. In other words, RNNs can only fuse and extract information from valid voxels along the elevation dimension (i.e., the sequence direction) and map this information into a fixed-length hidden state. Although this hidden state can characterize local and global features of valid voxels in the elevation direction, it completely ignores the distribution information of point clouds in the planar direction.
Therefore, to more comprehensively extract the spatial structural features of point cloud data, it is necessary to further extract and fuse planar position information. In the extraction and fusion of planar position information, CNNs [34] have been widely used and proven to be one of the best solutions. Through their local receptive fields and weight-sharing mechanisms, CNNs can efficiently capture local patterns in 2D space, making them highly effective in image semantic segmentation and classification tasks. Based on this mature technology, we can combine the elevation information extracted by RNNs with the planar information extracted by CNNs to achieve 3D feature fusion of point cloud data.
To enable CNNs to process these hidden states, the hidden variables can be arranged according to their original planar positions to form a hidden variable matrix. Each element in this matrix is a hidden state vector generated by the RNN from that position (Figure 4). Through this operation, the planar spatial structure is restored while retaining the elevation information.
After the CNN processes the hidden variable matrix, the simplest approach is to feed the output directly into the RNN decoder. However, this can be improved. CNNs often use downsampling layers to reduce feature map resolution, lowering computational complexity and expanding the receptive field. Yet, this operation sacrifices fine-grained details, which are critical for small objects or complex structures in point clouds.
Instead of directly feeding the output to the RNN decoder, we propose a better strategy: connecting the CNN’s output with the initial hidden variables via a residual structure [35] before inputting them into the decoder.
$$DecoderInputState = \mathrm{RNN}(X) + \mathrm{CNN}(\mathrm{RNN}(X))$$
RNNs often face steep loss function spaces, leading to vanishing or exploding gradients during backpropagation. The residual structure provides a more direct gradient flow from the decoder to the encoder.
Moreover, the initial hidden variables, generated without downsampling, retain rich detail information from the elevation direction. By combining the CNN’s output with these initial hidden variables through a residual connection, the model compensates for detail loss during downsampling, enhancing its ability to reconstruct small objects and complex structures.
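A minimal sketch of this fusion step is shown below, assuming a single batch and a placeholder two-layer convolutional block standing in for the full UNet:

```python
import torch
import torch.nn as nn

hidden, H, W = 32, 160, 160                      # hidden size and planar grid size (example)
h_top = torch.randn(H * W, hidden)               # final-layer GRU state for every (x, y) column

# Arrange the hidden states as an image-like tensor: (batch=1, channels=hidden, H, W).
state_map = h_top.view(1, H, W, hidden).permute(0, 3, 1, 2)

cnn = nn.Sequential(                             # placeholder block standing in for the UNet
    nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
)

decoder_input_state = state_map + cnn(state_map)  # residual connection from the equation above
# Flatten back to per-column states to initialize the GRU decoder.
decoder_h0 = decoder_input_state.permute(0, 2, 3, 1).reshape(H * W, hidden)
print(decoder_h0.shape)                           # torch.Size([25600, 32])
```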

2.3.3. Prediction

In seq2seq [36] models, the prediction process is typically performed recursively (Figure 5), where the model uses the previous output as the next input, combined with the hidden state, to generate the output sequence step by step until the model predicts the end token. This mechanism is common in tasks such as machine translation and text generation, where the output sequence length is usually variable, and the model needs to dynamically decide when to stop generating.
Figure 5 illustrates the decoding behavior of an RNN. The hidden states encoded by the encoder are first used to initialize the decoder. A special fixed start token is then concatenated with the final layer’s hidden state and fed into the decoder together, ensuring that each RNN input incorporates all historical information. Subsequently, the hidden state gets updated, and each generated prediction is reused as input for the next step. This iterative process continues until the RNN outputs a special end token, at which point the entire prediction process terminates.
In our SeqConv-Net structure, the prediction process differs from traditional seq2seq models due to the nature of semantic segmentation tasks. In semantic segmentation tasks, there is a strict correspondence between input and output: each valid voxel must be assigned a corresponding prediction label, while invalid voxels should not participate in the prediction. In other words, semantic segmentation tasks require that the effective lengths of the input and output sequences at each spatial position must be exactly the same. This strict constraint leads to some differences between the SeqConv structure and the seq2seq models during training and prediction.
Seq2seq models need to learn how to dynamically determine the length of the output sequence, particularly when to generate the end token to terminate the prediction process. In the SeqConv structure, since the effective lengths of the input and output are fixed, the model does not need to learn how to predict the end token but only needs to focus on accurately predicting the first part of the effective length. Specifically, during training, the SeqConv-Net structure does not need to include the end token as part of the output, and during prediction, the model does not need to concern itself with whether the sequence should end. The model must predict a specified number of times, and predictions beyond that number should be invalid. Figure 5 illustrates the prediction process of the SeqConv structure, where the RNN decoder is initialized with the hidden state output by the encoder and the Start token to generate the first prediction. Subsequently, the previous prediction and the updated hidden state are used for the next prediction until the prediction length matches the effective length of the input sequence.
$$H_t = \phi\left(H_{t-1}, \mathrm{Concat}(C_{i-1}, H_{last\ layer})\right)$$
$$C_i = H_t W_{hq} + b_q$$
where $\phi$ denotes the RNN, $H_0 = DecoderInputState$, $C_0 = Start$, and $i, t \geq 1$.
The fixed-length to fixed-length sequence mapping characteristic of semantic segmentation tasks makes the SeqConv structure more efficient and straightforward in semantic segmentation tasks. It avoids the complexity and uncertainty introduced by dynamic length prediction and makes batch prediction more convenient.
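The following sketch illustrates this fixed-length decoding loop under assumed dimensions; the class count, start-token handling, and layer sizes are illustrative rather than the paper's implementation:

```python
import torch
import torch.nn as nn

num_classes, hidden, layers = 9, 32, 2
START = num_classes                                  # extra token id used as the start symbol
embed = nn.Embedding(num_classes + 1, hidden)
gru = nn.GRU(2 * hidden, hidden, num_layers=layers, batch_first=True)
head = nn.Linear(hidden, num_classes)

def decode(h0, steps):
    """h0: (layers, B, hidden) decoder initial state; returns (B, steps) class predictions."""
    B = h0.shape[1]
    prev = torch.full((B,), START, dtype=torch.long)  # C_0 = Start
    h = h0
    preds = []
    for _ in range(steps):                            # fixed number of steps (max valid length)
        # Concatenate the previous prediction's embedding with the last-layer hidden state.
        x = torch.cat([embed(prev), h[-1]], dim=-1).unsqueeze(1)   # (B, 1, 2*hidden)
        out, h = gru(x, h)
        prev = head(out.squeeze(1)).argmax(dim=-1)    # feed the prediction back in
        preds.append(prev)
    return torch.stack(preds, dim=1)

predictions = decode(torch.zeros(layers, 4, hidden), steps=10)
print(predictions.shape)                              # torch.Size([4, 10])
```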
During batch prediction, the model can generate corresponding prediction vectors by specifying a sufficient number of prediction steps. However, since the effective lengths of input sequences vary, the model needs to use the maximum length as the prediction length during batch prediction. Therefore, for sequences with shorter effective lengths, the model will generate results beyond the actual effective length. To filter out irrelevant prediction results, we can use the one-to-one correspondence between input and prediction in semantic segmentation tasks. Assuming the prediction sequence at a planar position is
$$Pred_{x,y} = [C_0, C_1, C_2, C_3, C_4, C_8, C_3, C_6, C_2]$$
Using the sorted mask vector
$$Result_{x,y} = [1, 1, 1, 1, 1, 0, 0, 0, 0]$$
from the spatial sequence generation process (here without the appended end-marker position), multiplying the two element-wise yields the prediction results only for the valid positions. Additionally, to maintain length consistency, a 0 is appended at the end:
$$Pred'_{x,y} = \mathrm{Concat}(Pred_{x,y} \odot Result_{x,y}, 0) = [C_0, C_1, C_2, C_3, C_4, 0, 0, 0, 0, 0]$$
However, although masking can remove irrelevant prediction values, the prediction vector still cannot directly correspond to each point in the original point cloud. This is because, during serialization, all valid voxels are pushed to the front of the sequence, so the prediction results cannot correctly correspond to the original voxels. However, since elevation is represented using indices, we can use the sorting indices obtained during serialization to perform an inverse operation for recovery.
For the sorting indices
$$Index_{x,y} = [0, 1, 4, 6, 8, 9, 2, 3, 5, 7]$$
obtained during sequence generation, we sort them once more in order to use them for deserialization:
$$Result^2_{x,y},\ Index^2_{x,y} = \mathrm{Sort}(Index_{x,y})$$
where
$$Index^2_{x,y} = [0, 1, 6, 7, 2, 8, 3, 9, 4, 5]$$
The deserialization operation is then
$$Pred''_{x,y} = \mathrm{TakeAlongWith}(Pred'_{x,y}, Index^2_{x,y}) = [C_0, C_1, 0, 0, C_2, 0, C_3, 0, C_4, 0]$$
Discarding the trailing 0 used for padding, we obtain the true prediction vector:
$$Pred''_{x,y} = [C_0, C_1, 0, 0, C_2, 0, C_3, 0, C_4]$$
Comparing this with the original voxel vector
$$V_{x,y} = [\mathbf{v}_0, \mathbf{v}_1, 0, 0, \mathbf{v}_2, 0, \mathbf{v}_3, 0, \mathbf{v}_4]$$
we can see that we have restored the correspondence between the category values in the prediction vector and the original voxels. Then, using the coordinates of points in the voxel matrix, we can obtain the correspondence between each point and its predicted value.
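The masking and deserialization steps can be reproduced with a few array operations; the sketch below follows the running example, with arbitrary integer class ids standing in for $C_0, \ldots, C_4$:

```python
import numpy as np

pred = np.array([3, 3, 5, 5, 2, 7, 5, 6, 2])        # raw decoder output for one column
result = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0])      # sorted mask without the end marker
index = np.array([0, 1, 4, 6, 8, 9, 2, 3, 5, 7])    # Index from the serialization step

masked = np.concatenate([pred * result, [0]])       # Pred' = [3 3 5 5 2 0 0 0 0 0]
index2 = np.argsort(index, kind="stable")           # Index^2 = [0 1 6 7 2 8 3 9 4 5]
restored = masked[index2][:-1]                      # drop the trailing padding 0
print(restored)                                     # [3 3 0 0 5 0 5 0 2] -> original voxel order
```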

3. Experimental Results and Analysis

3.1. Data Preprocessing and Augmentation

3.1.1. Sequence Loss

Spatial serialization is the foundation of the SeqConv-Net structure, and its approach is a crucial factor in determining the model’s capabilities. The results of serialization at different resolutions may have a significant impact on the final convergence. In the experiment, we selected the following two datasets as our benchmark datasets.
ISPRS Vaihingen 3D: The ISPRS Vaihingen 3D Dataset is a benchmark dataset for 3D point cloud classification, semantic segmentation, and urban scene understanding released by the International Society for Photogrammetry and Remote Sensing (ISPRS). It serves as a standard for evaluating algorithms in tasks like building extraction, vegetation detection, and terrain modeling.
DALES: A Large-Scale Aerial LiDAR Data Set for Semantic Segmentation [37]. This is a large-scale aerial LiDAR dataset with over 500 million manually labeled points, covering an area of ten square kilometers and eight object categories. DALES is the most extensive publicly available ALS dataset, with 400 times more points than other currently available annotated aerial point cloud datasets and six times the resolution.
To quantitatively measure the loss caused by serialization, we performed serialization on the DALES dataset at three common resolutions and obtained corresponding labels. We then used these labels to deserialize and restore the category of each point, comparing them with the true categories. The results are shown in Table 1.
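The serialization-loss measurement amounts to a per-class IoU comparison between the original labels and the labels restored after serialization and deserialization; the sketch below uses synthetic label arrays as stand-ins for the DALES annotations:

```python
import numpy as np

def per_class_iou(y_true, y_restored, num_classes):
    """IoU between the original and the serialize -> deserialize restored labels, per class."""
    ious = []
    for c in range(num_classes):
        inter = np.sum((y_true == c) & (y_restored == c))
        union = np.sum((y_true == c) | (y_restored == c))
        ious.append(inter / union if union > 0 else np.nan)
    return np.array(ious)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 8, size=100_000)            # stand-in for per-point DALES labels
y_restored = y_true.copy()
y_restored[rng.random(100_000) < 0.02] = 0           # simulate a small deserialization loss
ious = per_class_iou(y_true, y_restored, num_classes=8)
print(ious, np.nanmean(ious))                        # per-class IoU and the resulting mIOU bound
```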
From the table, it can be observed that the deserialization loss is closely related to the actual shape and position of the objects. Among small objects, power lines and fences, both linear objects, show significant differences in deserialization loss. At a one-meter resolution, the IOU difference between them can reach approximately 0.15. This is because power lines are usually suspended in the air and maintain a relatively independent spatial relationship with surrounding objects, allowing them to be separated during serialization. In contrast, fences, as low ground objects, are often close to ground vegetation and buildings, leading to higher loss during low-resolution serialization.
Among large objects, the IOU difference between buildings and vegetation after deserialization can reach 0.07. This difference is mainly due to the geometric features and spatial distribution patterns of these two classes. Buildings typically have regular geometric shapes (rectangular or regular polygons) and clear boundary features, with relatively independent spatial distributions. Vegetation, on the other hand, often has irregular contours and complex spatial interactions with roads and buildings, leading to some loss during serialization, though the loss is less severe compared to small objects like fences.
When a coarser voxel resolution is employed, this fixed loss means that even if the network achieved 100% prediction accuracy on the serialized data, precision would still be lost after deserialization. Quantitative loss analysis therefore reflects the sensitivity of different classes to resolution and helps determine the appropriate resolution for a given dataset. The quantitative resolution loss analysis leads to the following conclusions:
(1).
From the global accuracy perspective, even at the coarsest resolution of one meter, the mIOU after deserialization remains at 0.93.
(2).
The impact of serialization on object classification accuracy is within an acceptable range, and coarser resolutions lead to greater serialization loss.
(3).
The accuracy differences between different object classes are mainly due to their inherent geometric features and spatial distribution characteristics, rather than the serialization process itself.

3.1.2. Elevation Truncation

In the real world, terrain elevation is unbounded, but to meet the input requirements of deep learning models, a reasonable upper threshold must be set for elevation data. This study uses the truncation method to handle data exceeding the preset elevation range. Specifically, for any point exceeding the maximum elevation threshold (i.e., with an index greater than the threshold) after serialization, its elevation value is set to the maximum elevation value. During the prediction phase, the predicted labels for these truncated positions will be assigned to all points above this elevation. By truncating the sequence rather than the point cloud itself, the model can still predict all points even if it cannot input the entire elevation range.
Figure 6 qualitatively demonstrates the effect of the truncation method, all within an actual area of 100 m × 100 m. In the first example, some vegetation is truncated due to exceeding the preset 50 m elevation limit, but these trees are still classified as vegetation from the truncation layer to the canopy. In the second example, a high-rise building is truncated during serialization, but it is still classified as a building from the truncation layer to the top. The effectiveness of this method is based on an important geospatial phenomenon: objects significantly higher than their surroundings usually have vertical continuity.
To quantitatively evaluate the impact of truncation, we set a truncation elevation of 50 m at a 0.5 m resolution and compared the IOU accuracy with and without truncation in regions where truncation occurred. As shown in Table 2, the truncation operation has no impact on the IOU accuracy of the DALES dataset at the third decimal place. It can be concluded that the truncation method is both effective and efficient for data preprocessing. It not only meets the length requirements for embedding but also has a negligible impact on final classification accuracy.

3.2. Experiments

3.2.1. Network Structure

When implementing a model with the SeqConv architecture, we designed the first model based on a balance between efficiency and accuracy, as shown in Figure 7. The RNN encoder and decoder of this model both use 2-layer Gated Recurrent Units [38]. We choose GRU rather than LSTM [39] because LSTM's advantage mainly appears when processing long sequences; given the second characteristic of spatially ordered sequences (they are generally short), the lighter GRU is the better choice. For the CNN part, UNet [40], as a classic encoder–decoder structure, is widely used in image segmentation tasks. Its skip connections effectively preserve multi-scale spatial information, so we chose the well-tested UNet structure. This design leverages the lightweight characteristics of both GRU and UNet to maintain low computational complexity while using GRU's gating mechanism to more effectively retain key information in sequences.
The complete architecture of our model is detailed in Figure 7. The GRU–UNet consists of three key components: a GRU encoder, a UNet encoder, and a GRU decoder. In this design, the input point cloud is first serialized into spatial sequences. The GRU encoder then processes these variable-length sequences and encodes them into fixed-length hidden states. These hidden state vectors are subsequently reorganized according to their original spatial positions to form hidden state matrices.
Since we employ a 2-layer GRU structure, this process naturally generates two distinct hidden-state matrices. The UNet encoder then processes these matrices in parallel, performing spatial feature fusion and extraction. Following residual connections with the original hidden state matrices, the refined features are finally fed into the GRU decoder for initialization. The decoder then sequentially generates predictions for each valid position in the sequence.
Since spatially ordered sequences are usually short, the information contained in their hidden states is not as extensive as in NLP tasks, so the hidden state dimension does not need to be as large as hundreds or thousands. From an efficiency perspective, excessively long hidden state dimensions significantly increase computational burden. In practice, we chose hidden state dimensions between 16 and 48.

3.2.2. Elevation Embedding

Although spatial sequences are similar in form to sentences in NLP tasks (i.e., both are composed of a series of elements), there are fundamental differences in encoding methods. In NLP, pre-trained word vector models, such as Word2Vec [41] and GloVe [42], usually map semantically similar words to nearby positions in the vector space to achieve better convergence. However, the values in spatial sequences represent elevation information, not semantic information.
In the encoding of elevation sequences, using NLP word embedding methods directly can lead to serious problems. Word embedding methods assume that similar words have semantic similarity, so they map distant tokens in the vocabulary to nearby vectors. However, elevation values do not have semantic meanings and should not be assumed to have specific relationships. Therefore, if word embedding methods are directly used to encode elevations, the model will fail to capture the physical positional relationships between elevations and instead will force certain elevations to have specific relationships, which leads to severe overfitting.
Fortunately, there are mature solutions for encoding positional information. In transformer models, to utilize positional information in sequences, researchers propose the concept of positional encoding. The following method is used to encode elevations:
For a given input sequence $Seq \in \mathbb{R}^n$ of length $n$, its $d$-dimensional positional encoding [43] is represented as $P \in \mathbb{R}^{n \times d}$, where the elements at row $i$, columns $2j$ and $2j+1$, are
$$p_{i,2j} = \sin\left(\frac{i}{10000^{2j/d}}\right)$$
$$p_{i,2j+1} = \cos\left(\frac{i}{10000^{2j/d}}\right)$$
Positional encoding adds absolute or relative positional information to the input representation, allowing the model to perceive the positional relationships of each element in the sequence. This positional encoding method aligns well with the encoding needs of elevation sequences. Positional encoding generates fixed encoding vectors using sine and cosine functions, with frequency changes representing absolute positions. Since positional encoding is fixed, it does not introduce additional trainable parameters and can accelerate the training process.
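A compact sketch of this fixed sinusoidal elevation encoding is shown below; the table size and dimension are example values, and treating row 0 as the padding row is an assumption:

```python
import torch

def elevation_encoding(max_index, d):
    """Returns a (max_index, d) table; row i is the fixed encoding of elevation index i."""
    pos = torch.arange(max_index, dtype=torch.float32).unsqueeze(1)   # i
    two_j = torch.arange(0, d, 2, dtype=torch.float32)                # 2j
    div = torch.pow(10000.0, two_j / d)
    pe = torch.zeros(max_index, d)
    pe[:, 0::2] = torch.sin(pos / div)
    pe[:, 1::2] = torch.cos(pos / div)
    return pe

pe = elevation_encoding(max_index=101, d=32)         # e.g., a 50 m range at 0.5 m resolution
seq = torch.tensor([[1, 2, 5, 7, 9, 10, 0, 0, 0, 0]])
embedded = pe[seq]                                   # (1, 10, 32) fixed, non-trainable vectors
print(embedded.shape)
```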
In experiments, we tried two different elevation encoding methods: trainable random embeddings and fixed cosine encoding. When using trainable random embeddings, the model performed well on the training set but showed severe overfitting on the test set, with the mIOU difference between the training and test sets reaching up to 0.2–0.3. This phenomenon occurs because trainable random embeddings cause the network to attempt to memorize the distribution of elevation values through training data and assign non-existent relationships, completely ignoring the physical positional relationships of elevation values. Therefore, when encountering unseen elevation distributions in the test set, the prediction accuracy drops significantly.
When we changed the encoding method to fixed cosine encoding, the overfitting problem was significantly alleviated, improving the model’s accuracy on the test set by up to 0.1–0.2, while also speeding up convergence. The elevation encoding vectors generated by sine and cosine functions effectively represent the physical positional relationships between elevations without introducing additional trainable parameters. This forces the network to avoid memorizing elevation distributions by altering embedding vectors, ultimately preventing overfitting and enabling the model to better capture regular relationships between elevations, improving robustness and convergence speed. This strongly confirms the first characteristic of spatially ordered sequences: the elevation positional nature of spatially ordered sequences requires that the embedding vector represents elevation positions rather than semantics.

3.2.3. Implementation Details and Evaluation Metrics

In the experiments, we used the PyTorch (2.2.0) framework to implement our network. During data preprocessing, the only required step was dividing the dataset into 100 m × 100 m areas. For the training process, after serialization, each block was input as a 160 × 160 matrix; at a 0.5 m resolution, this corresponds to a ground area of 80 m × 80 m at a time. Thanks to the compactness of our network and the advantages of spatial sequences, even with a hidden state dimension of 32, our network only requires a 16 GB RTX 4070 Ti for training and a 4 GB GPU for inference.
Additionally, we used the Adam optimizer to minimize the objective function, with mIOU as our evaluation metric. The training lasted for 100 epochs, and data augmentation was turned off in the final few epochs. The loss function used a combination of CrossEntropyLoss and DiceLoss, and the UNet network had layers of (64,128,256,512,1024).
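A hedged sketch of such a combined loss is given below; the equal weighting of the two terms, the Dice formulation, and the use of label 0 as the ignored padding class are assumptions rather than the paper's exact settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CEDiceLoss(nn.Module):
    def __init__(self, num_classes, ignore_index=0, eps=1e-6):
        super().__init__()
        self.ce = nn.CrossEntropyLoss(ignore_index=ignore_index)
        self.num_classes, self.ignore_index, self.eps = num_classes, ignore_index, eps

    def forward(self, logits, target):
        # logits: (N, C) per-step class scores; target: (N,) labels, 0 = padding (assumed).
        ce = self.ce(logits, target)
        valid = target != self.ignore_index
        probs = F.softmax(logits[valid], dim=-1)
        onehot = F.one_hot(target[valid], self.num_classes).float()
        inter = (probs * onehot).sum(dim=0)
        union = probs.sum(dim=0) + onehot.sum(dim=0)
        dice = 1.0 - ((2 * inter + self.eps) / (union + self.eps)).mean()
        return ce + dice                               # equal weighting (assumption)

loss_fn = CEDiceLoss(num_classes=9)
logits, target = torch.randn(20, 9), torch.randint(1, 9, (20,))
print(loss_fn(logits, target))
```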

3.2.4. Results

To comprehensively evaluate the performance of various models in practical applications, we specifically tested their running speed on mobile devices with limited computational resources. These devices typically have low computational power and limited memory, placing higher demands on model efficiency. In the experiments, we selected a complete 500 m × 500 m test tile containing 12,219,779 points as a benchmark to measure the processing speed of different models under the same mobile hardware environment. Table 3 and Table 4 show the mIOU metrics and time consumption (in seconds) of various models. Figure 8 and Figure 9 show the classification results.
As can be observed from Table 3, GRU–UNet exhibits interesting variations in class-wise accuracy compared to other methods, owing to its unique architecture and spatial sequence characteristics. Specifically, GRU–UNet performs relatively worse on the Powerline and Facade categories but achieves better results on the Fence, Shrub, and Tree classes.
Through analysis, we speculate that these differences primarily stem from the following factors: the seq2seq structure demonstrates stronger robustness only when trained with sufficiently large sample sizes. However, the Powerline and Facade categories in the ISPRS dataset have limited spatial coverage and fewer points, leading to less effective model training and consequently inferior performance. In contrast, the Fence, Shrub, and Tree categories benefit from their distinct elevation-aligned distribution patterns and richer point counts, allowing the spatial sequences to effectively capture their structural features and enabling the model to learn more effectively.

3.3. Ablation Studies

3.3.1. Hidden State Dimension

The dimension of hidden states limits the amount of information the model can retain after encoding, thus having a decisive impact on the model’s learning ability. In NLP tasks, embedding lengths and hidden state dimensions are often long, reaching hundreds or even thousands. However, the SeqConv-Net structure does not require such long dimensions. For different hidden state dimensions, we evaluated their final convergence results on the DALES dataset. Table 5 and Figure 10 evaluate the final convergence results of the network under different hidden state dimensions.
It is clear that as the hidden state dimension increases, the model’s final convergence accuracy also improves. However, as the dimension continues to grow, the improvement in accuracy gradually diminishes and saturates. This phenomenon validates the second important characteristic of spatially ordered sequences: their short length. Since spatially ordered sequences are short and carry relatively limited information, excessively high hidden state dimensions are not necessary for efficient encoding and information retention.
Additionally, by analyzing the performance of different classes at various hidden state dimensions, we observe that for larger and more prominent objects (e.g., ground, buildings, and poles), the improvement in accuracy with increasing hidden state dimension is minimal. In contrast, for small target classes (e.g., cars, trucks, poles, and fences), the improvement in accuracy is significant. This indicates that for difficult-to-classify small targets, appropriately increasing the hidden state dimension can effectively enhance the model’s ability to learn fine-grained features, significantly improving the segmentation accuracy.

3.3.2. Spatial Sequence Resolution

The spatial resolution used during serialization has a decisive impact on loss: coarser voxel resolutions lead to more severe loss of small targets during serialization. Therefore, we also evaluated model training at three common resolution settings. The hidden state dimension was set to 32, and the other hyperparameters remained consistent (Table 6).
However, finer resolutions did not lead to the expected accuracy improvement. Further analysis shows that at finer resolutions the sequence length increases significantly, introducing many valid voxels with similar elevations. After continuous encoding, these elevations often produce significant information redundancy. Additionally, the two-layer GRU's learning capacity is limited, making it difficult to filter out the more valuable elevation information, which reduces model accuracy.
Coarser resolutions make small-target detection more difficult, thus reducing accuracy. However, this reduction is not without benefits. When the resolution is halved (i.e., the voxel size is doubled), the total number of sequences generated decreases to one-fourth, resulting in nearly a four-fold speed improvement. In practice, this means that if a slight sacrifice in small-target accuracy is acceptable, the processing speed can easily reach tens of millions of points per second.

3.3.3. Exploring Hidden Variables

In CNNs, each channel of a feature map characterizes a specific feature. Does the SeqConv-Net structure exhibit similar behavior? The answer is yes. Since the input sequences are arranged by elevation, it can be inferred that the 16 channels of the 16-dimensional hidden state should correlate strongly with the Digital Surface Model (DSM). Therefore, we generated DSM images from the point cloud data and compared them with visualizations of the 16 channels (Figure 11).
The results show that these channels exhibit visible similarities with the DSM and respond to changes in object categories. In particular, in the 4th channel, the ground regions of the DSM correspond strongly to the light-colored regions, indicating that the model successfully distinguished ground points from non-ground points during encoding and embedded this information into the channel. In the 1st and 2nd channels, the response for the fence category, whose elevation is close to the ground, is almost zero. This shows that the model not only learned to embed information in individual channels but also learned to decompose and reorganize information across different channels.
These strong correspondences with the DSM demonstrate that the GRU encoder and CNN’s processing of hidden variables essentially constitute a “high-dimensional projection” mechanism. This mechanism effectively compresses and reorganizes multi-dimensional information from the original 3D data through specific feature extraction methods, embedding different types of key feature information into different channels of the hidden variables.
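A comparison in the spirit of Figure 11 can be produced with a simple visualization routine along the following lines (an assumed workflow with illustrative function names, not the authors’ code): the per-column hidden vectors are arranged back onto the planar grid as a (channels, H, W) stack, and each channel is plotted next to a DSM raster computed as the maximum elevation per cell.

```python
import numpy as np
import matplotlib.pyplot as plt

def dsm_from_points(points: np.ndarray, res: float, H: int, W: int) -> np.ndarray:
    """Crude DSM: maximum z per planar cell (NaN where the cell is empty)."""
    dsm = np.full((H, W), np.nan)
    ij = np.floor(points[:, :2] / res).astype(int)
    for (ix, iy), z in zip(ij, points[:, 2]):
        if 0 <= ix < W and 0 <= iy < H:
            dsm[iy, ix] = z if np.isnan(dsm[iy, ix]) else max(dsm[iy, ix], z)
    return dsm

def plot_channels(hidden: np.ndarray, dsm: np.ndarray, path: str = "channels.png"):
    """hidden: (C, H, W) grid of hidden-state values, e.g., C = 16."""
    C = hidden.shape[0]
    fig, axes = plt.subplots(1, C + 1, figsize=(2 * (C + 1), 2))
    axes[0].imshow(dsm); axes[0].set_title("DSM"); axes[0].axis("off")
    for c in range(C):
        axes[c + 1].imshow(hidden[c]); axes[c + 1].set_title(f"ch {c + 1}"); axes[c + 1].axis("off")
    fig.savefig(path, bbox_inches="tight")
```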
Compared to general projection methods, the effectiveness of this high-level abstract information projection mechanism lies in the following aspects:
(1) The focus of different channels on specific object features indicates that the model does not simply encode elevations but learns spatial information perception through the CNN, which enables the model to learn feature separation methods.
(2) The complementary feature distribution across channels shows that the model achieves effective information decomposition and reorganization, not just simple encoding.
(3) The correspondence between features and the DSM confirms that the hidden variable mapping process has practical physical significance.
These findings validate the rationality of the SeqConv-Net structure from both theoretical and practical perspectives. Theoretically, the GRU encoder uses its gating mechanism to filter and memorize elevation information, extracting representative features. The CNN, through its convolutional kernels’ spatial perception ability, further captures the spatial correlations of local features. The synergy between the two achieves multi-level, multi-scale feature extraction of 3D data.
Practically, the SeqConv-Net structure can rapidly extract and map information from point clouds, achieving efficient encoding of 3D data, and it offers strong interpretability. By analyzing the feature response patterns of different channels, we can clearly observe how the model transforms raw data into feature representations with physical significance. This transformation retains the key spatial information of the original data while improving information usability through feature reorganization. The design of the SeqConv-Net structure is therefore not only an effective solution for point cloud segmentation but also an innovative exploration of 3D data processing theory.
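To summarize how these pieces fit together, the following highly simplified sketch (our own illustrative PyTorch code with assumed shapes and layer choices, not the published SeqConv-Net implementation) encodes each column’s elevation sequence with a GRU, refines the resulting hidden-state grid with a small stand-in CNN, combines the result through a residual connection, and decodes one class label per voxel. The real decoder additionally feeds back its previous prediction starting from a Start token, which is omitted here for brevity.

```python
import torch
import torch.nn as nn

class SeqConvSketch(nn.Module):
    def __init__(self, num_levels=64, num_classes=9, embed_dim=16, hidden_dim=32):
        super().__init__()
        # index 0 = padding voxel, 1..num_levels = elevation levels, num_levels + 1 = end voxel
        self.embed = nn.Embedding(num_levels + 2, embed_dim, padding_idx=0)
        self.encoder = nn.GRU(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.cnn = nn.Sequential(  # stand-in for a U-Net-style planar backbone
            nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1),
        )
        self.decoder = nn.GRU(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.classify = nn.Linear(hidden_dim, num_classes)

    def forward(self, seqs):
        # seqs: (H*W, L) elevation indices, one sequence per planar cell of an H x W grid
        HW, L = seqs.shape
        H = W = int(HW ** 0.5)
        _, h = self.encoder(self.embed(seqs))               # h: (2, H*W, hidden_dim)
        grid = h[-1].view(H, W, -1).permute(2, 0, 1)[None]  # (1, hidden_dim, H, W)
        planar = self.cnn(grid)[0].permute(1, 2, 0).reshape(HW, -1)
        h = torch.stack([h[0], h[1] + planar])              # residual combination on the top layer
        # Simplified decoder input: the real decoder autoregressively feeds back its
        # previous class prediction starting from a Start token (Figure 5).
        out, _ = self.decoder(self.embed(seqs), h)
        return self.classify(out)                           # (H*W, L, num_classes)

model = SeqConvSketch()
logits = model(torch.randint(1, 65, (16 * 16, 8)))          # 256 columns, sequences of length 8
print(logits.shape)                                          # torch.Size([256, 8, 9])
```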

4. Conclusions

This paper addresses the high computational resource consumption and insufficient real-time performance of large-scale airborne point cloud semantic segmentation by proposing an innovative lightweight architecture: SeqConv-Net. This architecture voxelizes point clouds into spatially ordered sequences and combines the strengths of RNNs and CNNs to achieve efficient 3D feature extraction and semantic segmentation. SeqConv-Net treats the points at the same planar location but different elevations as an ordered sequence, using an RNN to capture long-range dependencies in the vertical direction and a CNN to extract planar spatial features; residual connections and a decoder then produce end-to-end predictions. Experiments show that this architecture achieves 75.5% mIOU on the DALES dataset while significantly improving speed (5 s to process 12 million points), with strong interpretability.
The potential of spatial sequences and the SeqConv-Net architecture is immense. Spatial sequences introduce a novel approach to point cloud modeling, enabling rapid, structured mapping of point clouds in a very short time. This method provides a new paradigm for downstream point cloud applications, such as end-to-end DEM generation and point cloud reconstruction, by efficiently organizing unstructured 3D data into interpretable, ordered representations.
The SeqConv-Net architecture offers a new approach to point cloud semantic segmentation that balances speed and accuracy, opening up new possibilities for processing 3D point cloud data. It is also highly flexible: by adjusting structures and hyperparameters, it can adapt to different computational resource budgets. Moreover, the architecture is inherently modular, allowing different RNN encoders and CNN structures to be substituted, as sketched below. With sufficient computational resources, combining the long-range modeling capability of Transformers with mature semantic segmentation networks such as DeepLabV3+ [44] or ViT [45] could further improve segmentation accuracy in complex scenes.
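As a small illustration of this modularity (the names and structure are our own assumptions, not the released code), the RNN type and the planar backbone can be treated as interchangeable components; note that an LSTM carries a (hidden, cell) state pair, so the residual wiring around the backbone would need a minor adjustment compared with a GRU.

```python
import torch.nn as nn

RNNS = {"gru": nn.GRU, "lstm": nn.LSTM}

def build_components(rnn: str = "gru", embed_dim: int = 16, hidden_dim: int = 32):
    """Assemble swappable encoder / planar backbone / decoder modules."""
    rnn_cls = RNNS[rnn]
    encoder = rnn_cls(embed_dim, hidden_dim, num_layers=2, batch_first=True)
    backbone = nn.Sequential(  # could be replaced by a U-Net- or DeepLabV3+-style head
        nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1), nn.ReLU(),
        nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1),
    )
    decoder = rnn_cls(embed_dim, hidden_dim, num_layers=2, batch_first=True)
    return encoder, backbone, decoder

# e.g., a heavier configuration for higher accuracy on capable hardware
encoder, backbone, decoder = build_components("lstm", hidden_dim=48)
```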
In the future, we will focus on unlocking the potential of spatial sequences and the SeqConv-Net architecture by extending the concept of spatial sequences to other point cloud processing applications, such as end-to-end DEM generation and point cloud reconstruction. The current SeqConv-Net architecture cannot process RGB color information from point clouds; we will therefore investigate methods to integrate color, elevation, and other attributes, while incorporating the latest modules and techniques to maximize its capabilities.

Author Contributions

Conceptualization, B.G. and J.W.; methodology, B.G.; software, B.G.; validation, J.W., J.X. and B.G.; formal analysis, B.G.; investigation, B.G.; resources, B.G.; data curation, B.G.; writing—original draft preparation, B.G.; writing—review and editing, C.Y. and H.M.; visualization, B.G.; supervision, C.Y. and H.M.; project administration, C.Y.; funding acquisition, C.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (Grant Nos. 2024YFC3810802 and 2018YFB0504500), the National Natural Science Foundation of China (Grant No. 41101417), and the National High Resolution Earth Observations Foundation (Grant No. 11-H37B02-9001-19/22).

Data Availability Statement

This research did not create a new dataset.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Suykens, J.A.K.; Vandewalle, J. Least squares support vector machine classifiers. Neural Process. Lett. 1999, 9, 293–300. [Google Scholar] [CrossRef]
  2. Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J.C.; Sheridan, R.P.; Feuston, B.P. Random forest: A classification and regression tool for compound classification and QSAR modeling. J. Chem. Inf. Comput. Sci. 2003, 43, 1947–1958. [Google Scholar] [CrossRef] [PubMed]
  3. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  4. Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S.E.; Bronstein, M.M.; Solomon, J.M. Dynamic graph CNN for learning on point clouds. ACM Trans. Graph. 2019, 38, 1–12. [Google Scholar] [CrossRef]
  5. Thomas, H.; Qi, C.R.; Deschaud, J.-E.; Marcotegui, B.; Goulette, F.; Guibas, L.J. KPConv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6411–6420. [Google Scholar]
  6. Robert, D.; Raguet, H.; Landrieu, L. Efficient 3D semantic segmentation with superpoint transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 17195–17204. [Google Scholar]
  7. Pauly, M.; Keiser, R.; Gross, M. Multi-scale feature extraction on point-sampled surfaces. Comput. Graph. Forum 2003, 22, 281–289. [Google Scholar] [CrossRef]
  8. Rabbani, T.; Van Den Heuvel, F.; Vosselman, G. Segmentation of point clouds using smoothness constraint. Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci. 2006, 36, 248–253. [Google Scholar]
  9. Golovinskiy, A.; Funkhouser, T. Min-cut based segmentation of point clouds. In Proceedings of the IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), Kyoto, Japan, 27 September–4 October 2009; pp. 39–46. [Google Scholar]
  10. Schnabel, R.; Wahl, R.; Klein, R. Efficient RANSAC for point-cloud shape detection. Comput. Graph. Forum 2007, 26, 214–226. [Google Scholar] [CrossRef]
  11. Papon, J.; Abramov, A.; Schoeler, M.; Worgotter, F. Voxel cloud connectivity segmentation-supervoxels for point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013; pp. 2027–2034. [Google Scholar]
  12. Rusu, R.B.; Marton, Z.C.; Blodow, N.; Dolha, M.E.; Beetz, M. Functional object mapping of kitchen environments. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Nice, France, 22–26 September 2008; pp. 3525–3532. [Google Scholar]
  13. Guo, Y.; Bennamoun, M.; Sohel, F.; Lu, M.; Wan, J. 3D object recognition in cluttered scenes with local surface features: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 2270–2287. [Google Scholar] [CrossRef]
  14. Lai, K.; Bo, L.; Ren, X.; Fox, D. A large-scale hierarchical multi-view RGB-D object dataset. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, 9–13 May 2011; pp. 1817–1824. [Google Scholar]
  15. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  16. Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. RandLA-Net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11108–11117. [Google Scholar]
  17. Li, Y.; Bu, R.; Sun, M.; Wu, W.; Di, X.; Chen, B. PointCNN: Convolution on X-transformed points. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar] [CrossRef]
  18. Boulch, A. ConvPoint: Continuous convolutions for point cloud processing. Comput. Graph. 2020, 88, 24–34. [Google Scholar] [CrossRef]
  19. Wu, W.; Qi, Z.; Fuxin, L. PointConv: Deep convolutional networks on 3D point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9621–9630. [Google Scholar]
  20. Wu, B.; Wan, A.; Yue, X.; Keutzer, K. SqueezeSeg: Convolutional neural nets with recurrent CRF for real-time road-object segmentation from 3D LiDAR point cloud. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 1887–1893. [Google Scholar]
  21. Milioto, A.; Vizzo, I.; Behley, J.; Stachniss, C. RangeNet++: Fast and accurate LiDAR semantic segmentation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 4213–4220. [Google Scholar]
  22. Su, H.; Maji, S.; Kalogerakis, E.; Learned-Miller, E. Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 945–953. [Google Scholar]
  23. Le, T.; Duan, Y. PointGrid: A deep network for 3D shape understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 9204–9214. [Google Scholar]
  24. Meng, H.Y.; Gao, L.; Lai, Y.K.; Manocha, D. VV-Net: Voxel VAE net with group convolutions for point cloud segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8500–8508. [Google Scholar]
  25. Graham, B.; Engelcke, M.; Van Der Maaten, L. 3D semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 9224–9232. [Google Scholar]
  26. Liu, Z.; Tang, H.; Lin, Y.; Han, S. Point-voxel CNN for efficient 3D deep learning. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar] [CrossRef]
  27. Zhao, H.; Jiang, L.; Jia, J.; Torr, P.H.; Koltun, V. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 16259–16268. [Google Scholar]
  28. Guo, M.H.; Cai, J.X.; Liu, Z.N.; Mu, T.J.; Martin, R.R.; Hu, S.M. PCT: Point cloud transformer. Comput. Vis. Media 2021, 7, 187–199. [Google Scholar] [CrossRef]
  29. Akwensi, P.H.; Wang, R.; Guo, B. Preformer: A memory-efficient transformer for point cloud semantic segmentation. Int. J. Appl. Earth Obs. Geoinf. 2024, 128, 103730. [Google Scholar] [CrossRef]
  30. Liang, D.; Zhou, X.; Xu, W.; Zhu, X.; Zou, Z.; Ye, X.; Tan, X.; Bai, X. PointMamba: A simple state space model for point cloud analysis. arXiv 2024, arXiv:2402.10739. [Google Scholar]
  31. Zhang, T.; Yuan, H.; Qi, L.; Zhang, J.; Zhou, Q.; Ji, S.; Yan, S.; Li, X. Point Cloud Mamba: Point cloud learning via state space model. arXiv 2024, arXiv:2403.00762. [Google Scholar] [CrossRef]
  32. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar]
  33. Elman, J.L. Finding structure in time. Cogn. Sci. 1990, 14, 179–211. [Google Scholar] [CrossRef]
  34. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  35. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  36. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. Adv. Neural Inf. Process. Syst. 2014, 27. [Google Scholar] [CrossRef]
  37. Varney, N.; Asari, V.K.; Graehling, Q. DALES: A large-scale aerial LiDAR data set for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 186–187. [Google Scholar]
  38. Cho, K.; van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar]
  39. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  40. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI); Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  41. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
  42. Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
  43. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  44. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  45. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Figure 1. Architecture overview.
Figure 2. Serialization and assignment of a voxel matrix of shape (4,3,4). For clarity, invalid voxels are not drawn, and each row is separated. Since index 0 is used as the elevation for padding voxels, the bottom voxel has a value of 1 instead of 0, and the end voxel’s index is the maximum index in the Z-direction plus one (here, 4 + 1).
Figure 3. Encoding process of a spatial sequence (1,2,5,0). Each elevation is embedded into an n-dimensional vector, and the RNN encodes these elevation vectors sequentially, ultimately storing all information in a fixed-length m-dimensional hidden state vector. Thus, the RNN achieves selective retention of elevation information and maps it into a structured m-dimensional vector.
Figure 4. Planar information extraction using CNNs.
Figure 5. Prediction process of the SeqConv-Net structure. The decoder is initialized with the hidden state output by the encoder, and the first prediction uses a special Start token. Subsequently, the hidden state is updated and the current category (C0, C1, C2, C3) is predicted; then, the previous output category is connected with the hidden state of the last layer as the next input for continuous prediction.
Figure 6. Impact of truncation on model prediction. The truncation surface is marked with a red ellipse. Due to the vertical continuity of objects, points on the truncation surface often have the same category as the truncation point. Therefore, the truncation method allows the network to predict all voxels in the sequence even if it cannot input the entire sequence.
Figure 7. GRU–UNet segmentation model. The encoder and decoder use two layers of GRU.
Figure 8. Segmentation results of the GRU–UNet network on the DALES dataset.
Figure 9. Segmentation results of the GRU–UNet network on the ISPRS Vaihingen 3D dataset.
Figure 10. Segmentation results with different hidden state dimensions. As the hidden state dimension increases, the segmentation effect for small objects gradually improves. The enlarged red circle indicated by the arrow displays the differences in details of the categories predicted by the model.
Figure 11. Sixteen-channel images composed of sixteen-dimensional hidden variables. It can be observed that the channels exhibit some degree of correspondence with the DSM.
Table 1. IOU precision after label deserialization.
Resolution | Ground | Vegetation | Buildings | Cars | Fences | Powerlines | Trucks | Poles | mIOU
0.25 | 0.991 | 0.974 | 0.997 | 0.995 | 0.976 | 0.995 | 0.995 | 0.994 | 0.990
0.5 | 0.978 | 0.936 | 0.994 | 0.983 | 0.926 | 0.986 | 0.983 | 0.981 | 0.971
1 | 0.956 | 0.886 | 0.985 | 0.934 | 0.815 | 0.962 | 0.951 | 0.952 | 0.930
Table 2. IOU Accuracy After Deserialization With and Without Truncation.
Truncation | Ground | Vegetation | Cars | Trucks | Powerlines | Fences | Poles | Buildings | mIOU
NO | 0.915 | 0.943 | 0.927 | 0.977 | 0.955 | 0.828 | 0.973 | 0.990 | 0.938
YES | 0.915 | 0.943 | 0.927 | 0.977 | 0.955 | 0.828 | 0.973 | 0.990 | 0.938
Table 3. mIOU Accuracy and Speed of Different Models on ISPRS Vaihingen 3D.
Method | Powerline | Low Vegetation | Impervious Surfaces | Car | Fence | Roof | Facade | Shrub | Tree | mIOU | Speed
PointNet++ | 57.9 | 79.6 | 90.6 | 66.1 | 31.5 | 91.6 | 54.3 | 41.6 | 77.0 | 65.58 | 24.6 s
KPConv | 63.1 | 82.3 | 91.4 | 72.5 | 25.2 | 94.4 | 60.3 | 44.9 | 81.2 | 68.37 | 6.27 s
DGCNN | 44.6 | 71.2 | 81.8 | 42.0 | 11.8 | 93.8 | 64.3 | 46.4 | 81.7 | 59.73 | 6.81 s
PointCNN | 61.5 | 82.7 | 91.8 | 75.8 | 35.9 | 92.7 | 57.8 | 49.1 | 78.1 | 69.49 | -
ConvPoint | 58.8 | 80.9 | 90.7 | 65.9 | 34.3 | 90.3 | 52.4 | 39.1 | 77.0 | 65.49 | -
RandLA-Net | 68.8 | 82.1 | 91.3 | 76.6 | 43.8 | 91.1 | 61.9 | 45.2 | 77.4 | 70.91 | 3.2 s
Ours | 38.5 | 82.7 | 94.1 | 78.9 | 67.5 | 89.3 | 46.3 | 57.5 | 85.1 | 71.10 | 2.1 s
Table 4. mIOU Accuracy and Speed of Different Models on DALES Dataset.
Method | Input Points | mIOU | Speed
PointNet++ | 8192 | 0.683 | 726.6 s
KPConv | 8192 | 0.726 | 186.9 s
DGCNN | 8192 | 0.665 | 203.2 s
PointCNN | 8192 | 0.584 | -
ConvPoint | 8192 | 0.674 | -
PointTransformer | 8192 | 0.749 | 698.7 s
PReFormer | 8192 | 0.709 | -
PointMamba | 8192 | 0.733 | 90.7 s
PointCloudMamba | 8192 | 0.747 | 115.6 s
Ours | - | 0.755 | 8.01 s
Table 5. IOU of Different Classes for SeqConv-Net with Different Hidden State Dimensions.
Hidden Dim | Ground | Vegetation | Buildings | Cars | Trucks | Poles | Powerlines | Fences | mIOU | Speed
16 | 0.942 | 0.839 | 0.930 | 0.655 | 0.396 | 0.679 | 0.607 | 0.406 | 0.682 | 4.05 s
24 | 0.950 | 0.851 | 0.931 | 0.685 | 0.412 | 0.681 | 0.658 | 0.453 | 0.702 | 4.26 s
32 | 0.953 | 0.870 | 0.932 | 0.732 | 0.453 | 0.704 | 0.717 | 0.495 | 0.732 | 5.56 s
40 | 0.952 | 0.896 | 0.932 | 0.763 | 0.522 | 0.701 | 0.704 | 0.501 | 0.747 | 6.02 s
48 | 0.964 | 0.919 | 0.934 | 0.751 | 0.532 | 0.706 | 0.724 | 0.511 | 0.755 | 8.01 s
Table 6. IOU of SeqConv-Net with Different Serialization Resolutions.
Resolution | Ground | Vegetation | Buildings | Cars | Trucks | Poles | Powerlines | Fences | mIOU | Speed
1 | 0.943 | 0.839 | 0.930 | 0.512 | 0.214 | 0.682 | 0.651 | 0.210 | 0.622 | 1.51 s
0.5 | 0.953 | 0.870 | 0.932 | 0.732 | 0.453 | 0.704 | 0.717 | 0.495 | 0.732 | 8.01 s
0.25 | 0.955 | 0.865 | 0.925 | 0.713 | 0.421 | 0.692 | 0.653 | 0.512 | 0.717 | 30.9 s