3.1. CMPNet
Cylinder3D serves as the baseline architecture for this study. It projects irregular LiDAR point clouds into a cylindrical coordinate system to mitigate the uneven point density problem and employs voxel-based feature encoding and 3D sparse convolution to perform semantic segmentation. However, the original Cylinder3D framework faces three main limitations: (1) its voxel feature encoder, built upon standard MLPs, struggles to capture complex local spatial relationships; (2) it lacks an effective mechanism for multi-scale feature extraction and long-range dependency modeling; and (3) its normalization scheme exhibits limited robustness when handling sparse or highly unstructured outdoor scenes. To address these issues, we propose an improved point cloud semantic segmentation framework named CMPNet, which systematically enhances the feature representation capability and robustness of the Cylinder3D architecture. As shown in Figure 1, CMPNet retains the cylindrical partitioning strategy of Cylinder3D while integrating three functional modules: PointMamba [31], KAT [32], and MS3DAM.
Specifically, the input raw point cloud is first processed by the KAT encoder, which replaces the conventional Voxel Feature Encoder (VFE) in Cylinder3D. The KAT module introduces the GR-KAN mechanism to replace standard multilayer perceptrons (MLPs) with group rational functions, allowing it to capture intricate local spatial dependencies that conventional VFEs fail to model effectively. GR-KAN employs rational functions instead of B-splines, making it more compatible with GPU architectures and improving computational efficiency. In addition, GR-KAN reduces the number of model parameters and computational cost through parameter sharing, while its variance-preserving initialization enhances training stability. Within the KAT module, layer normalization is applied to stabilize the input distribution and maintain consistent feature scaling, thereby preserving the spatial structure of point cloud features during training.
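As a concrete illustration of the rational functions mentioned above, a minimal sketch of a safe rational activation of the kind GR-KAN uses is given below; the polynomial orders (m = 5, n = 4) and the absolute-value denominator follow the Padé-style formulation and are assumptions rather than details stated in this section.

```python
import torch
import torch.nn as nn

class RationalActivation(nn.Module):
    """Safe rational function F(x) = P(x) / (1 + |Q(x)|).

    A minimal sketch of the learnable activation GR-KAN uses in place of
    B-splines; the orders m=5, n=4 are assumptions, not taken from the paper.
    """
    def __init__(self, m: int = 5, n: int = 4):
        super().__init__()
        self.a = nn.Parameter(torch.randn(m + 1) * 0.1)  # numerator coeffs
        self.b = nn.Parameter(torch.randn(n) * 0.1)      # denominator coeffs

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Evaluate P(x) = a0 + a1*x + ... + am*x^m via Horner's rule.
        p = torch.zeros_like(x)
        for coeff in reversed(self.a):
            p = p * x + coeff
        # Evaluate Q(x) = b1*x + ... + bn*x^n; the absolute value keeps the
        # denominator >= 1, so the function has no poles and stays stable.
        q = torch.zeros_like(x)
        for coeff in reversed(self.b):
            q = (q + coeff) * x
        return p / (1.0 + q.abs())
```

Because the forward pass reduces to two short polynomial evaluations, this form maps well onto GPU hardware, which is the efficiency argument made above.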
After feature encoding, the processed features are redistributed through cylindrical partitioning to achieve a balanced spatial distribution. Next, an asymmetric 3D sparse convolutional network is employed to extract local geometric structures. Within this network, we introduce the MS3DAM, a newly designed component that applies multi-scale dilated convolutions and attention fusion to capture both fine-grained local details and global contextual cues, compensating for the limited multi-scale modeling capability of the original Cylinder3D.
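For reference, the cylindrical partitioning step amounts to mapping Cartesian coordinates to (ρ, θ, z) voxel indices, as in the following minimal sketch; the grid size and coordinate ranges are illustrative placeholders rather than the values used in our experiments.

```python
import math
import torch

def cylindrical_voxel_indices(xyz: torch.Tensor,
                              grid_size=(480, 360, 32),
                              rho_range=(0.0, 50.0),
                              z_range=(-4.0, 2.0)) -> torch.Tensor:
    """Map (N, 3) Cartesian points to integer (rho, theta, z) voxel indices.

    A minimal sketch of Cylinder3D-style partitioning; grid_size and the
    coordinate ranges are illustrative assumptions.
    """
    rho = torch.sqrt(xyz[:, 0] ** 2 + xyz[:, 1] ** 2)
    theta = torch.atan2(xyz[:, 1], xyz[:, 0])            # in [-pi, pi]
    z = xyz[:, 2]

    # Normalize each coordinate to [0, 1), then scale to the voxel grid.
    r_idx = (rho - rho_range[0]) / (rho_range[1] - rho_range[0])
    t_idx = (theta + math.pi) / (2 * math.pi)
    z_idx = (z - z_range[0]) / (z_range[1] - z_range[0])
    norm = torch.stack([r_idx, t_idx, z_idx], dim=1).clamp(0.0, 1.0 - 1e-6)
    grid = torch.tensor(grid_size, dtype=norm.dtype)
    return (norm * grid).long()                          # (N, 3) indices
```

Because the angular coordinate θ partitions the scene into wedges that widen with distance, far-away regions with few points receive proportionally larger voxels, which is how this scheme balances point density.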
Finally, to enhance global context perception and computational efficiency, the PointMamba module is incorporated. Its sequence-based feature aggregation enables long-range dependency modeling with linear complexity, significantly reducing the computational cost in large-scale outdoor traffic scenes.
Through these targeted enhancements, CMPNet effectively addresses the architectural limitations of Cylinder3D, achieving superior segmentation accuracy, robustness, and efficiency.
3.2. MS3DAM
In point cloud semantic segmentation tasks, objects often appear in various sizes and shapes. Conventional convolutional layers employ a fixed receptive field, which substantially restricts their ability to capture multi-scale features. To address this limitation, we design MS3DAM, which applies multiple dilation rates in parallel and integrates channel and spatial attention mechanisms to enhance feature representation [33].
First, MS3DAM enlarges the receptive field. Traditional convolutional networks expand the receptive field by stacking multiple convolutional layers, which significantly increases computational cost and may dilute important information. In contrast, dilated convolution, particularly with varying dilation rates, effectively enlarges the receptive field without reducing resolution, thereby enabling the model to capture broader contextual information.
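Concretely, a kernel of size $k$ with dilation rate $d$ covers an effective extent of
$$k_{\text{eff}} = (k - 1)\,d + 1,$$
so a $3 \times 3 \times 3$ kernel with dilation rates 3, 12, and 39 (the rates adopted below) spans 7, 25, and 79 voxels per axis, respectively, while still using only $3^3 = 27$ weights.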
Second, MS3DAM enhances the capture of multi-scale information. In tasks such as semantic segmentation and object detection, where objects differ greatly in size and shape, parallelizing convolutions with multiple dilation rates allows the network to simultaneously extract features at different scales. This improves the model’s adaptability and recognition of features across diverse spatial ranges.
Global average pooling [34] effectively captures the global features of an entire image, which is crucial for understanding the overall scene structure. Integrating this global information with local features enables the network to better interpret the relationship between individual image regions and the whole, thereby facilitating more accurate predictions.
In convolutional neural networks, different channels typically represent distinct features or patterns. However, when processing complex images, some channels contain more useful information than others, and conventional convolutional operations cannot dynamically adjust their importance. Similarly, spatial locations vary in significance; for instance, regions containing target objects are more critical than background areas. Traditional convolutions, however, treat all spatial locations equally and fail to emphasize key regions. To address these limitations, channel attention is introduced to evaluate and adjust the importance of each channel, thereby strengthening informative feature channels and suppressing less relevant ones. In addition, spatial attention is applied to highlight crucial regions within the image, enabling the model to focus on local areas that exert the greatest influence on the task outcome.
Finally, to enhance overall model performance, multi-scale features must be effectively integrated with the attention mechanism, as using either technique alone is insufficient. By concatenating the output features of multi-scale dilated convolutions along the channel dimension and applying both channel and spatial attention mechanisms for weighting, the module fuses features from multiple perspectives. Combining different dilated convolution layers with global features ensures that the network captures both local details and global context. This design produces more comprehensive network outputs and improves the model’s generalization across diverse scenarios. Additionally, feature dimensionality reduction via the final 1 × 1 × 1 convolution reduces computational cost and integrates information from different branches, generating a more effective feature representation for the downstream task.
The MS3DAM module, as illustrated in Figure 2, is designed to enhance feature representation by integrating multiple dilation rates and combining channel and spatial attention mechanisms. This architectural design effectively addresses the requirement to capture both fine-grained local information and broad contextual information within an image, which is essential for tackling complex visual tasks, including semantic segmentation in outdoor traffic scenarios and object detection applications.
This module represents an integrated combination of multi-scale dilated convolutions with channel and spatial attention mechanisms. It performs feature extraction across multiple scales through five parallel convolutional branches, each configured with distinct dilation rates to expand the receptive field and capture spatial information at varying ranges. The first branch employs a 1 × 1 × 1 convolutional kernel to directly extract features without altering spatial resolution. The second branch uses a 3 × 3 × 3 convolutional kernel with a dilation rate of 3, moderately enlarging the receptive field. The third branch applies a 3 × 3 × 3 convolutional kernel with a dilation rate of 12, further expanding the receptive field to capture a broader range of contextual information. The fourth branch achieves the widest receptive field using a 3 × 3 × 3 convolutional kernel with a dilation rate of 39. Finally, the fifth branch incorporates a global average pooling operation to extract global contextual features, thereby enhancing the model’s comprehension of the overall scene layout.
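A minimal sketch of the five parallel branches is given below; the channel counts and the use of dense `nn.Conv3d` layers (rather than the sparse convolutions of the actual network) are simplifying assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleBranches(nn.Module):
    """Five parallel branches of MS3DAM (dense-convolution sketch)."""
    def __init__(self, in_ch: int, branch_ch: int):
        super().__init__()
        self.b1 = nn.Conv3d(in_ch, branch_ch, kernel_size=1)
        # 3x3x3 kernels; padding = dilation keeps the spatial size unchanged.
        self.b2 = nn.Conv3d(in_ch, branch_ch, 3, padding=3, dilation=3)
        self.b3 = nn.Conv3d(in_ch, branch_ch, 3, padding=12, dilation=12)
        self.b4 = nn.Conv3d(in_ch, branch_ch, 3, padding=39, dilation=39)
        # Global context branch: pool to 1x1x1, project, then upsample back.
        self.b5 = nn.Sequential(nn.AdaptiveAvgPool3d(1),
                                nn.Conv3d(in_ch, branch_ch, kernel_size=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        d, h, w = x.shape[2:]
        g = nn.functional.interpolate(self.b5(x), size=(d, h, w),
                                      mode='nearest')
        # Concatenate the five scales along the channel dimension.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x),
                          self.b4(x), g], dim=1)  # (B, 5*branch_ch, D, H, W)
```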
Subsequently, the multi-scale features extracted from the five distinct branches are concatenated along the channel dimension to form a unified feature map [35]. This combined feature map is then refined simultaneously by the channel attention mechanism and the spatial attention mechanism. Within the channel attention module, global average pooling is applied to the combined feature map to obtain channel-wise global features. The importance weights for each channel are then learned through two fully connected layers employing ReLU and Sigmoid activation functions, respectively. These weights are finally multiplied by the original feature map in a channel-wise manner to implement channel weighting. In the spatial attention module, the combined feature map is globally pooled along the channel dimension to generate a spatial feature map. The importance weights for each spatial location are subsequently learned using a 1 × 1 × 1 convolution followed by a Sigmoid activation function. These weights are applied to the original feature map in an element-wise manner to achieve spatial weighting.
Finally, the feature maps produced by the channel attention and spatial attention branches are fused through an element-wise addition operation to produce the enhanced feature maps. At this stage, each attention-calibrated feature map is first combined with the original merged feature map via element-wise addition, so that the relevant features are integrated and reinforced in a residual fashion. The resulting enhanced feature maps are subsequently passed through a 1 × 1 × 1 convolutional layer for dimensionality reduction and feature integration, generating the final output feature representations.
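One plausible reading of the attention and fusion stage described above is sketched below; the reduction ratio in the channel-attention MLP and the exact residual wiring are assumptions where the text leaves room for interpretation.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Channel + spatial attention over the concatenated multi-scale map."""
    def __init__(self, ch: int, out_ch: int, reduction: int = 4):
        super().__init__()
        # Channel attention: GAP -> FC(ReLU) -> FC(Sigmoid) -> channel weights.
        self.ca = nn.Sequential(
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(ch, ch // reduction), nn.ReLU(inplace=True),
            nn.Linear(ch // reduction, ch), nn.Sigmoid())
        # Spatial attention: channel-wise pooling -> 1x1x1 conv -> Sigmoid.
        self.sa = nn.Sequential(nn.Conv3d(1, 1, kernel_size=1), nn.Sigmoid())
        # Final 1x1x1 convolution for dimensionality reduction / integration.
        self.proj = nn.Conv3d(ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c = x.shape[:2]
        x_ca = x * self.ca(x).view(b, c, 1, 1, 1)        # channel weighting
        x_sa = x * self.sa(x.mean(dim=1, keepdim=True))  # spatial weighting
        # Each calibrated map is added back to the merged map (residual),
        # then the two branches are fused by element-wise addition.
        fused = (x + x_ca) + (x + x_sa)
        return self.proj(fused)
```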
By combining multi-scale dilated convolutions with channel and spatial attention, MS3DAM effectively captures both local details and global contextual information. This design addresses the limitations of Cylinder3D in multi-scale feature extraction and global perception, providing more comprehensive and robust features for semantic segmentation in complex outdoor traffic scenarios.
3.3. PointMamba
Cylinder3D employs cylindrical partitioning and 3D sparse convolutions to extract local geometric features from point clouds. However, it lacks an efficient mechanism for modeling long-range dependencies across large-scale point cloud sequences, which limits its capability to capture global contextual information in complex outdoor scenarios. To address this limitation, we introduce PointMamba, which offers linear computational complexity while maintaining the ability to model long sequences, as illustrated in Figure 3. Mamba [36] is based on a state space model (SSM) [37] that incorporates a selectivity mechanism, enabling the model to capture contextual information relevant to the inputs. The formulation of the SSM is defined as follows:
$$h'(t) = \mathbf{A}h(t) + \mathbf{B}x(t), \qquad y(t) = \mathbf{C}h(t),$$
where $\mathbf{A}$ denotes the state matrix, and $\mathbf{B}$ and $\mathbf{C}$ represent the mapping parameters. For discrete sequences, both $\bar{\mathbf{A}}$ and $\bar{\mathbf{B}}$ are determined using the zero-order hold (ZOH) rule with step size $\Delta$:
$$\bar{\mathbf{A}} = \exp(\Delta\mathbf{A}), \qquad \bar{\mathbf{B}} = (\Delta\mathbf{A})^{-1}\bigl(\exp(\Delta\mathbf{A}) - \mathbf{I}\bigr)\Delta\mathbf{B}.$$
Building on the SSM, the Selective SSM (S6) framework makes the parameters $\Delta$, $\mathbf{B}$, and $\mathbf{C}$ functions of the inputs, effectively converting the SSM into a time-varying model and allowing the network to selectively focus on or filter out information. However, due to the inherent irregularity and sparsity of point clouds, Mamba’s unidirectional modeling encounters difficulties when processing point cloud data. These challenges can be effectively mitigated through the reordering strategy implemented in PointMamba.
When processing 3D point cloud data, PointMamba employs two types of space-filling curves, Hilbert [38] and Trans-Hilbert, to scan the key points, thereby preserving spatial locality effectively. Based on this ordering, K-nearest neighbors (KNN) is applied to generate point patches, which are subsequently fed into the token embedding layer to produce serialized point tokens. To handle the inherently disordered structure of point clouds efficiently, a reordering strategy is applied to the point tokens, providing a more geometrically coherent scanning order. The reordered tokens are then input into an encoder composed of stacked Mamba modules [39], where feature extraction is performed via selective state-space modeling. This approach enhances the model’s global modeling capability and significantly reduces computational resource consumption while retaining linear computational complexity.
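The patch-construction and reordering steps can be sketched as follows; `hilbert_key` is a hypothetical stand-in for the Hilbert / Trans-Hilbert scan (a true space-filling-curve index would replace the simple lexicographic key used here), and `embed` is an assumed token-embedding module.

```python
import torch

def hilbert_key(centers: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in for the Hilbert / Trans-Hilbert serialization:
    # a lexicographic key on quantized coordinates. A real implementation
    # would compute genuine space-filling-curve indices.
    q = (centers * 100).long()
    return q[:, 0] * 1_000_000 + q[:, 1] * 1_000 + q[:, 2]

def build_point_tokens(points: torch.Tensor, centers: torch.Tensor,
                       k: int, embed: torch.nn.Module) -> torch.Tensor:
    """Group KNN patches around key points, embed them, and reorder.

    points: (N, 3), centers: (M, 3); `embed` maps (M, k, 3) -> (M, D).
    A minimal sketch of PointMamba's serialization, not the exact pipeline.
    """
    # KNN patch construction around each key point.
    idx = torch.cdist(centers, points).topk(k, largest=False).indices
    patches = points[idx]                       # (M, k, 3)
    tokens = embed(patches)                     # (M, D) serialized tokens
    # Reorder tokens along the (stand-in) space-filling curve so that
    # spatially adjacent patches become neighbors in the 1D sequence.
    order = hilbert_key(centers).argsort()
    return tokens[order]                        # fed to stacked Mamba blocks
```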
Within CMPNet, PointMamba works synergistically with the KAT encoder and the MS3DAM module to complement local geometric feature extraction with global sequence modeling, forming an integrated and improved point cloud semantic segmentation framework based on Cylinder3D.
3.4. KAT
Traditional Voxel Feature Encoders (VFEs) employ standard multilayer perceptrons (MLPs) in point cloud segmentation tasks, but they often struggle to capture the intricate spatial relationships inherent within point clouds, thereby limiting the model’s feature representation capability. Furthermore, conventional normalization methods exhibit limited robustness when applied to sparse point clouds. To address these limitations, we introduce the KAT module and replace the traditional MLP with the GR-KAN [40] module, as illustrated in Figure 4. GR-KAN utilizes rational functions in place of the B-splines [41] employed in the conventional KAN [42], which aligns more effectively with GPU architectures and enhances both computational efficiency and compatibility. In addition, GR-KAN [43] substantially reduces the number of model parameters and computational requirements through parameter sharing. Concurrently, variance-preserving initialization within GR-KAN further improves the stability of model training. The formula of GR-KAN is defined as follows:
$$\phi_{i,j}(x_i) = w_{i,j}\,F_{\lceil i/d_g \rceil}(x_i),$$
where $i$ denotes the index of the input channel. With $g$ groups, each group contains $d_g = d_{\text{in}}/g$ channels, and $\lceil i/d_g \rceil$ represents the group index. After transformation, the above equation can be rewritten as:
$$\text{GR-KAN}(\mathbf{x}) = \mathbf{W}F(\mathbf{x}),$$
where $F(\mathbf{x})$ applies each group’s shared rational function to its channels and $\mathbf{W}$ collects the weights $w_{i,j}$ into a linear layer. In this form, the parameters can be shared across input channels. Consequently, the rational function can be directly applied to the input vectors, that is, to each grouped edge.
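A minimal sketch of this grouped formulation is given below, reusing the `RationalActivation` module sketched in Section 3.1 (assumed to be in scope); the group count is illustrative, and the layer normalization reflects the design described next.

```python
import torch
import torch.nn as nn

class GRKANLayer(nn.Module):
    """GR-KAN layer sketch: group-shared rational functions + a linear map.

    Each of the g groups shares one rational function F_g across its
    d_in / g channels; the linear layer W realizes the weights w_{i,j}.
    Assumes the RationalActivation sketch from Section 3.1 is available.
    """
    def __init__(self, d_in: int, d_out: int, groups: int = 8):
        super().__init__()
        assert d_in % groups == 0
        self.groups = groups
        self.norm = nn.LayerNorm(d_in)
        self.rationals = nn.ModuleList(RationalActivation()
                                       for _ in range(groups))
        self.linear = nn.Linear(d_in, d_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.norm(x)   # layer normalization stabilizes the inputs
        # Apply each group's shared rational function to its channel slice,
        # i.e. F_{ceil(i/d_g)}(x_i) for every input channel i.
        d_g = x.shape[-1] // self.groups
        parts = [f(x[..., g * d_g:(g + 1) * d_g])
                 for g, f in enumerate(self.rationals)]
        return self.linear(torch.cat(parts, dim=-1))   # W F(x)
```

Sharing one rational function per group, rather than learning one per edge as in the original KAN, is what cuts the parameter count and lets the activation run as a single vectorized operation.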
Within the KAT module, layer normalization is applied to stabilize the input distribution of each layer, preserving the spatial structure of the point cloud features while maintaining consistent feature scaling during training. Compared to standard MLPs, KAT can effectively capture local spatial dependencies in point clouds, thereby enhancing feature representation and reducing computational cost.