1. Introduction
With the rapid development of intelligent transportation, autonomous driving technology has emerged as a key solution to address traditional transportation issues such as traffic accidents and congestion. In autonomous driving, semantic segmentation serves as a vital component of the perception module, enabling pixel-level scene understanding with finer granularity than object detection. By accurately identifying categories such as roads, vehicles, and pedestrians, semantic segmentation provides essential information for path planning and obstacle avoidance systems.
As the core data modality for autonomous driving perception, 3D point clouds comprehensively represent objects' geometric shapes and spatial structures. They effectively mitigate interference from viewpoint variations and illumination changes in complex scenes, an advantage unattainable with traditional 2D image data. Using LiDAR-acquired point clouds, autonomous vehicles can construct high-precision 3D environmental models for safety-critical decision-making [1].
The rapid advancement of deep learning has driven continuous innovation in point cloud semantic segmentation methods, achieving remarkable improvements in segmentation accuracy. Current point cloud processing approaches can be categorized into three types: point-based methods, projection-based methods, and voxel-based methods. Point-based methods directly process raw point cloud data without complex data transformations. These methods have evolved from the pioneering PointNet [2] to its enhanced version PointNet++ [3]. RandLA-Net [4] significantly improves computational efficiency through random sampling strategies, while SCF-Net [5] strengthens feature extraction capabilities by integrating shape-aware contextual feature fusion modules. Projection-based methods map 3D point clouds onto 2D planes, process them using mature 2D convolutional networks, and then remap the results back to 3D space. SqueezeSeg [6] pioneered spherical projection for point cloud segmentation. CENet [7] further enhances model performance through auxiliary segmentation head optimization, and RangeViT [8] achieves state-of-the-art results by innovatively introducing Transformer architectures into point cloud semantic segmentation. Voxel-based methods process point clouds by converting them into regular 3D grids. VoxNet [9] first applied 3D convolutional networks to voxelized data. OctNet [10] optimizes computational resources and memory allocation through octree structures, while Cylinder3D [11] employs cylindrical partitioning to address the uneven distribution issue in point clouds.
Despite significant progress in point cloud semantic segmentation, current methods still face two main limitations: First, they predominantly rely on convolutional operations for feature dimension interactions during feature extraction, lacking the effective modeling of cross-dimensional relationships. Second, the use of fixed-receptive-field convolutions in feature extraction stages limits model adaptability to multi-scale targets, thereby constraining further improvements in segmentation performance.
Recent research has made substantial progress in addressing these challenges. The multi-scale sparse convolution neural network [12] leverages multi-scale feature fusion modules and attentive channel feature filtering mechanisms to effectively capture richer features at different scales, enhancing the recognition of objects with similar structures but varying dimensions. The semantic-based local aggregation and multi-scale global pyramid approach [13] augments local features through semantic similarity while extracting discriminative multi-scale global features from multi-resolution point cloud scenes, providing novel perspectives for feature extraction in point cloud segmentation. Additionally, the Confluent Triple-Flow Network [14] demonstrates the effectiveness of multi-scale feature extraction through its Residual Atrous Spatial Pyramid Module, which enlarges receptive fields to capture contextual information at various scales, a strategy particularly effective for datasets with heterogeneous distribution characteristics, similar to the uneven density patterns in point cloud data.
Research on adversarial samples is of great significance in the field of point cloud semantic segmentation, especially considering the application of the relevant models in safety-critical scenarios such as autonomous driving. Adversarial samples are carefully designed perturbations that, when added to the original data, can mislead segmentation models into making incorrect predictions, posing a serious threat to model performance and reliability. It is therefore necessary to explore the impact of adversarial samples on point cloud semantic segmentation models. A method [15] has been proposed to generate untargeted adversarial samples in an unrestricted black-box environment using a Generative Adversarial Network (GAN). This method manipulates the latent space of images by leveraging StyleGAN and conducts experiments on the CelebA-HQ dataset. The results show that after 3000 iterations, the attack success rate reaches 100% and the generated samples exhibit a low degree of distortion, fully demonstrating the potential of GANs for generating adversarial samples.
Recent advances in computer vision have demonstrated the effectiveness of Transformer architectures in capturing long-range dependencies through self-attention mechanisms [16]. Unlike traditional convolutional networks that rely on layer stacking to expand receptive fields, Transformers directly establish global feature connections, allowing spatially distant points to interact with each other. Additionally, approaches like Atrous Spatial Pyramid Pooling (ASPP) [17] have shown promise in addressing scale variations by employing parallel convolutions with multiple dilation rates, effectively capturing multi-scale contextual information while maintaining spatial resolution.
To address the limitations of existing point cloud semantic segmentation methods in modeling long-range dependencies and adapting to multi-scale targets, we propose MT-CylNet (Multi-scale Transformer enhanced Cylindrical Network) based on multi-scale feature fusion. The primary contributions of this study are as follows:
- (1) A Transformer-enhanced voxel-based network is proposed to capture global contextual information, enabling comprehensive feature interaction across different spatial locations in point cloud data, thereby improving the understanding of complex scenes in autonomous driving environments.
- (2) A multi-scale feature extraction architecture is designed, including an improved DDCM module with tri-directional dilated convolutions, which effectively expands the receptive field while maintaining parameter efficiency, enhancing the model's adaptability to objects of varying scales and heterogeneous density distributions.
Experimental results on the SemanticKITTI dataset demonstrate that our proposed MT-CylNet achieves an mIoU of 66.8%, outperforming the baseline Cylinder3D by 2.4% and showing significant improvements over other methods. This confirms that by combining Transformer’s global modeling capability with the adaptability of multi-scale feature extraction, the accuracy of point cloud semantic segmentation can be significantly improved, providing a more reliable scene understanding for autonomous driving perception.
The remainder of this paper is organized as follows: Section 1 has introduced the background, significance, and challenges of point cloud semantic segmentation. Section 2 elaborates on the network architecture design of MT-CylNet, including the data processing and feature extraction procedures, the encoder–decoder structure, and the loss function design. Section 3 presents the experimental evaluation, validating the effectiveness of the proposed method through comparisons with existing approaches and ablation studies. Section 4 discusses the overall findings, and Section 5 summarizes the work and outlines directions for future research.
2. Materials and Methods
Traditional point cloud semantic segmentation methods face the dual challenges of an insufficient global context modeling capability and poor multi-scale target adaptability when processing large-scale scenes. To address these issues, we developed a new network architecture aimed at improving segmentation accuracy through enhanced global feature interaction and multi-scale feature extraction, particularly for recognizing targets of different scales in complex autonomous driving scenarios.
2.1. Overview of MT-CylNet Architecture
To address the limited accuracy of traditional point cloud semantic segmentation methods when processing large-scale LiDAR data, we propose MT-CylNet. Our approach strategically integrates a Transformer [16] module at the high-level feature extraction stage, significantly enhancing the model's capacity to capture long-range dependencies and global contextual information. Simultaneously, we incorporate an ASPP [17] module for multi-scale feature representation, effectively expanding the receptive field while enriching feature extraction capabilities. The architecture is further optimized through a joint point-level and voxel-level loss function, which produces finer segmentation boundaries while boosting overall model performance.
As illustrated in Figure 1, the MT-CylNet architecture comprises four main components: (1) a data processing and feature extraction module that converts raw point clouds into structured voxel representations; (2) an encoder–decoder-based feature learning network that incorporates a Transformer module in high-dimensional feature layers for global context modeling; (3) an improved DDCM module that enhances the feature extraction capability through tri-directional multi-scale convolutions; and (4) a dual-level loss optimization strategy that combines point-level and voxel-level supervision signals to improve segmentation accuracy.
From a design perspective, the data processing module employs cylindrical projection to address the uneven distribution of point clouds; the encoder–decoder structure extracts features at different scales and preserves detail information through skip connections; the Transformer module breaks through the local receptive field limitations of traditional convolutions, enabling global information exchange; the multi-scale DDCM module enhances the model's adaptability to targets of different scales by combining convolutions with different dilation rates; and finally, the multi-level loss function design enables the model to simultaneously focus on macroscopic structure and microscopic details.
The specific processing flow of MT-CylNet is as follows: First, the point cloud is preprocessed using cylindrical projection. Then, multi-scale feature extraction is performed through an encoder–decoder module, with a Transformer module integrated into the high-dimensional feature layer. The DDCM module is improved by incorporating dilated convolutions with different rates into its tri-directional convolution paths. Finally, the model is optimized using both point-level and voxel-level losses to achieve accurate segmentation results.
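To make this flow concrete, the following minimal structural sketch composes the four stages as described above; all class names and interfaces (e.g., the voxelizer returning a point-to-voxel index map) are hypothetical placeholders, not the authors' implementation.

```python
# Structural sketch only: module names and interfaces are hypothetical.
import torch.nn as nn

class MTCylNetSketch(nn.Module):
    def __init__(self, voxelizer, encoder_decoder, ddcm, point_head):
        super().__init__()
        self.voxelizer = voxelizer              # cylindrical projection + per-point MLP features
        self.encoder_decoder = encoder_decoder  # 3D U-Net with a Transformer at the 512-d bottleneck
        self.ddcm = ddcm                        # improved DDCM: tri-directional dilated convolutions
        self.point_head = point_head            # three-layer MLP for point-level refinement

    def forward(self, points):
        voxel_feats, point2voxel = self.voxelizer(points)   # structured voxel representation
        decoded = self.encoder_decoder(voxel_feats)          # multi-scale features with global context
        fused = self.ddcm(decoded)                           # tri-directional multi-scale fusion
        voxel_logits = fused                                 # supervised by the voxel-level loss
        point_logits = self.point_head(fused, points, point2voxel)  # supervised by the point-level loss
        return voxel_logits, point_logits
```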
2.2. Data Processing and Feature Extraction
The raw point cloud data are initially represented in a Cartesian coordinate system [18]. Because LiDAR sampling density decreases with distance, the data distribution becomes severely uneven in the traditional Cartesian coordinate system. We therefore adopt a coordinate transformation approach that addresses the challenge of uneven point cloud sampling density.
Figure 2 illustrates the complete processing flow of the point cloud data: the raw point cloud is initially represented in the Cartesian coordinate system (x, y, z) and then mapped to a cylindrical coordinate system through coordinate transformation, which helps address the distance-dependent sampling density issue in LiDAR point clouds. Next, a cylindrical split coordinate transformation is performed, dividing the space into regular grids to handle the uneven point cloud distribution. In each coordinate system, feature extraction and enhancement are conducted through multi-layer perceptrons (MLPs) [19], ultimately generating 32-dimensional feature representations. This three-stage combination of coordinate transformation and feature extraction significantly improves the data distribution and feature representation, laying the foundation for subsequent semantic segmentation tasks.
The first transformation maps the Cartesian coordinates $(x, y, z)$ to the cylindrical coordinate system $(\rho, \theta, z)$, while the second transformation maps the cylindrical coordinates to the cylindrical split (grid index) coordinate system $(i_\rho, i_\theta, i_z)$. The first transformation is shown in Equation (1):

$$\rho = \sqrt{x^{2} + y^{2}}, \quad \theta = \arctan\!\left(\frac{y}{x}\right), \quad z = z \tag{1}$$

The second conversion is shown in Equation (2):

$$i_\rho = \left\lfloor \frac{\mathrm{clip}(\rho,\rho_{\min},\rho_{\max}) - \rho_{\min}}{\Delta\rho} \right\rfloor, \quad i_\theta = \left\lfloor \frac{\theta - \theta_{\min}}{\Delta\theta} \right\rfloor, \quad i_z = \left\lfloor \frac{\mathrm{clip}(z,z_{\min},z_{\max}) - z_{\min}}{\Delta z} \right\rfloor \tag{2}$$

where $\mathrm{clip}(v, v_{\min}, v_{\max})$ denotes limiting the value $v$ within the range $[v_{\min}, v_{\max}]$, and $\lfloor \cdot \rfloor$ denotes rounding down. The grid resolution is controlled by the following parameters, as shown in Equation (3):

$$\Delta\rho = \frac{\rho_{\max} - \rho_{\min}}{N_\rho}, \quad \Delta\theta = \frac{\theta_{\max} - \theta_{\min}}{N_\theta}, \quad \Delta z = \frac{z_{\max} - z_{\min}}{N_z} \tag{3}$$

where $N_\rho$, $N_\theta$, and $N_z$ denote the number of grids in the radial, angular, and height directions, respectively. This division further ensures the uniform distribution of the point cloud data across the voxels, providing a more reliable basis for subsequent processing.
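As a concrete illustration of Equations (1)–(3), the sketch below converts Cartesian points to cylindrical coordinates and then to voxel indices. The grid size and spatial bounds follow the experimental configuration in Section 3.1 (with the angular range expressed in radians), and the function and variable names are illustrative rather than taken from any released code.

```python
import numpy as np

def cart_to_cylindrical(xyz):
    """Equation (1): (x, y, z) -> (rho, theta, z)."""
    rho = np.sqrt(xyz[:, 0] ** 2 + xyz[:, 1] ** 2)
    theta = np.arctan2(xyz[:, 1], xyz[:, 0])            # angle in radians, in (-pi, pi]
    return np.stack([rho, theta, xyz[:, 2]], axis=1)

def cylindrical_to_grid(cyl, grid_size=(480, 360, 32),
                        min_bound=(0.0, -np.pi, -4.0), max_bound=(50.0, np.pi, 2.0)):
    """Equations (2)-(3): clip each coordinate to its range, divide by the cell size
    (range / number of cells), and round down to obtain voxel indices."""
    cyl = np.clip(cyl, min_bound, max_bound)
    cell = (np.array(max_bound) - np.array(min_bound)) / np.array(grid_size)  # Equation (3)
    idx = np.floor((cyl - np.array(min_bound)) / cell).astype(np.int64)       # Equation (2)
    return np.clip(idx, 0, np.array(grid_size) - 1)      # keep boundary points inside the grid

# Example: assign each LiDAR point of a toy scan to a cylindrical voxel.
points = np.random.rand(1000, 3) * [50.0, 50.0, 6.0] - [0.0, 25.0, 4.0]
voxel_idx = cylindrical_to_grid(cart_to_cylindrical(points))
```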
2.3. Encoder and Decoder
Traditional feature extraction networks must capture both local detail features and global context information when dealing with 3D spatial data such as point clouds. To address this, a new network architecture is designed in which a Transformer module is introduced at the high-level feature stage, as shown in Figure 3.
This diagram illustrates the Transformer-enhanced U-Net architecture designed for multi-scale feature fusion. The encoder pathway (left side) progressively reduces spatial resolution while increasing the feature channels from 32 to 512 dimensions through a series of ResBlocks. At the bottleneck, the 512-dimensional feature layer incorporates a Transformer module in which each voxel representation is treated as an independent token, enabling global information exchange across different spatial locations. The decoder pathway (right side) gradually recovers spatial resolution while reducing the feature dimensions back to 32 through UpBlocks. Skip connections (horizontal paths) link corresponding encoder and decoder layers, preserving detailed geometric information throughout the feature extraction process. This architecture combines the advantages of U-Net's multi-scale feature representation with the Transformer's global modeling capability, creating a comprehensive feature extraction framework that captures both fine-grained details and long-range dependencies in point cloud data. The self-attention operation used in the Transformer module is defined in Equation (4):
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{4}$$

where $Q$, $K$, and $V$ represent the query matrix, key matrix, and value matrix, respectively, and $\sqrt{d_k}$ represents the scaling factor.
The introduction of the Transformer module at the high-level features is strategically designed based on three key considerations. First, traditional 3D convolutional networks rely mainly on local receptive fields for feature extraction; by feeding each voxel representation into the Transformer as an independent token, the model can learn the global associations among different voxels. Second, through the Transformer's self-attention mechanism, the features of any two voxels can interact directly, overcoming the restriction of traditional 3D convolution operations to local receptive fields. Finally, the 512-dimensional feature layer obtained after multiple downsampling steps is the level with the highest information density in the whole network; establishing global feature associations at the level where semantic information is most concentrated maximizes the effect of global modeling while maintaining computational efficiency. This specific integration approach differs fundamentally from previous works that either apply attention mechanisms uniformly across all feature levels or use them as stand-alone components, and it represents a carefully optimized solution for point cloud semantic segmentation.
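The idea of treating each non-empty bottleneck voxel as a token can be sketched with a standard scaled dot-product attention layer, as in Equation (4). The sketch below uses PyTorch's generic nn.MultiheadAttention with a residual connection and layer normalization as assumptions; it is not the authors' exact module.

```python
import torch
import torch.nn as nn

class VoxelBottleneckAttention(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, voxel_feats):
        # voxel_feats: (B, N, 512) -- one 512-d feature per non-empty bottleneck voxel (token)
        attended, _ = self.attn(voxel_feats, voxel_feats, voxel_feats)  # every voxel attends to every other
        return self.norm(voxel_feats + attended)                        # residual keeps the local features

# Example: 256 non-empty bottleneck voxels from one scan.
tokens = torch.randn(1, 256, 512)
out = VoxelBottleneckAttention()(tokens)   # -> (1, 256, 512)
```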
2.4. Improved DDCM
Aiming at the unique structural properties of point cloud data in different spatial dimensions, we propose an improved DDCM (Dense Dimensional Context Module), as shown in Figure 4. The module uses three kinds of convolutional kernels (3 × 1 × 1, 1 × 3 × 1, 1 × 1 × 3) for feature extraction in the length, width, and depth dimensions, respectively. The length direction branch focuses on the continuity of horizontal spatial features, which is suitable for extracting extended structures such as roads and buildings; the width direction branch captures the hierarchical information of vertical space, which helps to differentiate targets of different heights; and the depth direction branch handles distance-related features, which is particularly important for recognizing near and far objects.
To enhance the module's ability to adapt to targets of different scales, each direction branch contains two convolution operations with different dilation rates (dilation = 1, 2). The standard convolution (dilation = 1) is suitable for extracting fine features of near objects, while the dilated convolution (dilation = 2) helps to capture the overall structure of distant objects. This dual-dilation design effectively expands the receptive field without significantly increasing the number of parameters. After activation, the features from the three directions are summed and fused to form a comprehensive understanding of the scene.
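A dense-tensor sketch of this tri-directional, dual-dilation design is given below. How the two dilation rates within a branch are combined (here applied in parallel and summed) and the choice of activation (ReLU here) are assumptions, since the text only specifies the kernel shapes, the two dilation rates, and the sum fusion; the real network would operate on sparse voxel tensors.

```python
import torch
import torch.nn as nn

def _dir_conv(channels, kernel, dilation):
    # padding chosen per kernel axis so the spatial size is preserved for each dilation rate
    pad = tuple(dilation * (k // 2) for k in kernel)
    return nn.Conv3d(channels, channels, kernel, padding=pad, dilation=dilation, bias=False)

class ImprovedDDCMSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        kernels = [(3, 1, 1), (1, 3, 1), (1, 1, 3)]        # length, width, depth branches
        self.branches = nn.ModuleList(
            [nn.ModuleList([_dir_conv(channels, k, 1), _dir_conv(channels, k, 2)]) for k in kernels]
        )
        self.act = nn.ReLU()                               # specific activation is an assumption

    def forward(self, x):
        # x: (B, C, D, H, W) voxel features; every branch output keeps the same shape
        fused = 0
        for branch in self.branches:
            for conv in branch:
                fused = fused + self.act(conv(x))          # activate, then sum-fuse all directions
        return fused

out = ImprovedDDCMSketch(64)(torch.randn(2, 64, 16, 16, 8))   # -> (2, 64, 16, 16, 8)
```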
As shown in Figure 4, the fused features are supervised by the voxel-level classification loss and simultaneously fed into a point-level classification network consisting of a three-layer MLP. This design enables the module to learn feature representations at both the voxel and point cloud levels. At the voxel level, the three-way convolution effectively captures spatial structural information in different directions; at the point cloud level, the MLP network further extracts and refines the features, and finally outputs point-level semantic segmentation results. This multi-scale, multi-level feature learning strategy enables the model to better handle the geometric details and semantic information in the point cloud data and improves segmentation accuracy.
For input features, each dilated convolution branch is represented as shown in Equation (5):
$$y(p) = \sum_{k \in \mathcal{K}} w(k) \cdot x(p + r \cdot k) \tag{5}$$

where $p$ represents the current position, $x$ and $y$ are the input and output features, $w(k)$ is the kernel weight at offset $k$, $k$ is the convolution kernel offset, $r$ is the dilation rate, and $\mathcal{K}$ is the set of sampling positions of the convolution kernel.
By introducing the design of the double expansion rate and multi-directional convolution, the improved DDCM module effectively addresses the key challenges in point cloud data processing. Traditional methods often adopt a unified convolution strategy when dealing with features of different spatial dimensions, which makes it difficult to accurately capture the structural properties of the point cloud in all directions; meanwhile, the design of the fixed sensing field also limits the flexibility of feature extraction when facing the difference in scale between near and far targets. The improved scheme proposed realizes the orthogonal decomposition of spatial features through multi-directional convolution and provides an adaptive range of receptive fields by using the double expansion rate design, which significantly improves the model’s ability to understand complex scenes.
2.5. Loss Function Design
In terms of loss function design, a two-level supervision strategy is used. The total loss function contains two parts, a voxel-level term and a point-level term, as shown in Equation (6):

$$L_{\mathrm{total}} = L_{\mathrm{voxel}} + L_{\mathrm{point}} \tag{6}$$

The voxel-level loss $L_{\mathrm{voxel}}$ uses a combination of two loss functions: weighted cross-entropy loss is used to improve the accuracy of single-point classification, and Lovász-Softmax loss is used to optimize the intersection-over-union metric in semantic segmentation. For the point-level loss $L_{\mathrm{point}}$, only the weighted cross-entropy loss is used for supervision, as shown in Equations (7) and (8):

$$L_{\mathrm{voxel}} = L_{\mathrm{wce}} + L_{\mathrm{Lov\acute{a}sz}} \tag{7}$$

$$L_{\mathrm{point}} = L_{\mathrm{wce}}(\tilde{y}), \quad \tilde{y} = (1 - \varepsilon)\, y + \frac{\varepsilon}{C} \tag{8}$$

where $\tilde{y}$ is the label value after applying label smoothing, $\varepsilon$ is the label smoothing parameter, and $C$ is the number of classes. The point-level loss is directly applied to the point cloud feature refinement process, allowing the model to capture local geometric details more accurately.
During the training process, the total loss is the combination of the voxel-level and point-level losses. This dual-level loss function design enables the model to focus on both the macroscopic spatial structure and microscopic local details, which significantly improves segmentation performance, especially in category boundary regions. The optimizer and initial learning rate used for training are detailed in Section 3.1.
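The dual-level supervision of Equations (6)–(8) can be sketched as follows. Here, lovasz_softmax is assumed to come from the publicly available reference implementation (lovasz_losses.py) rather than being defined in this paper, and the class weights, ignore index, and smoothing value are illustrative placeholders.

```python
import torch.nn.functional as F
# from lovasz_losses import lovasz_softmax   # reference implementation, assumed available

def voxel_loss(voxel_logits, voxel_labels, class_weights, ignore_index=0):
    """Equation (7): weighted cross-entropy + Lovász-Softmax on voxel predictions."""
    wce = F.cross_entropy(voxel_logits, voxel_labels, weight=class_weights,
                          ignore_index=ignore_index)
    lovasz = lovasz_softmax(F.softmax(voxel_logits, dim=1), voxel_labels,
                            ignore=ignore_index)
    return wce + lovasz

def point_loss(point_logits, point_labels, class_weights, label_smoothing=0.1):
    """Equation (8): weighted cross-entropy with smoothed labels (epsilon is illustrative)."""
    return F.cross_entropy(point_logits, point_labels, weight=class_weights,
                           label_smoothing=label_smoothing)

def total_loss(voxel_logits, voxel_labels, point_logits, point_labels, class_weights):
    """Equation (6): sum of the voxel-level and point-level losses."""
    return (voxel_loss(voxel_logits, voxel_labels, class_weights)
            + point_loss(point_logits, point_labels, class_weights))
```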
3. Results
3.1. Experimental Setup
Based on the need to evaluate our proposed method's performance in real-world autonomous driving scenarios, we selected the SemanticKITTI dataset as our experimental platform. SemanticKITTI offers diverse application environments spanning urban roads, residential areas, and highways, recording driving scenarios under various weather and lighting conditions. It also captures temporal changes in both static structures and dynamic objects, providing a foundation for research on segmenting moving objects. The dataset supports multiple tasks beyond basic semantic segmentation, including instance segmentation, panoptic segmentation, and semantic scene completion, allowing researchers to comprehensively evaluate algorithm performance.
The SemanticKITTI dataset is a representative benchmark for point cloud semantic segmentation in autonomous driving, containing over 43,000 point cloud scans across 22 sequences (00–21), with more than 4 billion annotated points from German urban road scenarios. Sequences 00–07 and 09–10 are designated as the training set (19,130 frames), sequence 08 serves as the validation set (4071 frames), and sequences 11–21 form the test set (20,351 frames). This division ensures the test environment remains independent from the training data, accurately measuring the algorithm's generalization capability.
Table 1 details the specific experimental equipment and environment settings.
In data processing, we adopted a cylindrical coordinate grid representation with a grid size of 480 × 360 × 32 and a spatial range of [0, 50] m in the radial direction, [−180°, 180°] in the angular direction, and [−4, 2] m in height. Data augmentation employed the GlobalAugment_LP strategy to enhance the model's generalization capability.
In the model configuration, we used a modified Cylinder3D network as the base model, with an input feature dimension of nine and the point cloud refinement module enabled. The dropout probability was set to zero and label smoothing was disabled.
We chose stochastic gradient descent (SGD) as the optimizer, with a learning rate of 0.015, a weight decay of 0.0001, a momentum of 0.9, and Nesterov momentum enabled. The training process adopted a linear warm-up and cosine decay learning rate scheduling strategy, with a warm-up period of two epochs and a total of thirty-six training epochs. The batch size was set to two samples per GPU, with gradient norm clipping applied (maximum norm of ten).
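The optimizer and schedule described above can be reproduced with standard PyTorch utilities, as sketched below; model, train_loader, and compute_loss are assumed to exist elsewhere, and the warm-up start factor is an illustrative choice not stated in the paper.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

optimizer = SGD(model.parameters(), lr=0.015, momentum=0.9,
                weight_decay=1e-4, nesterov=True)

warmup_epochs, total_epochs = 2, 36
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs),  # linear warm-up
        CosineAnnealingLR(optimizer, T_max=total_epochs - warmup_epochs),   # cosine decay
    ],
    milestones=[warmup_epochs],
)

for epoch in range(total_epochs):
    for batch in train_loader:
        loss = compute_loss(model, batch)                                   # placeholder loss call
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)   # clip gradient norm
        optimizer.step()
    scheduler.step()                                                        # scheduler stepped per epoch
```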
3.2. Evaluation Metrics
Our research method primarily aims to address two key issues in point cloud semantic segmentation: the lack of effective correlation modeling between feature dimensions in traditional methods, and the limitation of fixed receptive fields in adapting to multi-scale targets. To assess how well these challenges are addressed, we evaluate model performance in two aspects: segmentation accuracy and computational efficiency. Since semantic segmentation is essentially a pixel/point-level classification problem that must account for class imbalance, we primarily adopt the Intersection over Union (IoU) and mean Intersection over Union (mIoU) as evaluation metrics, which comprehensively measure segmentation accuracy, particularly in class-imbalanced scenarios.
The Intersection over Union is used as the evaluation metric, as shown in Equation (9):

$$\mathrm{IoU}_i = \frac{TP_i}{TP_i + FP_i + FN_i}, \quad \mathrm{mIoU} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{IoU}_i \tag{9}$$

where $TP_i$ denotes the points of class $i$ that are predicted correctly (true positives); $FP_i$ denotes the points of other classes incorrectly predicted as class $i$ (false positives); $FN_i$ denotes the points of class $i$ incorrectly predicted as other classes (false negatives); and $N$ denotes the number of semantic categories.
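For reference, the per-class IoU and mIoU of Equation (9) can be computed from an accumulated confusion matrix, as in the sketch below; the toy predictions and the 20-class setting are illustrative only.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    idx = target * num_classes + pred                      # flatten (true, predicted) pairs
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def per_class_iou(conf):
    tp = np.diag(conf)                                     # correctly predicted points per class
    fp = conf.sum(axis=0) - tp                             # predicted as class i but actually another class
    fn = conf.sum(axis=1) - tp                             # class i points predicted as other classes
    return tp / np.maximum(tp + fp + fn, 1)                # Equation (9), guarded against empty classes

pred = np.random.randint(0, 20, 100000)                    # toy predictions over 20 classes
target = np.random.randint(0, 20, 100000)
iou = per_class_iou(confusion_matrix(pred, target, 20))
print("mIoU:", iou.mean())
```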
3.3. Evaluation Results and Comparison
The results of testing the proposed method on SemanticKITTI [20] are shown in Figure 5 and Table 2. As shown in Figure 5, MT-CylNet exhibits good classification ability for both dynamic targets (e.g., motorcycles, bicycle riders) and static structures (e.g., roads, vegetation), demonstrating its generalization performance in complex scenarios.
This visualization presents the quantitative analysis of MT-CylNet’s performance across different semantic categories. The graph displays the IoU scores for various classes, including vehicles (car, bicycle, motorcycle), pedestrians, road infrastructure, vegetation, and buildings. As shown in the results, MT-CylNet achieves state-of-the-art performance with an average mIoU of 66.8%, significantly outperforming baseline methods, particularly in recognizing dynamic objects like motorcycles (79.2%) and other vehicles (68.1%). The model demonstrates balanced performance across both foreground objects and background elements, showcasing its robustness in complex autonomous driving scenarios. Specifically, our method achieves substantial improvements in challenging categories: motorcycles (79.2% vs. 69.7% in Cylinder3D), bicycles (56.8% vs. 47.2%), and pedestrians (77.4% vs. 75.6%).
As shown in Table 2, MT-CylNet achieves a mean Intersection over Union (mIoU) of 66.8% on the SemanticKITTI dataset. MT-CylNet shows improvements of 46.8% and 16.6% over the MLP-based algorithms PointNet++ [3] and RandLA-Net [4], which process point clouds directly; improvements of 31% and 7.4% over TangentConv [21] and SalsaNext [22], which process point clouds through projection; and improvements of 8.1%, 5.6%, 3.8%, and 3.8% over the voxel partitioning and 3D convolution based methods KPConv [23], FusionNet [24], KPRNet [25], and TORANDONet [26]. Notably, MT-CylNet outperforms the recent Transformer-based method RangeViT [8], with a 2.8% mIoU improvement. The experimental results show that the proposed model outperforms most of the state-of-the-art methods.
3.4. Error Analysis and Failure Cases
To gain insight into the model's classification behavior, Table 3 presents the confusion matrix of MT-CylNet on the SemanticKITTI test set. The matrix quantifies the matching relationship between true and predicted classes as sample proportions (%), where diagonal values represent correct classification rates and off-diagonal elements indicate misclassification directions and frequencies.
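For reference, the row-normalized percentages used in such a confusion matrix can be obtained as sketched below, where each row is divided by the number of ground-truth samples of that class; the toy counts are illustrative only.

```python
import numpy as np

def row_percentages(conf):
    row_sums = conf.sum(axis=1, keepdims=True)
    return 100.0 * conf / np.maximum(row_sums, 1)          # avoid division by zero for empty classes

conf = np.array([[987, 10,  3],                            # toy 3-class confusion matrix (raw counts)
                 [ 23, 56, 21],
                 [  5, 12, 83]])
print(np.round(row_percentages(conf), 1))                  # e.g., first row -> [98.7, 1.0, 0.3]
```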
From Table 3, we can assess the model's semantic discrimination ability across classes, its feature confusion patterns, and the impact of the data distribution.
First, there are frequently misclassified categories. The mutual misclassification between bicycles and motorcycles highlights the dual challenges of geometric similarity and sparsity in point cloud semantic segmentation, with 2.3% of bicycle samples misclassified as motorcycles and 0.8% of motorcycle samples as bicycles. This stems from the structural overlap of the two categories in cylindrical projection: both exhibit elongated point distributions and a point density below 0.1 points/m³ in long-range scenarios (>50 m), which makes it difficult for the model to capture key differential features such as wheel count and body proportion. More critically, motorcyclist samples achieve only a 19.6% classification accuracy, with 26.5% of samples misclassified as person and 20.9% as motorcycle, reflecting feature fragmentation between dynamic objects (riders) and static backgrounds/vehicles; single-frame point clouds lack temporal motion cues, so the model cannot effectively distinguish the spatial relationship of "rider-motorcycle" combinations from independent pedestrians.
Next, there are segmentation bottlenecks for rare classes and small objects. Sample scarcity and geometric complexity jointly degrade performance on rare classes, with motorcyclists and traffic signs suffering misclassification rates of 80.4% and 47.0%, respectively. Among traffic signs, 28.9% are misclassified as poles and 8.7% as vegetation, indicating insufficient semantic discrimination between "vertical slender structures" and "natural vegetation". Poles have an omission rate of 23.7%, with 9.9% misclassified as vegetation and 5.2% as trunks; the core issue is that their diameter (<0.1 m) is smaller than the model's minimum effective receptive field (0.5 m), making it impossible for traditional 3D convolutions to capture sub-voxel geometric details, even with multi-scale dilated convolutions.
Finally, large-scale classes validate the model architecture. Large-scale classes like cars (98.7% accuracy) and roads (97.0% accuracy) provide positive validation: cars leverage a high point cloud density for the Transformer module to capture cross-regional semantic correlations, while roads benefit from the DDCM module’s multi-scale dilated convolutions, whose dual-dilation design (dilation = 1/2) perceives both near-road textures (within 0.5 m) and far-road orientations (beyond 50 m). These results confirm the effectiveness of the “global dependency modeling + multi-scale feature fusion” architecture, where long-range semantic associations via Transformer and hierarchical feature extraction via dilated convolutions significantly enhance segmentation reliability for classes with clear structures.
3.5. Ablation Experiments
To validate the effectiveness of the proposed modules, the ablation experiments evaluate multiple module combinations of the proposed method on the SemanticKITTI dataset, comparing the performance of different configurations in terms of mIoU and inference time (ms/frame); the results are shown in Table 4. We selected these evaluation dimensions based on the following considerations: first, the mIoU directly reflects the segmentation accuracy of the model, which is the core metric for evaluating semantic segmentation tasks; second, inference latency represents the computational efficiency of the model in practical applications, which is critical for real-time systems such as autonomous driving. The combination of these two metrics provides a comprehensive assessment of the model's balance between accuracy and efficiency. The four configuration combinations designed in the experiment each have a different focus: the basic Cylinder3D model serves as the baseline, reflecting the performance level of traditional methods; the Cylinder3D + Transformer combination primarily verifies the enhancement effect of global modeling capability; the Cylinder3D + ASPP combination focuses on evaluating the contribution of multi-scale feature extraction; and the complete MT-CylNet model demonstrates the synergistic effect of the two enhancement mechanisms. This design effectively isolates the contributions of different modules and clearly shows the impact of each component on the final performance.
The ablation experiments reveal important insights about the complementarity of our proposed modules. When using only Cylinder3D, the mIoU reaches 64.4%; when the Transformer module is integrated with Cylinder3D, the performance increases to 65.8%, a 1.4% improvement; when the ASPP module is combined with Cylinder3D, the mIoU reaches 66.3%, a 1.9% improvement. Notably, incorporating both the ASPP module and the Transformer module into Cylinder3D achieves the best performance of 66.8%, a 2.4% improvement over the baseline that exceeds the gain obtained from either module alone. This indicates that the two components are complementary rather than redundant: the global modeling contributed by the Transformer and the multi-scale feature extraction contributed by ASPP remain effective when combined. This validates that our architecture is not a simple combination of existing modules but a carefully designed framework in which each component complements and enhances the others.
Meanwhile, we tested the inference delay time for each model configuration. As shown in Table 4, the introduction of the Transformer module and the ASPP module led to an increase in the inference time, although it brought about a performance improvement. Compared to the base Cylinder3D model, the complete MT-CylNet model has an increase in inference latency of about 33.5%. In real-world applications, this latency increase is still within acceptable limits, especially for autonomous driving perception tasks that do not require strict real-time processing. In addition, with further engineering optimization and model pruning, it is expected that the inference latency can be reduced while maintaining performance, which will be one of the key directions of our future work.
In real-world scenarios such as autonomous driving, the GPU memory consumption (memory) and parameter size (params) of a model directly determine its hardware compatibility and deployment feasibility. As shown in Table 4, the baseline Cylinder3D model requires 980 MB of memory and 56.3 M parameters and can achieve real-time inference (75.1 ms latency) on mid-tier GPUs (e.g., NVIDIA RTX 2080 Ti). However, with the integration of the Transformer and ASPP modules, the memory demand increases (up to 1024 MB for MT-CylNet) and the parameter size grows to 68.2 M, necessitating high-end hardware (e.g., NVIDIA A100) to maintain real-time performance. For resource-constrained edge devices (e.g., Jetson AGX Xavier), techniques such as model pruning, quantization, or knowledge distillation must be applied to compress the model, reducing memory usage below 600 MB and parameters under 40 M, thereby meeting the requirements of low power consumption and high frame rates. Additionally, the 3 × 3 × 3 Conv model, despite having a similar parameter count (57.0 M), exhibits much higher memory consumption (1095.1 MB) and inferior performance, further validating the resource efficiency of the asymmetric design and cylindrical partition.
Through this series of ablation experiments, we verify the effectiveness of the proposed Transformer and ASPP modules in improving the performance of semantic segmentation of point clouds, and also evaluate the impact of the introduction of these modules on the computational efficiency, which provides a reference basis for practical applications.
4. Discussion
This research focuses on the feature fusion problem in 3D point cloud semantic segmentation, primarily addressing two limitations of traditional methods: global context modeling and multi-scale feature representation. Our proposed MT-CylNet, through the introduction of a Transformer module and an improved DDCM module, achieves the effective modeling of global dependencies in point cloud data and the adaptive extraction of multi-scale features.
During our research, we discovered that Transformer and dilated convolution mechanisms have strong complementarity in point cloud processing. Transformers excel at capturing global dependencies, helping the model understand semantic connections between distant objects, while multi-scale dilated convolutions effectively solve the feature extraction problem for objects of different scales, especially in adaptively processing objects at varying distances. This complementary advantage was fully verified in ablation experiments, where using either mechanism alone improved baseline performance, while their combination achieved the best results.
This research provides several important insights for the development of point cloud semantic segmentation technology that distinguish our approach from existing methods. First, correlation modeling between feature dimensions is crucial for improving segmentation performance, and our strategic placement of the Transformer at the high-dimensional feature layer represents a novel approach compared to methods like RangeViT [8], which applies attention mechanisms to projected 2D representations, or KPConv [23], which relies solely on local kernel-based convolutions. Second, the adaptive adjustment capability of receptive fields is key to handling point cloud scale variations, and our tri-directional dual-rate design offers a more flexible and efficient solution than approaches such as FusionNet [24], which uses computationally expensive multi-modal fusion, or TORANDONet [26], which employs fixed multi-view projections. Finally, our dual-level supervision strategy balances computational efficiency and accuracy in a way that previous methods have not achieved, demonstrating that architectural optimization can yield significant performance gains without excessive computational overhead. These innovations collectively address the specific challenges of point cloud semantic segmentation in autonomous driving environments, offering solutions that go beyond simple component integration.
Although MT-CylNet has achieved significant improvements in segmentation accuracy, the increase in inference latency reminds us of the need to further optimize model structure for practical applications. Future work will focus on developing more lightweight global modeling mechanisms, exploring feature compression and parallel computing strategies, as well as model pruning methods for specific application scenarios, thereby finding a better balance between high-precision segmentation and real-time processing.
5. Conclusions
MT-CylNet implements feature learning through an attention mechanism and multi-scale convolution in a novel, integrated framework. In the feature extraction stage, unlike the widely used convolution methods with a single receptive field, we adopt adaptive fusion: global modeling of high-level features is realized through the Transformer module, features are extracted separately by standard and dilated convolutions, and the important features are finally selected in an adaptive way. Our approach differs from existing methods by strategically combining the global modeling capability of the Transformer with the spatial adaptability of multi-scale dilated convolutions, resulting in a synergistic architecture specifically tailored to the characteristics of point cloud data. The experimental results confirm that this carefully designed integration produces performance improvements beyond a simple stacking of existing components, validating the architectural innovation of our approach.
Therefore, the MT-CylNet model can more accurately describe the differences between different targets in the feature space and has stronger feature expression ability. The strategic integration of complementary modules enables comprehensive scene understanding across varying scales and distances, which is crucial for autonomous driving applications. However, the computational complexity of the algorithm is high and incurs a large computational overhead, so the next step will be to improve the efficiency of feature extraction and utilization. Future research will focus on model compression techniques, parallel computing optimization, and hardware-aware algorithm design to make our approach more suitable for real-time applications while maintaining its superior segmentation performance.