Robust 3D Semantic Segmentation Method Based on Multi-Modal Collaborative Learning

Since camera and LiDAR sensors provide complementary information for the 3D semantic segmentation of intelligent vehicles, extensive efforts have been invested to fuse information from multi-modal data. Despite considerable advantages, fusion-based methods still have inevitable limitations: field-of-view disparity between the two modal inputs, the need for precisely paired data as inputs in both the training and inference stages, and higher resource consumption. These limitations pose significant obstacles to the practical application of fusion-based methods in real-world scenarios. Therefore, we propose a robust 3D semantic segmentation method based on multi-modal collaborative learning, aiming to enhance feature extraction and segmentation performance for point clouds. In practice, an attention-based cross-modal knowledge distillation module is proposed to effectively acquire comprehensive information from multi-modal data and guide the pure point cloud network; then, a confidence-map-driven late fusion strategy is proposed to dynamically fuse the results of the two modalities at the pixel level to complement their advantages and further optimize segmentation results. The proposed method is evaluated on two public datasets (urban dataset SemanticKITTI and off-road dataset RELLIS-3D) and our unstructured test set. The experimental results demonstrate that the proposed method is competitive with state-of-the-art methods in diverse scenarios and robust to sensor faults.


Introduction
With the continual progression of intelligent driving technology, the safety of intelligent vehicles (IVs) has attracted significant attention and interest [1]. In the field of intelligent driving technology, efficient, effective, and robust environmental perception serves as a foundational prerequisite for subsequent tasks, such as the precise positioning, reliable planning, and secure control of these IVs [2]. As the core module of environmental perception methods, 3D semantic segmentation is able to densely allocate specific semantic labels to individual points, including entities like drivable surfaces and backgrounds, and has emerged as a focal point of concern in recent years.
LiDAR-only semantic segmentation approaches utilize a diverse set of techniques to fully harness geometric information and have managed to achieve competitive results in structured scenarios, such as indoor environments and typical urban traffic scenes [3]. Nevertheless, when confronted with complex and dynamically changing surroundings characterized by sparse and visually similar geometric attributes, these methods encounter limitations inherent to LiDAR sensors, leading to below-expectation performance [4].
Remote Sens. 2024, 16, 453
A promising method to overcome such limitations lies in the incorporation of camera images, which provide a wealth of dense semantic features, including color and texture information. Consequently, LiDAR-camera fusion is a strategic approach to enhance the accuracy and robustness of 3D semantic segmentation methods in challenging environmental conditions [5,6]. Utilizing sensor calibration matrices, current LiDAR-camera fusion approaches typically adopt one of two primary strategies: either projecting image pixels onto LiDAR coordinates and performing feature fusion within the sparsely populated LiDAR domain [7-11], or projecting point clouds onto image planes using perspective projection to merge corresponding multi-modal features [4,12-16].
Despite the significant advantages offered by multi-modal fusion, these methods still have the following inherent limitations that cannot be circumvented:
(a) Field-of-view disparity: The LiDAR and camera sensors typically possess differing field-of-view characteristics, with only a small overlap area (as depicted in Figure 1). Consequently, it becomes infeasible to establish point-to-pixel mapping for points located outside this overlap area, which significantly restricts the broader application of fusion-based methods.
(b) Reliance on paired data: Fusion-based methods critically rely on the availability of accurately paired data, specifically precise point-pixel mapping between LiDAR and camera data. This mapping is crucial for both the training and inference stages. Thus, any data error or sensor malfunction could have detrimental impacts on segmentation performance and might even lead to algorithm failures. Figure 2 illustrates this vulnerability; for example, cameras are susceptible to light interference, which can result in issues like image confusion, blurriness, overexposure, and other anomalies, while LiDAR sensors can be affected by weather conditions like rain, snow, and fog, leading to phenomena such as "ghost" points or a significant reduction in the amount of point cloud data.
(c) High resource consumption: Fusion-based approaches require the simultaneous processing of both point cloud and image data, leading to increased demands on computing resources and storage space. Even though efforts have been made to mitigate these challenges through multi-tasking or cascading, such resource demands can pose a substantial burden when deploying real-time applications, especially on devices with limited resources.
To address the aforementioned challenges, this paper presents a robust 3D semantic segmentation method based on multi-modal collaborative learning. It comprehensively considers the complementarity between point cloud and image data at the feature level and output level during training, overcoming the limitation of LiDAR-only methods; benefiting from multi-modal collaborative learning, it can conduct 3D semantic segmentation without image inputs during inference, overcoming the limitations of multi-modal fusion-based methods. Extensive evaluations were conducted across diverse datasets, including the urban dataset SemanticKITTI [17], the off-road dataset RELLIS-3D [18], and our unstructured test set with sparse LiDAR points. The experimental results affirm that, by leveraging the synergies between point cloud and image data, our proposed method can achieve efficient, accurate, and robust 3D semantic segmentation performance in diverse and complex scenarios, especially when the raw data are corrupted. The main contributions of this paper are summarized as follows:
(a) This paper proposes a robust 3D semantic segmentation method based on multi-modal collaborative learning, which effectively deals with the limitations and restrictions of fusion-based 3D semantic segmentation methods.
(b) An attention-based cross-modal knowledge distillation module is proposed to assist 3D feature extraction using 2D image features with higher contributions, which further helps distill multi-modal knowledge to the single point-cloud modality for accurate and robust semantic segmentation.
(c) A late fusion strategy guided by a confidence map is proposed to emphasize the strengths of each modality by dynamically assigning per-pixel weights to the outputs and further optimizing segmentation results.
The rest of this paper is organized as follows: Section 2 reviews the related works, Section 3 presents the methodology, Section 4 presents and analyzes the experiments, and Section 5 concludes the paper.

LiDAR-Based 3D Semantic Segmentation Methods
In general, LiDAR-based 3D semantic segmentation methods can be divided into the following three categories based on distinct data representations.
Point-based methods [19-21] directly process unordered point clouds using MLP-based (multi-layer-perceptron-based) techniques. These methods have demonstrated excellent segmentation results on small-scale and dense point clouds. However, their application to sparse inputs in large-scale scenarios is often limited by factors such as poor locality, high computational costs, and substantial memory requirements, resulting in lower accuracy and slower inference speeds.
Voxel-based methods [22-26] transform point clouds into dense voxels and employ 3D convolution to extract and reconstruct the features in each voxel, which achieves superior segmentation results. However, the redundancy in dense voxel representation and the computational inefficiency of 3D convolution contribute to an exponential increase in the complexity of these methods, leading to poor real-time performance.
Projection-based methods can keep a good balance between segmentation performance and real-time performance, benefiting from the compactness of inputs and the lightness of 2D CNNs. They typically employ top-down or spherical projection for point cloud preprocessing, resulting in the formation of Bird's-Eye Views (BEVs) [27-29] and Range Views (RVs) [30-32]. Nevertheless, the BEV approach remains sparse while preserving the size of objects, and the RV approach disrupts the original topological relationships.
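To make the RV representation concrete, the following is a minimal NumPy sketch of a spherical projection. It is an illustration only, not the preprocessing used in [33]: the function name `spherical_projection`, the 64 × 2048 image size, and the vertical field of view of +3° to -25° are assumptions chosen to resemble a typical 64-beam sensor.

```python
import numpy as np

def spherical_projection(points, height=64, width=2048,
                         fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project an (N, 3) point cloud onto a (height, width) range image.

    Returns the range image (-1 where no point lands) and the (N, 2)
    integer pixel coordinates (row, col) of every point.
    All default parameters are illustrative assumptions.
    """
    fov_up = np.deg2rad(fov_up_deg)
    fov_down = np.deg2rad(fov_down_deg)
    fov = fov_up - fov_down

    depth = np.maximum(np.linalg.norm(points, axis=1), 1e-8)
    yaw = np.arctan2(points[:, 1], points[:, 0])
    pitch = np.arcsin(np.clip(points[:, 2] / depth, -1.0, 1.0))

    # Normalise both angles to [0, 1], then scale to pixel coordinates.
    col = 0.5 * (1.0 - yaw / np.pi) * width
    row = (1.0 - (pitch - fov_down) / fov) * height
    col = np.clip(np.floor(col), 0, width - 1).astype(np.int64)
    row = np.clip(np.floor(row), 0, height - 1).astype(np.int64)

    # Keep the nearest point when several points fall into one pixel.
    range_image = np.full((height, width), np.inf)
    np.minimum.at(range_image, (row, col), depth)
    range_image[np.isinf(range_image)] = -1.0
    return range_image, np.stack([row, col], axis=1)
```

Because several 3D points can collapse into one pixel and neighbouring pixels can belong to distant surfaces, the sketch also illustrates why the RV loses part of the original topology.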
Considering the demands for accurate feature extraction and real-time application, we utilize our former work [33] as the 3D branch backbone, which is a multi-projection fusion method that leverages rich complementary information between different views.

Knowledge Distillation Methods
Knowledge distillation (KD) was originally proposed for network model compression [34], that is, transferring rich hidden information from complex and large teacher networks to lightweight and compact student networks, aiming to reduce the performance gap between the two models. It was initially designed for image classification tasks, taking various forms of knowledge as distillation targets, including intermediate outputs [35,36], visual attention maps [37,38], interlayer similarity maps [39], and sample-level similarity maps [40,41].
Recent advancements have extended knowledge distillation to semantic segmentation tasks for intermediate feature extraction. For instance, ref. [42] simultaneously extracts pixel-level knowledge, paired-similarity knowledge, and global knowledge, achieving high-order consistency between fine-grained and comprehensive network outputs. Ref. [43] facilitates student model learning by reinterpreting the teacher network's output as a new potential domain and proposes an affinity distillation module to capture the long-term dependencies of the teacher network. Ref. [44] introduces a point-to-voxel knowledge distillation method and a difficulty-sensing sampling strategy to enhance distillation efficiency.
With the rapid progress of multi-modal computer vision technology, more and more research has applied knowledge distillation to the transfer of prior information between different modalities. For example, refs. [45-47] use additional two-dimensional image information during training to enhance algorithm performance in the inference stage; ref. [48] introduces 2D-assisted pre-training; ref. [49] expands 2D convolution into 3D convolution; and ref. [50] proposes a dense foreground-guided feature imitation method and a sparse instance distillation method to transfer spatial knowledge from LiDAR to multiple camera images for 3D target detection.
Nevertheless, in contrast to dense and regular camera images, LiDAR point clouds have inherent characteristics like sparsity, randomness, and variable density. This substantial disparity between the two modalities poses a formidable challenge for knowledge distillation across modalities. The direct application of knowledge distillation between the two modalities will pollute the modality-specific information. Therefore, we propose an attention-based cross-modal knowledge distillation module, enhancing the feature extraction of the 3D branch without losing its modality-specific information.

Methods
In this section, we introduce a robust 3D semantic segmentation method based on multi-modal collaborative learning, as shown in Figure 3. First, the efficient semantic segmentation backbone (including 2D and 3D branches) is utilized to leverage rich complementary information and offer reliable intermediate features for later multi-modal collaborative learning. Then, a cross-modal knowledge distillation module is proposed to enhance the feature representation of the 3D branch at multiple scales using prior feature information from the 2D branch. Finally, a late fusion strategy driven by confidence mapping is proposed to weight the prediction results of the two modal branches in a direct and explicit manner, which highlights the advantages of each modal branch while weakening the interference of incorrect data inputs, so as to generate the final accurate and robust prediction results.
In the following subsections, the architecture of the proposed method will be described in detail.

Semantic Segmentation Backbone
The 2D and 3D semantic segmentation backbones serve two primary objectives: first, to offer reliable semantic information and geometrical features to the cross-modal knowledge distillation module proposed below; second, to utilize their outputs to further constrain and enhance the final results of 3D semantic segmentation.
To achieve these goals, we simply employ HRNet [51] as the 2D branch and our former work [33] as the 3D branch for efficient and effective feature extraction and semantic segmentation. To be specific, the 2D branch, HRNet, maintains high-resolution representations through the whole process, providing semantically richer and spatially more precise 2D semantic features; the 3D branch combines RV and BEV at both the feature level and output level, which significantly mitigates information loss during the projection.

Attention-Based Cross-Modal Knowledge Distillation Module
The proposed cross-modal knowledge distillation module (see Figure 4) first fuses the paired 2D and 3D features $\{F^{C}, F^{R}, F^{B}\}$ based on the attentional mapping (AM); then, it distills knowledge from the enhanced fusion features $F_{fe}^{C}$ to the enhanced 3D features $F_{e}^{R}$, $F_{e}^{B}$ in a unidirectional alignment. In this manner, we can transfer the comprehensive information from multi-modal data into the LiDAR model for its feature enhancement, while retaining its specific characteristics. Below, we take Image-RV as an example to analyze the process.


Feature Alignment
The feature alignment between camera images and LiDAR RV images is introduced to generate pairwise matching features of the two modalities, so as to facilitate the subsequent knowledge distillation. It is implemented by calculating the geometric transformation matrices $M_{C2R}$ (mapping from camera image to range image) and $M_{R2C}$ (mapping from range image to camera image). The two mappings are inverses of each other. Here, we take $M_{C2R}$ as an example to analyze the process.
During the transformation, we utilize the original point cloud as an intermediary agent. We first calculate the matrix $M_{C2P} \in \mathbb{Z}^{H_c \times W_c}$ in Equation (1), which aligns the features of camera images to the original point cloud, as shown in Figure 5:
$$M_{C2P} = \{ n_{i,j} \mid 0 \le i \le H_c - 1,\ 0 \le j \le W_c - 1 \} \quad (1)$$
where $(H_c, W_c)$ are the height and width of the 2D camera images, and $n_{i,j}$ is the index of the point that projects onto the pixel coordinates $(i, j)$.
Then, the transformation matrix $M_{P2R} \in \mathbb{Z}^{N \times 2}$ from the original points to RV images is formed as follows:
$$M_{P2R} = \{ r_k = (u_k, v_k) \mid 0 \le k \le N - 1 \} \quad (2)$$
where $N$ is the number of points, and $r_k = (u_k, v_k)$ represents the projected pixel coordinates of the 2D RV image corresponding to the $k$-th point. By calculating $M_{C2P}$ and $M_{P2R}$, we obtain the geometric transformation matrix $M_{C2R} \in \mathbb{Z}^{H_r \times W_r \times 2}$:
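The alignment described above can be sketched in NumPy as follows. This is a hedged illustration rather than the authors' implementation: the helper names (`build_c2p`, `build_c2r`, `gather_camera_features`), the use of -1 for empty entries, the (row, column) ordering of coordinates, and the convention that $M_{C2R}$ stores, for each RV pixel, the camera pixel of the point landing there are all assumptions made for this example.

```python
import numpy as np

def build_c2p(cam_rc, cam_shape):
    """M_C2P: index of the point projected onto each camera pixel (-1 if none).

    cam_rc is an (N, 2) integer array of (row, col) camera coordinates;
    points falling outside the image are ignored. If several points hit
    one pixel, a single one of them is kept.
    """
    h_c, w_c = cam_shape
    m_c2p = np.full((h_c, w_c), -1, dtype=np.int64)
    inside = ((cam_rc[:, 0] >= 0) & (cam_rc[:, 0] < h_c)
              & (cam_rc[:, 1] >= 0) & (cam_rc[:, 1] < w_c))
    idx = np.flatnonzero(inside)
    m_c2p[cam_rc[idx, 0], cam_rc[idx, 1]] = idx
    return m_c2p

def build_c2r(m_c2p, m_p2r, rv_shape):
    """M_C2R (assumed convention): camera (row, col) for each RV pixel, -1 if none."""
    h_r, w_r = rv_shape
    m_c2r = np.full((h_r, w_r, 2), -1, dtype=np.int64)
    rows, cols = np.nonzero(m_c2p >= 0)
    rv = m_p2r[m_c2p[rows, cols]]          # RV coordinates of the matched points
    m_c2r[rv[:, 0], rv[:, 1]] = np.stack([rows, cols], axis=1)
    return m_c2r

def gather_camera_features(cam_feat, m_c2r):
    """Resample (C, H_c, W_c) camera features onto the RV grid; zeros where unmatched."""
    valid = m_c2r[..., 0] >= 0
    out = np.zeros((cam_feat.shape[0],) + valid.shape, dtype=cam_feat.dtype)
    out[:, valid] = cam_feat[:, m_c2r[valid, 0], m_c2r[valid, 1]]
    return out
```

In this sketch, RV pixels whose points lie outside the camera field of view simply receive zero features, which mirrors the field-of-view disparity discussed in the Introduction.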

Fusion and Distillation
After feature alignment, we can utilize paired features from the 2D and 3D branches for the fusion and distillation block.
Considering the huge feature gap introduced by different modal networks, it is inappropriate to fuse 3D features and their corresponding 2D features directly. Therefore, we design a 2D-Learner based on MLP to narrow the gap between different modal features. It can be formulated as follows:
Then, we design a fusion method based on spatial attention to achieve the enhanced fusion features, which could improve the feature representation by focusing on important features and suppressing unimportant features. It can be formulated as follows:
where $\odot$ represents point-wise multiplication; & represents channel concatenation; and $A$ represents the attentional map which takes 2D and 3D features into comprehensive consideration. This can be formulated as follows:
where $F \in \mathbb{R}^{Ch \times H_F \times W_F}$ represents the feature map; $P(F) \in \mathbb{R}^{H_F \times W_F}$ represents the result of average-pooling the absolute values along the channel dimension of $F$; $N(F) \in \mathbb{R}^{H_F \times W_F}$ represents the attention derived from softmax standardization of values at all spatial locations; and $\tau$ represents the hyperparameter that regulates the distribution entropy.
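As an illustration of the attention terms defined above, the sketch below computes $P(F)$ as the channel-wise mean of absolute values and $N(F)$ as a temperature-scaled softmax over all spatial locations. The function names, the default $\tau = 0.5$, and in particular the way the two attention maps are averaged and applied to the concatenated features in `fuse_with_attention` are assumptions made for this example; they are not taken from the original formulation.

```python
import numpy as np

def spatial_attention(feat, tau=0.5):
    """feat: (Ch, H, W) -> attention map (H, W) summing to 1 over all locations."""
    pooled = np.mean(np.abs(feat), axis=0)   # P(F): channel-wise average of |F|
    logits = pooled / tau                    # tau controls how peaked the map is
    logits = logits - logits.max()           # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()           # N(F): softmax over spatial locations

def fuse_with_attention(feat_2d, feat_3d, tau=0.5):
    """Hypothetical fusion: concatenate channels, then re-weight each location."""
    attn = 0.5 * (spatial_attention(feat_2d, tau) + spatial_attention(feat_3d, tau))
    attn = attn * attn.size                  # rescale so that the mean weight is 1
    stacked = np.concatenate([feat_2d, feat_3d], axis=0)
    return stacked * attn[None, :, :]
```

A smaller $\tau$ concentrates the attention on the few locations with the strongest responses, whereas a larger $\tau$ approaches uniform weighting.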
After that, we perform feature distillation between the enhanced fusion feature $F_{fe}^{C}$ and the enhanced 3D feature $F_{e}^{R}$. Specifically, we design a feature-level distillation loss $Loss_{dis}$ as the supplement of the segmentation task loss, which comprises the multi-scale feature imitation loss $Loss_{fea}$ and the attention imitation loss $Loss_{att}$. The feature imitation loss aims at narrowing distribution differences between the two modal features. The attention imitation loss aims at enabling $F_{e}^{R}$ to learn and generate attention patterns similar to those of $F_{fe}^{C}$, thus focusing more attention on spatial positions that $F_{fe}^{C}$ considers more important. The overall distillation loss is expressed as follows:
where $\lambda$ represents the hyperparameter that controls the relative importance between the two loss functions and balances them at the same scale.
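A possible shape of such a distillation loss is sketched below. The concrete choices are assumptions for illustration only: the feature imitation term is taken as a mean squared error, the attention imitation term as an L1 distance between spatial attention maps, and $\lambda$ is applied to the attention term. The sketch also assumes that student and teacher features already share the same shape at every scale.

```python
import numpy as np

def spatial_attention(feat, tau=0.5):
    """(Ch, H, W) -> (H, W) softmax attention over all spatial locations."""
    logits = np.mean(np.abs(feat), axis=0) / tau
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()

def distillation_loss(student_feats, teacher_feats, lam=1.0, tau=0.5):
    """Sum of feature and attention imitation terms over all scales.

    student_feats / teacher_feats: lists of (Ch, H, W) arrays, one per scale.
    The MSE and L1 forms and the placement of lam are illustrative assumptions.
    """
    loss_fea = 0.0
    loss_att = 0.0
    for s, t in zip(student_feats, teacher_feats):
        loss_fea += float(np.mean((s - t) ** 2))
        loss_att += float(np.sum(np.abs(spatial_attention(s, tau)
                                        - spatial_attention(t, tau))))
    return loss_fea + lam * loss_att
```

In a training framework, the teacher features would be treated as constants so that gradients flow only into the 3D branch, which matches the unidirectional alignment described above.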
Through the above analysis, we can see that $F_{convert}$ is generated from 3D point cloud features, while also being influenced by the 2D image branch with the enhanced fusion features $F_{fe}^{C}$ as input. Therefore, as the intermediary between enhanced fusion features and 3D point cloud features, the 2D-Learner can effectively prevent the image modality from contaminating the modality-specific information of the point cloud in the distillation process, while simultaneously providing rich color, texture, and semantic information for the point cloud modality.
In addition, the fusion branch is adopted only in the training stage, and the 2D image branch can be discarded in the inference stage. Compared with multi-modal fusion-based methods, our method can process raw point clouds directly, avoiding the large blind area outside the image field of view and avoiding additional computational burdens in practical applications.

Confidence-Map-Driven Late Fusion Strategy
Through the above multi-modal knowledge distillation module, the 3D branch can learn additional semantic features from the 2D branch. However, the advantages of these features may not be fully reflected in the segmentation results; that is, the predicted results of the fusion methods are often not as good in some aspects as predictions based solely on images. For example, image-only segmentation methods have an absolute advantage in small target segmentation and object contour extraction in complex backgrounds, but the performance may decrease when fused with sparse point clouds.
Furthermore, influenced by the diversity of scene elements and the accuracy of sensor devices, the data-quality level of multi-modal fusion inputs is uneven. For example, cameras are susceptible to lighting interference, leading to phenomena such as image blur and overexposure; LiDAR is prone to the impact of weather conditions like rain or snow, resulting in a sharp decrease in the number of points. This unevenness is also reflected in the output results of their respective modal branches. Therefore, when the image quality is low, it is advisable to rely more on the geometric and depth information from point clouds to mitigate the interference of erroneous color and texture information. Conversely, low-quality point clouds often struggle to accurately represent the spatial geometric information of the scenario.
In summary, the impact of different modalities on the prediction results should not be equal. Therefore, inspired by the idea of decision-level fusion, we propose a late fusion strategy based on confidence mapping. This strategy directly and explicitly weights the prediction results of the two modal branches, highlighting the respective advantages of each modality branch while mitigating the interference of erroneous data inputs, so as to output the final accurate and robust predictions.
Specifically, a pixel-by-pixel confidence weight map is calculated using the probability of the predicted segmentation results, which is used to measure the reliability of the output segmentation results of each modality branch. For a segmentation network, its output consists of $Class$ channels, each representing the probability that a pixel belongs to a particular category among the $Class$ categories. The category with the highest probability is chosen as the final segmentation result. Generally, when the prediction for a pixel has an extremely high probability for a particular category and low probabilities for the others, this prediction can be considered to have high confidence; conversely, when the probability distribution between categories is close to uniform, it indicates low confidence in the prediction for that pixel. Inspired by this, we designed the calculation of output confidence as follows:
where $W_{2D}$ and $W_{3D}$ represent the confidence maps of the 2D branch and 3D branch, respectively; $Pr_{2D}$ and $Pr_{3D}$ represent the segmentation results of the 2D branch and 3D branch, respectively; and $Pr_{f}$ represents the final fusion results.
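The intuition above can be illustrated with a minimal NumPy sketch. The confidence measure used here, one minus the normalized entropy of the per-pixel class distribution, and the normalized weighted sum used for fusion are assumptions chosen to match the qualitative description (peaked distributions give high confidence, near-uniform distributions give low confidence); they are not the exact formulation of the method. The function names are likewise hypothetical.

```python
import numpy as np

def confidence_map(prob, eps=1e-12):
    """prob: (Class, H, W) softmax output -> confidence (H, W) in [0, 1].

    Assumed measure: 1 - normalized entropy, so a one-hot prediction gives 1
    and a uniform prediction gives 0.
    """
    num_classes = prob.shape[0]
    entropy = -np.sum(prob * np.log(prob + eps), axis=0)
    return np.clip(1.0 - entropy / np.log(num_classes), 0.0, 1.0)

def late_fusion(prob_2d, prob_3d, eps=1e-12):
    """Fuse two aligned (Class, H, W) probability maps with per-pixel weights."""
    w_2d = confidence_map(prob_2d) + eps     # eps keeps the weights strictly positive
    w_3d = confidence_map(prob_3d) + eps
    fused = (w_2d * prob_2d + w_3d * prob_3d) / (w_2d + w_3d)
    return fused, fused.argmax(axis=0)
```

With this weighting, a pixel for which one branch is nearly certain and the other is close to uniform follows the confident branch almost entirely, while two equally uncertain branches fall back to a plain average.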

Joint Learning
In the optimization of the 2D branch, traditional supervised learning methods are unsuitable due to the lack of dense image annotations. Consequently, we adopt the concept of transfer learning and introduce a 2D semantic discriminator $D_s$ to differentiate between predicted semantic labels and ground truth (GT) semantic labels. Specifically, $D_s$ incorporates both global and Markov discriminators, which enables the consideration of local texture information as well as ensuring global consistency. The adversarial loss $Loss_{2D}$ can be formulated as follows:
where $GT_{2D}$ represents the sparse 2D GT semantic labels generated by projecting the corresponding 3D GT semantic labels using point-to-pixel mapping, $Pr_{f}$ is the corresponding predicted probability, and $E$ represents the expectation operation.
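The generation of sparse 2D GT labels from 3D GT labels can be sketched as a standard pinhole projection. The sketch assumes a 3 × 4 extrinsic matrix that maps LiDAR coordinates into the camera frame and a 3 × 3 intrinsic matrix; the function name and the ignore value of 255 for unlabeled pixels are hypothetical choices for this example.

```python
import numpy as np

def project_labels_to_image(points, labels, intrinsic, extrinsic,
                            image_shape, ignore_label=255):
    """Build a sparse (H, W) label image from per-point labels.

    points: (N, 3) LiDAR coordinates; labels: (N,) integer classes;
    intrinsic: (3, 3); extrinsic: (3, 4) LiDAR-to-camera transform (assumed).
    Pixels that receive no point keep `ignore_label`.
    """
    h, w = image_shape
    homo = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)
    cam = homo @ extrinsic.T                     # (N, 3) points in the camera frame
    in_front = cam[:, 2] > 1e-6                  # drop points behind the camera
    pix = cam[in_front] @ intrinsic.T
    u = np.floor(pix[:, 0] / pix[:, 2]).astype(np.int64)   # column index
    v = np.floor(pix[:, 1] / pix[:, 2]).astype(np.int64)   # row index
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)

    label_image = np.full((h, w), ignore_label, dtype=np.int64)
    label_image[v[inside], u[inside]] = labels[in_front][inside]
    return label_image
```

Since only pixels hit by a LiDAR point receive a label, the resulting annotation is sparse, which is the reason given above for not relying on conventional dense supervision in the 2D branch.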
For 3D branch optimization, we combine the weighted cross-entropy loss and the Lovász-softmax loss to optimize the point cloud branch:
where $v_i$ is the frequency of each category (the number of points in each category), $GT_{3D}$ and $Pr_{3D}$ are the GT and the corresponding predicted probability, $J$ is the Lovász extension of IoU (Intersection-over-Union), and $e(class_k)$ is the vector of errors for category $class_k$.
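The two terms can be sketched in NumPy as follows. The class weight $1/\sqrt{v_i}$ and the equal weighting of the two terms are assumptions for illustration; the Lovász part follows the commonly used sorted-error formulation and skips classes that are absent from the ground truth.

```python
import numpy as np

def weighted_cross_entropy(prob, labels, class_counts, eps=1e-12):
    """prob: (N, Class) probabilities; labels: (N,) integer GT classes."""
    weights = 1.0 / np.sqrt(class_counts + eps)   # assumed: rarer classes weigh more
    picked = prob[np.arange(labels.shape[0]), labels]
    return float(np.mean(weights[labels] * -np.log(picked + eps)))

def lovasz_grad(fg_sorted):
    """Gradient of the Lovasz extension of the Jaccard loss w.r.t. sorted errors."""
    total = fg_sorted.sum()
    intersection = total - np.cumsum(fg_sorted)
    union = total + np.cumsum(1.0 - fg_sorted)
    jaccard = 1.0 - intersection / union
    jaccard[1:] = jaccard[1:] - jaccard[:-1]
    return jaccard

def lovasz_softmax(prob, labels):
    """Mean Lovasz-softmax loss over the classes present in `labels`."""
    losses = []
    for c in range(prob.shape[1]):
        fg = (labels == c).astype(np.float64)
        if fg.sum() == 0:
            continue                              # skip classes absent from the GT
        errors = np.abs(fg - prob[:, c])
        order = np.argsort(-errors)               # largest errors first
        losses.append(float(np.dot(errors[order], lovasz_grad(fg[order]))))
    return float(np.mean(losses)) if losses else 0.0

def loss_3d(prob, labels, class_counts):
    """Sum of both terms (equal weighting is an assumption)."""
    return weighted_cross_entropy(prob, labels, class_counts) + lovasz_softmax(prob, labels)
```

The cross-entropy term drives per-point accuracy, while the Lovász term optimizes a surrogate of the IoU directly, which is the quantity used for evaluation.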
We amalgamate the loss functions from the two segmentation branches and the distillation loss to optimize the entire network through end-to-end training, aiming to maximize the IoU index for each category. The final loss function can be formulated as follows:
Loss_total = Loss_2D + Loss_3D + Loss_dis (11)

Dataset
To assess the effectiveness of our method at improving accuracy and robustness, we utilized the urban dataset SemanticKITTI and the off-road dataset RELLIS-3D, which provide diverse scenarios and allow us to comprehensively evaluate the performance of our method. The details are as follows:
SemanticKITTI is a widely used benchmark dataset for semantic segmentation tasks in autonomous driving. The dataset provides images and point clouds with semantic-level 3D annotations. It contains 19,130 frames for training, 4071 frames for validation, and 20,351 frames for testing. We used the same train/validation/test sequences and the same 19 categories as the benchmark algorithms.
RELLIS-3D was collected from three unpaved roads on the Texas A&M University RELLIS campus and contains images and point clouds with semantic-level 3D annotations. It contains 7800 frames for training, 2413 frames for validation, and 3343 frames for testing. We used the same train/validation/test sequences and the same 14 categories as the benchmark algorithms.
Additionally, we collected 100 frames of unstructured scenes, defined here as outdoor parking lots and roads without clear boundaries or lane markings. The test set was gathered using a vehicle equipped with a Velodyne 32-line LiDAR and a forward-view monocular camera for visualization. Notably, the two sensors were synchronized in time, but no extrinsic calibration was conducted. The lack of sensor calibration matrices precluded the use of fusion-based methods, and the low number of beams simulated faulty LiDAR inputs. These distinctive characteristics make our unstructured test set particularly suitable for testing the segmentation performance and robustness of the compared methods.

Implementation Details
Cross-modal knowledge distillation was applied to the middle and the last layers of the encoders. We set the spatial-attention-related hyper-parameter to τ = 0.5, referring to [34,52,53], and the loss-related hyper-parameter to λ = 2.5 × 10^−3, referring to [34,54].
Our network was trained for 50 epochs with a batch size of 16. We utilized stochastic gradient descent (SGD) as the optimizer with a weight decay of 0.001, a momentum of 0.9, and an initial learning rate of 0.02. All experiments were conducted on NVIDIA RTX 3090 GPUs.
We verified the performance of the proposed method using the common evaluation metrics for semantic segmentation tasks, IoU and mIoU.
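For reference, the per-class IoU is TP / (TP + FP + FN), and the mIoU is its mean over the classes. The sketch below computes both from a confusion matrix. The function names and the `ignore` label for unannotated points are assumptions made for illustration and are not part of any benchmark toolkit.

```python
def confusion_matrix(pred, gt, num_classes, ignore=-1):
    """cm[g][p] counts points with ground truth g that were predicted as p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, g in zip(pred, gt):
        if g != ignore:  # skip unannotated points
            cm[g][p] += 1
    return cm


def iou_per_class(cm):
    """IoU = TP / (TP + FP + FN); None for classes that never occur."""
    ious = []
    for k in range(len(cm)):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp
        fp = sum(row[k] for row in cm) - tp
        denom = tp + fp + fn
        ious.append(tp / denom if denom else None)
    return ious


def mean_iou(ious):
    """Mean over the classes that have a defined IoU."""
    valid = [v for v in ious if v is not None]
    return sum(valid) / len(valid) if valid else 0.0


gt = [0, 0, 1, 1, 2, -1]
pred = [0, 1, 1, 1, 2, 0]
ious = iou_per_class(confusion_matrix(pred, gt, num_classes=3))
miou = mean_iou(ious)
```

In this toy example, class 0 has one true positive and one false negative, class 1 has two true positives and one false positive, and class 2 is predicted perfectly.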

Comparative Results and Discussion of SemanticKITTI
We compared the results of our proposed method with typical and representative LiDAR segmentation methods on the SemanticKITTI benchmark. Specifically, RandLA-Net and KPConv represent SOTA point-based methods; SPVNAS and Cylinder3D represent voxel-based methods; PolarNet and SalsaNext represent single-projection-based methods; and MPF, GFNet, AMVNet, and our 3D Branch represent multi-projection-based methods. These methods are the top algorithms in their respective categories on the SemanticKITTI benchmark. Moreover, three representative and open-access LiDAR-camera fusion segmentation methods (RGBAL, xMUDA, and PMF) were used for comparison.
The quantitative comparison results are shown in Tables 1 and 2, where the bold numbers indicate the best results and the green bold numbers indicate the second-best results. Our method outperformed all the compared methods in terms of mIoU; specifically, it achieved a 4.6% improvement over the best LiDAR-only method, Cylinder3D, and a 4.7% improvement over the best fusion-based method, PMF. Moreover, our method still ran in real time, since its 83 ms processing time is below the 100 ms frame period of a 10 Hz LiDAR.
When compared with LiDAR-only methods (see Table 1), our method achieved the best performance in 9 of the 19 categories and the second-best performance in 6 categories. This shows that our method can effectively merge comprehensive multi-modal features (including the dense semantic features of images) into point clouds to compensate for the weaknesses of LiDAR-only segmentation, especially in categories with sparse features (e.g., motorcyclist, fence, trunk, pole, traffic sign) or similar geometric features (e.g., building, parking, sidewalk), which cannot be effectively distinguished using LiDAR geometric features alone. The same conclusion can be drawn from Figure 6, where our method produces fewer error points; correct and incorrect predictions are painted in gray and red, respectively, to highlight the differences.
When compared with fusion methods (see Table 2), our multi-modal collaborative learning method effectively integrates multi-modal features and removes the extreme sensitivity of the image modality to complex and variable environments (e.g., diverse illumination intensity, similar color textures), resulting in the best performance in most categories (12 of the 19 categories). Additionally, the absence of dense GT labels for 2D semantic segmentation renders fusion-based methods less adept at distinguishing small, irregular objects. Our transfer-learning-based optimization compensates for this limitation, achieving a 1.1% to 21.4% improvement over the second-best performance in small-scale categories such as person, bicyclist, fence, pole, and traffic sign. The same conclusion can be drawn from Figure 6, where our method produces fewer error points. For example, the white car/wall in the sunlight and the objects in the shadow were misclassified by fusion-based methods, while our method distinguished them accurately.

Comparative Results and Discussion of RELLIS-3D
The models were also evaluated on RELLIS-3D, and the comparison results on the test set are shown in Table 3. The IoU and mIoU of the existing methods decreased sharply due to the dataset's complexity and the similarity of the available feature information. When the features in a dataset are highly similar, the characteristics that distinguish different objects or regions are subtle. If the features that define different categories are not well discriminated, the methods struggle to precisely delineate object boundaries and to differentiate between categories, leading to lower IoU and mIoU scores.
However, our method still excelled in overall performance, benefitting from the effectiveness of the proposed multi-modal collaborative learning approach in combining comprehensive multi-modal features. Specifically, it outperformed the best LiDAR-only method, Cylinder3D, by 3.2%, and the best fusion-based method, PMF, by 4.4%.
Furthermore, our method exhibited superior performance in 9 out of 14 classes, particularly excelling in small objects (e.g., pole, log, and fence) and classes with similar geometric features (e.g., grass, concrete, mud, and rubble), where LiDAR point features are typically insufficient. This conclusion is further supported by the qualitative comparison results shown in Figure 7.

Comparative Results and Discussion of Our Test Set
To demonstrate the robustness and adaptivity of our method in various and complex scenarios, especially with LiDAR faults, we performed extra experiments on our unstructured test set with sparse LiDAR data inputs. We tested Cylinder3D (voxel-based), SalsaNext (single-projection-based), and our 3D Branch (multi-projection-based) as representative methods.
The results show that these models are severely affected by sparse inputs; specifically, SalsaNext was the most affected because its predictions are mainly determined by the size of the dense range image inputs, while the impact on Cylinder3D was relatively low since it is specially optimized for sparse point clouds. However, none of these methods could extract accurate objects and continuous flat surfaces; for example, the drivable area they predicted was either discontinuous or misclassified as vegetation (see the red circles for each scenario in Figure 8). In contrast, our method accurately classified the cars and performed better on drivable areas. This validates that our method can effectively integrate the advantages of both modalities and achieve the best performance in the robustness evaluation.

Effects of Cross-Modal Knowledge Distillation Module
Comparing the first two rows in Table 4, it is evident that the direct introduction of image features after feature alignment ("FA") had a negative impact on the overall performance, resulting in a 0.2% decrease. This indicates that certain image features may interfere with the 3D branch, providing further evidence of the necessity for knowledge distillation.
A comparison of the first and fourth rows shows a notable improvement of 2.4% after employing fusion and distillation. This improvement primarily stemmed from the knowledge provided by the more robust fusion prediction. Moreover, a comparison of the third and fourth rows shows that simple distillation ("D") between the two modalities without the fusion approach led to a 1.7% mIoU decrease. This is attributed to the contamination between modalities resulting from straightforward distillation, akin to traditional distillation methods.
Comparing the fourth and fifth rows, it is apparent that the attentional map ("AM") improved the mIoU by 0.6% through simple channel concatenation, which validates its effectiveness.

Effects of Late Fusion Strategy
Comparing the fifth and seventh rows, the incorporation of the confidence-map-driven late fusion strategy harnessed the strengths of both the image and point cloud for different categories, resulting in a 1.1% improvement in the mIoU.
Furthermore, comparing the sixth and seventh rows, "LF" achieved a 0.5% higher mIoU compared to the traditional equal-weight-based late fusion method ("E"). This improvement highlights the superiority of the "LF" method in autonomously addressing and complementing the advantages and disadvantages of each modality branch.

Discussion
Despite the above advantages, there are still areas where our work can be improved. The IoU of our method is not the best when it comes to segmenting some objects with richer 3D geometric information, such as cars, bicycles, motorcycles, and people. This is because the projection-based method loses some of the 3D spatial information. Additionally, the present method relies on high-end GPUs to achieve real-time performance. Therefore, our future work will focus on further optimizing the segmentation accuracy of these categories by introducing point-based or voxel-based branches, as well as optimizing the proposed method from the perspective of model compression and single-modality knowledge distillation for applications in resource-constrained intelligent vehicles.

Conclusions
This paper introduces a robust 3D semantic segmentation method based on multi-modal collaborative learning and addresses the challenges that impede the performance of fusion-based 3D semantic segmentation methods. The proposed attention-based cross-modal knowledge distillation module leverages attentional fusion to selectively integrate multi-modal features and utilizes feature distillation to enrich 3D point cloud features via 2D image priors. The confidence-map-driven late fusion strategy dynamically assigns weights for both branches to accentuate the strengths of each modality. Through the integration of these modules, our method is capable of acquiring richer semantic and geometric information from multi-modal data, thereby effectively enhancing the performance and robustness of a pure LiDAR semantic segmentation network.
We evaluated our proposed method on three datasets: the urban dataset SemanticKITTI, the off-road dataset RELLIS-3D, and our self-created unstructured test set. Extensive experiments showed that the proposed method is competitive with state-of-the-art methods in diverse scenarios and is more robust to sensor fault conditions. The ablation experiments served to further validate the contributions of our designed modules.

Figure 1. A sample field-of-view difference between LiDAR and camera. (a) Original point cloud; (b) point cloud in camera field-of-view shown in red.

Figure 2. A sample sensor fault.

Figure 3. Overall architecture of the proposed method, comprising two key components: the cross-modal knowledge distillation module and the late fusion strategy driven by confidence mapping.

Figure 4. The architecture of the cross-modal knowledge distillation module.

Figure 5. The architecture of feature alignment.

Table 1. Results and time of typical LiDAR-only methods and our method on the SemanticKITTI test set.

Table 2. Results and time of typical LiDAR-camera fusion methods and our method on the SemanticKITTI validation set, where we only used point clouds in the camera field-of-view.

Table 3. Results and time of typical methods and our method on RELLIS-3D. * indicates results using point clouds in the camera field-of-view.