Article

RMP: Robust Multi-Modal Perception Under Missing Condition

1 University of Electronic Science and Technology of China, Chengdu 611731, China
2 Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen 518110, China
3 School of Artificial Intelligence, Shenzhen Polytechnic University, Shenzhen 518055, China
* Author to whom correspondence should be addressed.
Electronics 2026, 15(1), 119; https://doi.org/10.3390/electronics15010119
Submission received: 27 November 2025 / Revised: 20 December 2025 / Accepted: 22 December 2025 / Published: 26 December 2025
(This article belongs to the Special Issue Hardware and Software Co-Design in Intelligent Systems)

Abstract

Multi-modal perception is a core technology that enables edge devices to achieve safe and reliable environmental understanding in autonomous driving scenarios. In recent years, most approaches have focused on integrating complementary signals from diverse sensors, such as cameras and LiDAR, to improve scene understanding in complex traffic environments, and this direction has attracted significant attention. However, in real-world applications, sensor failures frequently occur; for instance, cameras may malfunction under poor illumination, which severely reduces the accuracy of perception models. To overcome this issue, we propose a robust multi-modal perception pipeline designed to improve model performance under missing-modality conditions. Specifically, we design a missing feature reconstruction mechanism that reconstructs absent features by leveraging intra-modal common clues. Furthermore, we introduce a multi-modal adaptive fusion strategy that facilitates adaptive multi-modal integration through inter-modal feature interactions. Extensive experiments on the nuScenes benchmark demonstrate that our method achieves SOTA-level performance under missing-modality conditions.

1. Introduction

3D perception is a core pillar of modern autonomous driving systems, allowing edge devices to achieve reliable environmental understanding and decision-making in complex traffic scenarios [1,2]. Among various three-dimensional perception tasks, semantic segmentation plays a crucial role. By assigning semantic labels to each pixel or point, it enables fine-grained scene comprehension, which further supports high-level reasoning and navigation in autonomous systems [3,4]. Traditional perception systems typically rely on single-modality sensors, such as LiDAR-based [5,6] or camera-based [7,8] approaches, to construct spatial awareness of the environment. However, single-modal methods face constraints that are rooted in the physical characteristics of their corresponding sensors. LiDAR provides accurate geometric and structural information but lacks rich semantic and texture cues, while cameras capture detailed visual appearance yet remain vulnerable to illumination changes, occlusion, and depth ambiguity [9]. These inherent limitations have motivated increasing research interest in multi-modal perception, which aims to fuse information across different sensing sources to achieve a better understanding of the environment [10].
Multi-modal perception involves using information from various heterogeneous sensors, such as cameras and LiDAR, to construct a more reliable understanding of the environment [11]. Unlike single-modality perception, multi-modal methods can exploit the complementary strengths of different sensors, enabling richer semantic understanding and more accurate geometric reasoning [10]. Furthermore, multi-modal methods can achieve more stable perception capabilities under challenging conditions such as low light, occlusions, and adverse weather, thereby improving model robustness. Beyond conventional visual and LiDAR sensors, research in wireless imaging suggests that non-visual modalities can offer additional environmental information, particularly under adverse conditions [12]. Although such modalities are not considered in this work, they highlight a promising avenue for improving the robustness of multi-modal perception systems. Therefore, multi-modal perception has become a core foundation of modern autonomous driving systems.
Existing perception methods that utilize multiple sensing sources are typically grouped into early fusion [13], middle fusion [11], and late fusion [14]. Early fusion integrates information directly at its initial data stage, middle fusion learns joint representations from processed features, while late fusion combines task-specific outputs at the decision level. Although these fusion strategies have significantly improved perception accuracy and robustness, they still encounter substantial challenges in real-world autonomous driving scenarios. In L3/L4 autonomous driving systems, multi-modal perception models are typically deployed on edge computing platforms that must satisfy strict real-time and safety constraints. However, factors such as sensor malfunction, occlusion, or adverse weather conditions may cause partial camera inputs to become unavailable, thereby disrupting cross-modal feature alignment and degrading model robustness. Consequently, missing camera modalities have emerged as one of the critical issues that must be addressed in the development of reliable multi-modal perception research.
To address the aforementioned challenges, several recent studies have explored robust fusion strategies for multi-modal perception [10,15]. For instance, MSeg3D [15] proposes a joint framework that performs both intra-modal feature extraction and inter-modal feature fusion, effectively mitigating the multi-sensor alignment issues outside the field of view (FOV) and maintaining reasonable performance under missing-modality conditions. Similarly, MetaBEV [16] enhances fusion robustness by introducing modality-specific layers within cross-modal attention blocks, which alleviates the severe performance degradation caused by sensor corruption or partial failure. Although these approaches are capable of handling incomplete modalities to some extent, they fundamentally operate on an incomplete feature space. Since no mechanism explicitly infers the representation of the absent view, the fused multi-modal feature becomes biased toward the remaining modalities, causing unstable cross-modal interactions. As a result, robust fusion alone cannot recover the semantic information that would have been contributed by the missing viewpoint. Consequently, an increase in the number of absent camera inputs leads to a notable drop in the model’s accuracy. As illustrated in Figure 1, when all camera inputs are missing, MSeg3D achieves an mIoU of only 74.5%, representing a 4.7% drop compared to the complete-modality scenario. In contrast, our proposed method attains an 80.8% mIoU, with a notably smaller performance decline of only 3.2%, demonstrating stronger robustness under severe missing conditions.
To this end, we present the Robust Multi-modal Perception (RMP) method, which explicitly models missing features using intra-modal information to generate effective replacement values, thereby improving the robustness of the perception model under missing-view conditions. Specifically, the RMP comprises two principal modules: the Missing Feature Reconstruction Module (MFR) and the Multi-modal Adaptive Fusion Module (MAF). The MFR leverages intra-modal correlations to reconstruct the representations of missing modalities, thereby preserving the semantic consistency of the feature space even when certain camera views are unavailable. The MAF, on the other hand, learns dynamic weighting factors to adaptively balance and fuse information from different modalities, achieving reliable integration between the reconstructed image features and LiDAR representations. By jointly optimizing these two modules, RMP effectively maintains robust perception capability in challenging scenarios with incomplete sensor inputs.
The main contributions of our work can be summarized as follows:
  • We present RMP, a robust multi-modal perception framework designed to address missing-modality scenarios.
  • We design a missing feature reconstruction mechanism that exploits intra-modal feature correlations to recover missing camera representations, thereby alleviating performance degradation. Furthermore, we introduce a cross-modal adaptive fusion strategy that learns adjustable weights to fuse image and LiDAR features, improving the efficiency and reliability of cross-modal information interaction.
  • Experiments on nuScenes [17] show SOTA-level performance and strong robustness under various camera view missing conditions.

2. Related Work

2.1. Single-Modal Perception

Methods of single-modal perception mainly rely on information from visual images or LiDAR points to enable edge devices to perceive and understand the driving environment [18]. LiDAR-based approaches exploit accurate 3D geometric structures to achieve high-precision scene perception. For instance, PointNet [19] and its variants design networks that operate directly on point cloud inputs while preserving their spatial structures. VoxelNet [20] introduces a trainable end-to-end framework that converts irregular point sets into regular voxel grids, enabling 3D convolutions to learn features effectively. Although LiDAR-based methods exhibit strong robustness against illumination and weather variations, they often lack the ability to capture rich semantic and texture information [21]. In contrast, camera-based approaches rely on image data with abundant texture details to achieve fine-grained recognition of scene appearance and object categories. For example, PETR [22] integrates 2D visual features with 3D spatial coordinates to generate position-aware representations, facilitating the detection of 3D objects. However, since cameras are highly sensitive to illumination changes and occlusions, the performance of such methods tends to degrade under complex environmental conditions. In summary, single-modal perception approaches are inherently limited by the weaknesses of individual sensors and struggle to maintain complete and reliable perception in challenging autonomous driving environments.

2.2. Multi-Modal Perception

To better leverage the unique strengths provided by various sensing modalities, an increasing number of studies have focused on multi-modal perception in autonomous driving [10]. According to the stage of information interaction, existing fusion methods are typically divided into early fusion, middle fusion, and late fusion. In early fusion, raw data from different modalities are first projected into a unified coordinate system before feature extraction. For example, AutoAlignV2 [13] introduces a deformable feature aggregation module that enables the model to adaptively align LiDAR points with image pixels according to the geometric structure of the scene. Middle fusion approaches, on the other hand, perform modality-specific feature encoding followed by feature-level integration. BEVFusion [23] maps the outputs of different sensors into a shared Bird's-Eye-View (BEV) representation space, allowing joint cross-modal encoding and spatially consistent fusion. In late fusion, the integration occurs at the decision level, where the emphasis lies in ensuring consistency across the outputs of different modality-specific detectors. SparseFusion [24] aggregates the detection outputs from LiDAR and camera modalities as candidate points and performs decision-level fusion, achieving efficient and complementary multi-sensor integration. Although multi-modal fusion has significantly improved perception accuracy and environmental understanding, its performance still deteriorates when partial sensor data are missing. This remains one of the major bottlenecks for deploying multi-modal perception systems in real-world autonomous driving scenarios. In addition, recent work on wireless imaging from channel state information with ray tracing for next-generation 6G networks offers promising non-visual cues that remain robust under darkness or fog, potentially serving as complementary modalities for future multi-modal perception frameworks.

2.3. Missing Modality Perception

In response to the challenge of missing modalities, a growing body of research has recently emerged [15,16,25], aiming to enhance the robustness of multi-modal perception systems under incomplete input conditions. MSeg3D [15] introduces a unified projection-based cross-modal alignment framework that maintains effective perception capability even when some camera views are unavailable. MMFusion [26] adopts an attention-weighted strategy to dynamically modify the contribution of each modality, emphasizing reliable features while suppressing the influence of missing or noisy modalities. Although these methods have achieved notable progress in improving robustness against missing inputs, most of them primarily focus on robust fusion rather than explicitly modeling or reconstructing the missing information. Consequently, as the degree of modality absence increases, the overall perception performance still degrades significantly. Recent studies have also explored learning from unannotated or incomplete data. Njima et al. [27] employ a weighting-based semi-supervised deep learning strategy, together with an alternative localization framework based on generative adversarial modeling, to mitigate the challenge posed by limited labeled samples during model training. That line of work aims to alleviate the problem of insufficient training samples, whereas ours addresses feature reconstruction when camera views are missing; although the two problems pose different challenges, the underlying ideas are complementary. Building on the above, our work explores a missing-feature reconstruction strategy from the perspective of intra-modal feature correlations, aiming to achieve more robust and informative multi-modal representation learning.

3. Method

3.1. Overall Framework

Figure 2 outlines the pipeline of RMP, which is structured around three modules: feature extraction, missing feature reconstruction, and multi-modal fusion. First, the LiDAR point cloud and the available camera views are each processed by their respective backbones to extract initial features (voxel features for LiDAR, image features for the available views), and the LiDAR voxel features are then projected to the point domain through a voxel-to-point transformation. Second, for the missing camera views, the corresponding image features are reconstructed by leveraging intra-modal correlations from neighboring camera perspectives, after which the reconstructed image features are also converted to the point level. Finally, point-wise features from both modalities are jointly aggregated through the multi-modal fusion module, producing the fused multi-modal representations that serve as the final perception features.

3.2. Feature Extraction

Given a set of $N$ raw LiDAR points $P \in \mathbb{R}^{N \times C_{3D}^{in}}$ with input dimension $C_{3D}^{in}$, the 3D backbone extracts voxel features $V \in \mathbb{R}^{N_v \times C_v}$. Similar to [28], a voxel-to-point mapping $M_{vp}$ is applied to obtain point-wise features $F_p \in \mathbb{R}^{N \times C_{3D}}$. For multi-view images captured by $N_c$ cameras, the 2D backbone extracts image features from the available views, denoted as $I_r \in \mathbb{R}^{N_c \times C_c \times H_c \times W_c}$, where $C_c$, $H_c$, and $W_c$ represent the channel dimension, height, and width of the feature maps, respectively. To simulate real-world missing-view conditions, one or more camera images are randomly masked during training, and their corresponding original image features are recorded as $I_m$.
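To make the masking step concrete, the following minimal sketch (PyTorch) shows how one or more camera views could be randomly dropped while their original features are recorded as reconstruction targets. The function name, shapes, and zero placeholder are illustrative assumptions, not the authors' implementation.

```python
import torch

def mask_random_views(img_feats: torch.Tensor, num_missing: int = 1):
    """Randomly drop camera views to simulate sensor failure.

    img_feats: (N_c, C_c, H_c, W_c) image features from the 2D backbone.
    Returns the masked feature set (stand-in for I_r), the recorded originals
    of the dropped views (I_m, used by the MFR loss), and the missing indices.
    """
    n_views = img_feats.shape[0]
    missing = torch.randperm(n_views)[:num_missing]   # which views to drop
    originals = img_feats[missing].clone()            # kept as reconstruction targets
    masked = img_feats.clone()
    masked[missing] = 0.0                             # placeholder; MFR later substitutes mask tokens
    return masked, originals, missing
```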

3.3. Missing Feature Reconstruction

This module aims to reconstruct the image features of missing camera views based on the information available from existing views. Adjacent camera views often share overlapping regions, which provide valuable intra-modal contextual cues. Building on this observation, we design a Missing Feature Reconstruction (MFR) mechanism that employs the shared information among neighboring views to infer and reconstruct the features corresponding to the missing perspectives.
For clarity of explanation, we assume in this section that the front-view image is missing. To handle this case, we first replace the missing front-view features with shared, learnable mask tokens, and denote the replaced features as $I_t$. The learnable mask token is initialized from a zero-mean Gaussian distribution. Let $I_{left}$ and $I_{right}$ represent the image features extracted from the neighboring cameras positioned at the front-left and front-right sides, respectively. As illustrated in Figure 2, the left region of the front camera is highly correlated with the right region of the front-left camera, while its right region is highly correlated with the left region of the front-right camera. Based on this observation, we introduce a cropping ratio $\eta$ to divide the image features along the width dimension into three parts: left, with column indices $[0, \eta W_c)$; middle, with column indices $[\eta W_c, (1-\eta) W_c)$; and right, with column indices $[(1-\eta) W_c, W_c)$. (1) For the left part, we reference the right-region features of the front-left view, denoted as $I_{left}[right]$; (2) for the right part, we reference the left-region features of the front-right view, denoted as $I_{right}[left]$; (3) for the middle part, since no adjacent view can provide reliable guidance, we retain the corresponding mask tokens of the front view. The reconstructed front-view feature therefore consists of the following three parts:
$I_{mask} = \mathcal{C}\left[\, I_{left}[right],\; I_t[mid],\; I_{right}[left] \,\right]$
where $\mathcal{C}$ denotes the concatenation operation. The merged feature representation $I_{mask}$ is processed by a Transformer-based decoder to produce the reconstructed feature of the missing view, denoted as $I_{reg}$:
$I_{reg} = \mathrm{Decoder}(I_{mask})$
After reconstructing the missing-view features using intra-modal information from adjacent perspectives, the disparity between the reconstructed features $I_{reg}$ and the original features $I_m$ is quantified with the Mean Squared Error (MSE):
$\mathcal{L}_{MFR} = \mathrm{MSE}(I_m, I_{reg})$
The core of MFR relies on the relative adjacency relationship between cameras rather than on absolute field-of-view settings. For any missing viewpoint, its left and right adjacent cameras are selected based on the camera positions, and the cropping-region division remains consistent with the description above. For boundary viewpoints with only one adjacent camera, the unavailable side is filled with the learnable mask tokens; for example, if the left adjacent camera is also missing, the reconstructed feature becomes $I_{mask} = \mathcal{C}\left[\, I_t[right],\; I_t[mid],\; I_{right}[left] \,\right]$. The subsequent decoding process remains unchanged.
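The sketch below illustrates the MFR computation in PyTorch. It is a simplified rendition under explicit assumptions: the Transformer-based decoder is approximated by a standard nn.TransformerEncoder over flattened spatial tokens, the channel dimension is assumed divisible by the number of attention heads, and all class, argument, and variable names are hypothetical rather than taken from the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MissingFeatureReconstruction(nn.Module):
    """Rebuild a missing view from its front-left and front-right neighbours.

    Each view feature has shape (C_c, H_c, W_c), following Section 3.2.
    The 'Transformer-based decoder' is approximated here by a plain
    nn.TransformerEncoder; C_c must be divisible by the number of heads.
    """

    def __init__(self, c_c: int, h_c: int, w_c: int, eta: float = 0.2):
        super().__init__()
        self.eta = eta
        # shared, learnable mask token (zero-mean Gaussian init), broadcast over width
        self.mask_token = nn.Parameter(torch.randn(c_c, h_c, 1) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=c_c, nhead=4, dim_feedforward=512,
                                           dropout=0.2, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, feat_left: torch.Tensor, feat_right: torch.Tensor) -> torch.Tensor:
        c, h, w = feat_left.shape
        k = int(self.eta * w)                                  # columns borrowed from each neighbour
        left_part = feat_left[..., w - k:]                     # right region of the front-left view
        right_part = feat_right[..., :k]                       # left region of the front-right view
        mid_part = self.mask_token.expand(c, h, w - 2 * k)     # mask tokens for the unobserved middle
        i_mask = torch.cat([left_part, mid_part, right_part], dim=-1)  # concatenate along width
        tokens = i_mask.flatten(1).transpose(0, 1).unsqueeze(0)        # (1, H*W, C) token sequence
        i_reg = self.decoder(tokens)                                    # Transformer-based decoding
        return i_reg.squeeze(0).transpose(0, 1).reshape(c, h, w)

# Illustrative usage: reconstruct the front view, then penalise the gap to the
# recorded original features with the MSE-based MFR loss.
# mfr = MissingFeatureReconstruction(c_c=64, h_c=32, w_c=88, eta=0.2)
# i_reg = mfr(feat_front_left, feat_front_right)
# loss_mfr = F.mse_loss(i_reg, i_m)
```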

3.4. Multi-Modal Fusion

The reconstructed front-view image features $I_{reg}$ are combined with the remaining camera-view features $I_r$ to form the reconstructed multi-view image feature $I$. Similar to the LiDAR point cloud, a point-to-pixel mapping $M_{pc}$ is applied to obtain 2D point-wise features $F_I \in \mathbb{R}^{N \times C_{2D}}$. Both $F_I$ and $F_p$ are then processed within the multi-modal fusion module, as shown in Figure 2. Specifically, a linear projection is first applied to map both $F_I$ and $F_p$ into a common feature space with a unified dimension of $N \times C_u$. Then, $1 \times 1$ convolutional layers are used to expand the feature channels of $F_I$ and $F_p$ to appropriate dimensions. The expanded features are summed element-wise, and a convolution together with a tanh activation generates the feature weighting map $w$. The tanh function normalizes the cross-modal interaction into the range $[-1, 1]$, allowing the network to express both positive enhancement and negative suppression when combining features. The process can be expressed as follows:
$F_I = \mathrm{Linear}(F_I), \quad F_p = \mathrm{Linear}(F_p)$
$w = \sigma\big( \mathrm{Conv}( \tanh( \mathrm{Conv}(F_p) + \mathrm{Conv}(F_I) ) ) \big)$
where $\sigma$ converts the result into a probabilistic confidence weight $w \in (0, 1)$, representing the relative reliability of the LiDAR and image features for each point. Next, the 3D point-wise features and 2D point-wise features are adaptively fused according to the learned feature weights $w$, and the fused representation is denoted as $F_{fu}$. The fusion process is formulated as follows:
$F_{fu} = \mathrm{Cat}\big[\, w \cdot F_p,\; (1 - w) \cdot F_I \,\big]$
Finally, the fused feature $F_{fu}$ is passed through an MLP-based segmentation head to produce the final segmentation prediction $\hat{Y}$:
$\hat{Y} = \mathrm{MLP}(F_{fu})$
Given the model's segmentation output $\hat{Y}$ and the ground-truth labels $Y$, the model is supervised by a composite segmentation loss consisting of the Lovasz term $\mathcal{L}_l$ and the cross-entropy term $\mathcal{L}_c$:
$\mathcal{L}_{Seg} = \mathcal{L}_l(\hat{Y}, Y) + \mathcal{L}_c(\hat{Y}, Y)$
The final training objective integrates the reconstruction loss and the segmentation loss with weighting coefficients $\lambda_1$ and $\lambda_2$, defined as follows:
$\mathcal{L} = \lambda_1 \mathcal{L}_{MFR} + \lambda_2 \mathcal{L}_{Seg}$
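A compact sketch of the adaptive fusion and segmentation head is given below (PyTorch). The $1 \times 1$ convolutions operate on point-wise features arranged as a length-$N$ sequence; the module name, channel sizes, and the two-layer segmentation head are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Adaptively fuse LiDAR and image point-wise features and predict labels.

    f_p: (N, C_3D) LiDAR point features; f_i: (N, C_2D) image point features.
    The 1x1 convolutions run over the point dimension, i.e. on (1, C_u, N) tensors.
    """

    def __init__(self, c_3d: int, c_2d: int, c_u: int, num_classes: int):
        super().__init__()
        self.proj_p = nn.Linear(c_3d, c_u)       # project both modalities to the unified dim C_u
        self.proj_i = nn.Linear(c_2d, c_u)
        self.conv_p = nn.Conv1d(c_u, c_u, kernel_size=1)
        self.conv_i = nn.Conv1d(c_u, c_u, kernel_size=1)
        self.conv_w = nn.Conv1d(c_u, 1, kernel_size=1)
        self.head = nn.Sequential(nn.Linear(2 * c_u, c_u), nn.ReLU(),
                                  nn.Linear(c_u, num_classes))  # MLP segmentation head

    def forward(self, f_p: torch.Tensor, f_i: torch.Tensor) -> torch.Tensor:
        f_p, f_i = self.proj_p(f_p), self.proj_i(f_i)            # (N, C_u) each
        p, i = f_p.T.unsqueeze(0), f_i.T.unsqueeze(0)            # (1, C_u, N) for the 1x1 convs
        # sum -> tanh -> conv -> sigmoid gives a per-point confidence weight in (0, 1)
        w = torch.sigmoid(self.conv_w(torch.tanh(self.conv_p(p) + self.conv_i(i))))
        w = w.squeeze(0).T                                        # (N, 1)
        f_fu = torch.cat([w * f_p, (1.0 - w) * f_i], dim=-1)      # weighted concatenation
        return self.head(f_fu)                                     # per-point class logits
```

During training, the resulting logits would be supervised with the Lovasz and cross-entropy terms (the Lovasz loss typically comes from an external implementation), and the total objective adds the MFR reconstruction loss with the weights $\lambda_1$ and $\lambda_2$.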

4. Experiment

4.1. Dataset

We assess the performance and robustness of our method using the nuScenes dataset [17]. This dataset contains synchronized multi-sensor recordings captured in complex urban environments, incorporating six camera views, a LiDAR scanner, and multiple radars. It consists of 1000 driving scenes, each lasting about 20 s, split into 700 scenes for training, 150 for validation, and 150 for testing. Each scene is annotated with semantic labels, ego-vehicle poses, and other metadata, making the dataset suitable for various 3D scene understanding tasks.

4.2. Metrics

Following previous studies [5], the mean Intersection-over-Union (mIoU) is employed as the metric for evaluating segmentation performance. For each semantic class $C$, the IoU is computed as $IoU_C = \frac{TP_C}{TP_C + FP_C + FN_C}$, where $TP_C$, $FP_C$, and $FN_C$ denote the numbers of true positive, false positive, and false negative predictions for the $C$-th category. The mIoU is obtained by averaging the IoU over all valid classes. All results are reported using the official nuScenes class definitions (17 classes).
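For reference, the per-class IoU and the resulting mIoU can be computed with a few lines of NumPy. This is a generic sketch: the function name is hypothetical, and skipping classes that appear in neither the prediction nor the ground truth follows common practice rather than the authors' exact evaluation code.

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Compute IoU_C = TP_C / (TP_C + FP_C + FN_C) per class and average.

    pred, gt: integer label arrays of identical shape (e.g. per-point labels).
    Classes absent from both prediction and ground truth are skipped.
    """
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return float(np.mean(ious)) if ious else 0.0
```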

4.3. Implementation Details

Following [15], the LiDAR branch is constructed using a 3D U-Net [29], while the camera branch utilizes HRNet18 [30]. Unless otherwise specified, one camera view is randomly masked during both training and inference to simulate missing-view scenarios. Optimization is performed with Adam [31] at a learning rate of 0.001. Training uses a batch size of 2 for 48 epochs, and the cropping ratio $\eta$ is set to 0.2. To accelerate training, each image is cropped to 40% of its original resolution. The Transformer decoder adopts a four-layer architecture with 4 heads for multi-head attention, a hidden dimension of 512, and a dropout rate of 0.2. The weights of $\mathcal{L}_{Seg}$ and $\mathcal{L}_{MFR}$ are set to 1 and 2, respectively. Similar to [15], we use data augmentation techniques such as LiDAR rotation (about the Z axis, with angles sampled from $[-\frac{\pi}{4}, \frac{\pi}{4}]$) and image scaling (with a stochastic factor drawn from $[1.0, 1.5]$) to avoid overfitting. Experiments are run on four NVIDIA RTX 4090 GPUs, and no Test-Time Augmentation (TTA) is employed during either training or inference.
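For completeness, the two augmentations mentioned above can be sketched as follows (NumPy). The rotation range is taken as $[-\pi/4, \pi/4]$ and the scaling range as $[1.0, 1.5]$ per the text; the function names and the assumed point-array layout are illustrative.

```python
import numpy as np

def rotate_lidar_z(points: np.ndarray) -> np.ndarray:
    """Rotate a point cloud about the Z axis by an angle drawn from [-pi/4, pi/4].

    points: (N, D) array whose first three columns are x, y, z.
    """
    theta = np.random.uniform(-np.pi / 4, np.pi / 4)
    rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                    [np.sin(theta),  np.cos(theta), 0.0],
                    [0.0,            0.0,           1.0]])
    out = points.copy()
    out[:, :3] = out[:, :3] @ rot.T
    return out

def random_image_scale_factor() -> float:
    """Draw the stochastic image scaling factor from [1.0, 1.5]."""
    return float(np.random.uniform(1.0, 1.5))
```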

4.4. SOTA Comparison

nuScenes testing. Table 1 reports the performance of different methods on the nuScenes testing set. During both training and inference, our method assumes that one random camera view is missing. As shown in the table, multi-modal approaches generally outperform single-modality ones, demonstrating the advantage of combining heterogeneous data to mitigate sensor-specific weaknesses and enhance perception accuracy in complex driving scenarios. Among the multi-modal methods, our approach achieves competitive performance, only 1.3% mIoU lower than MSeg3D-H48 [15]. We attribute this small gap to two main factors: (1) MSeg3D-H48 adopts a more powerful image backbone, HRNet48, whereas we employ the lighter HRNet18 to meet the efficiency requirements of edge devices; and (2) MSeg3D-H48 is trained and evaluated under fully available modalities, while our method is designed for random missing-view conditions in both training and inference. To ensure fairness, we re-trained MSeg3D using HRNet18 under the same settings as our method. The results show that our method surpasses the re-trained MSeg3D by 7.2% mIoU, which further validates the robustness and effectiveness of our proposed framework under incomplete modality scenarios.
nuScenes validation. We also conducted experiments on the nuScenes validation set, as presented in Table 2. The experimental protocol mirrors that used for the testing benchmark. As shown in the results, multi-modal methods consistently outperform single-modality ones, highlighting the advantages of leveraging complementary information across different sensors. Notably, our approach attains an mIoU of 80.3%, the highest score among all evaluated models. As for the testing set, we also re-trained MSeg3D under the same experimental configuration for a fair comparison. Our method surpasses the re-trained MSeg3D by 5.6% mIoU, primarily owing to the proposed missing-feature completion strategy. Furthermore, we incorporated the MFR module into MSeg3D, and the results show a performance improvement of 4.3% mIoU, demonstrating that this gain primarily stems from the MFR module rather than from differences in backbone capabilities. The MFR module explicitly reconstructs semantic features from missing views, providing a more complete feature space for the fusion network. This effectively alleviates the feature imbalance caused by missing modalities, making the final multi-modal fusion more stable. These observations further validate the robustness and effectiveness of our work under missing-modality conditions.
Different crop ratio & missing camera views. Figure 3 (left) evaluates how different crop ratios influence segmentation performance. As the crop ratio increases from 0.16 to 0.20, the mIoU steadily improves and reaches its peak at 0.20. When the ratio continues to rise, the performance slightly declines, suggesting that excessive cropping may introduce noise or remove useful contextual information. A moderate crop ratio therefore yields the best model performance. The right part of Figure 3 plots the results of our method and MSeg3D under varying numbers of missing cameras. As the number of available cameras decreases from six to zero, MSeg3D exhibits a significant drop in mIoU, while our method experiences only a minor decrease. The relative improvement over MSeg3D grows from 5.1% to 7.8%, demonstrating the strong robustness of our approach under missing-modality conditions. Furthermore, when the number of missing cameras exceeds three, the performance decline of RMP slows. We attribute this to the limited amount of adjacent-view information available to MFR: once most neighboring views are absent, losing additional views provides little new contextual information. These results confirm that the proposed modality completion and adaptive fusion strategies can effectively compensate for the absence of visual inputs, ensuring stable perception performance in complex autonomous driving scenarios.

4.5. Ablation Study

Module ablation & Different learning rate. Figure 4 shows the performance variation of the model on the nuScenes validation set under different settings. To verify the effectiveness of the MFR module, we removed it and replaced the missing features with zero values, as shown in Figure 4a. The results indicate a degradation of 2.4% mIoU without MFR, demonstrating that explicit modeling has a significant advantage over the simple imputation strategy. Figure 4b compares three fusion strategies. Compared to average fusion and fixed-weight fusion (LiDAR accounting for 60%), the proposed adaptive fusion achieves gains of 1.1% mIoU and 1.8% mIoU, respectively. This highlights that dynamically adjusting the weights of different modal features can achieve superior perceptual performance. Figure 4c and Figure 5 report how different learning rates affect model performance. When the learning rate is set to 0.001, the model achieves optimal performance. In our experiments, this setting yielded more stable results compared to larger learning rates.
Efficiency and Performance Analysis. Figure 6 presents a comprehensive assessment of different methods in terms of efficiency and performance. As shown, our approach achieves the best balance between the two compared to other competitors. Specifically, our model attains an mIoU of 80.3% with an inference time of 207 ms. In contrast, MSeg3D-H48, which achieves similar accuracy, requires a longer inference time (445 ms versus our 207 ms) and a larger model size (87.3 M versus our 30 M parameters). Compared with MSeg3D, which has a parameter scale similar to ours (31.5 M versus our 30 M), our method delivers an additional 1.1 percentage points of mIoU. When contrasting MSeg3D-H48 with MSeg3D, selecting HRNet-48 yields a 0.7% performance boost but significantly increases both model parameters and inference time. Considering the resource constraints of edge devices, our method therefore selects HRNet-18 as the backbone for the image branch.
Visualization. We visualize the results of our approach and MSeg3D on the nuScenes validation set in Figure 7. The figure presents results from three different viewpoints and compares them with the ground truth. As shown, our model produces noticeably cleaner and more reliable segmentation than MSeg3D. Specifically, for dynamic categories such as trailer, our approach produces clearer and more complete boundaries (a 1.7% mIoU improvement). Moreover, for large-scale static elements such as vegetation, it exhibits superior regional consistency (a 0.5% mIoU improvement). In addition, our method corrects segmentation errors made by MSeg3D, such as misclassifying pedestrians near the trailer (a 0.4% mIoU improvement). Overall, the visual comparisons provide strong evidence of the effectiveness and robustness of our proposed method.

5. Conclusions

In this study, we proposed RMP, a robust multi-modal perception model designed to handle missing modality scenarios in autonomous driving. RMP integrates a missing feature reconstruction module and a cross-modal adaptive fusion strategy to enhance perception robustness under an incomplete sensor environment. A series of experiments on the nuScenes benchmark demonstrates that the proposed method effectively mitigates performance degradation caused by missing camera views and produces results comparable to those of the most advanced contemporary methods. Future work will focus more on practical deployment scenarios and conduct more comprehensive efficiency analyses, such as the peak memory consumption on edge devices with different hardware configurations.

Author Contributions

X.M.: literature search, method design, experiment, writing. X.C.: literature search, figures, data analysis, writing. Y.S.: literature search, data analysis. Y.L.: data analysis, experiment. G.L.: method design, writing. Y.Y.: experiment, writing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the China Postdoctoral Science Foundation under Grant 2023T160206.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in the nuScenes Dataset at https://www.nuscenes.org/nuscenes (accessed on 10 May 2024), reference number [17].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mohapatra, S.; Yogamani, S.; Gotzig, H.; Milz, S.; Mader, P. BEVDetNet: Bird’s eye view LiDAR point cloud based real-time 3D object detection for autonomous driving. In Proceedings of the IEEE International Intelligent Transportation Systems Conference, Indianapolis, IN, USA, 19–22 September 2021; pp. 2809–2815. [Google Scholar]
  2. Ma, R.; Chen, C.; Yang, B.; Li, D.; Wang, H.; Cong, Y.; Hu, Z. CG-SSD: Corner guided single stage 3D object detection from LiDAR point cloud. ISPRS J. Photogramm. Remote Sens. 2022, 191, 33–48. [Google Scholar] [CrossRef]
  3. Wu, X.; Hou, Y.; Huang, X.; Lin, B.; He, T.; Zhu, X.; Ma, Y.; Wu, B.; Liu, H.; Cai, D.; et al. TASeg: Temporal Aggregation Network for LiDAR Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 15311–15320. [Google Scholar]
  4. Yan, S.; Wang, S.; Duan, Y.; Hong, H.; Lee, K.; Kim, D.; Hong, Y. An LLM-Assisted Easy-to-Trigger Backdoor Attack on Code Completion Models: Injecting Disguised Vulnerabilities against Strong Detection. In Proceedings of the 33rd USENIX Security Symposium (USENIX Security), Philadelphia, PA, USA, 14–16 August 2024. [Google Scholar]
  5. Yan, X.; Gao, J.; Zheng, C.; Zheng, C.; Zhang, R.; Cui, S.; Li, Z. 2DPASS: 2D priors assisted semantic segmentation on lidar point clouds. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 677–695. [Google Scholar]
  6. Zhu, X.; Zhou, H.; Wang, T.; Hong, F.; Ma, Y.; Li, W.; Li, H.; Lin, D. Cylindrical and asymmetrical 3D convolution networks for LiDAR segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9939–9948. [Google Scholar]
  7. Zhang, R.; Qiu, H.; Wang, T.; Guo, Z.; Cui, Z.; Qiao, Y.; Li, H.; Gao, P. MonoDETR: Depth-guided transformer for monocular 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 9155–9166. [Google Scholar]
  8. Zhou, Y.; Zhu, H.; Liu, Q.; Chang, S.; Guo, M. Monoatt: Online monocular 3D object detection with adaptive token transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 17493–17503. [Google Scholar]
  9. Yin, J.; Shen, J.; Chen, R.; Li, W.; Yang, R.; Frossard, P.; Wang, W. Is-fusion: Instance-scene collaborative fusion for multimodal 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 14905–14915. [Google Scholar]
  10. Sun, T.; Zhang, Z.; Tan, X.; Peng, Y.; Qu, Y.; Xie, Y. Uni-to-multi modal knowledge distillation for bidirectional lidar-camera semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 11059–11072. [Google Scholar] [CrossRef] [PubMed]
  11. Li, Y.; Qi, X.; Chen, Y.; Wang, L.; Li, Z.; Sun, J.; Jia, J. Voxel field fusion for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 1120–1129. [Google Scholar]
  12. Njima, W.; Chafii, M.; Shubair, R. GAN based data augmentation for indoor localization using labeled and unlabeled data. In Proceedings of the International Balkan Conference on Communications and Networking, Novi Sad, Serbia, 20–22 September 2021; pp. 36–39. [Google Scholar]
  13. Chen, Z.; Li, Z.; Zhang, S.; Fang, L.; Jiang, Q.; Zhao, F. Deformable feature aggregation for dynamic multi-modal 3D object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 628–644. [Google Scholar]
  14. Li, X.; Ma, T.; Hou, Y.; Shi, B.; Yang, Y.; Liu, Y.; Wu, X.; Chen, Q.; Li, Y.; Qiao, Y. Logonet: Towards accurate 3D object detection with local-to-global cross-modal fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 17524–17534. [Google Scholar]
  15. Li, J.; Dai, H.; Han, H.; Ding, Y. Mseg3d: Multi-modal 3d semantic segmentation for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 21694–21704. [Google Scholar]
  16. Ge, C.; Chen, J.; Xie, E.; Wang, Z.; Hong, L.; Lu, H.; Li, Z.; Luo, P. Metabev: Solving sensor failures for 3d detection and map segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 8721–8731. [Google Scholar]
  17. Caesar, H.; Bankiti, V.; Lang, A.; Vora, S.; Liong, V.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11621–11631. [Google Scholar]
  18. Lai, X.; Chen, Y.; Lu, F.; Liu, J.; Jia, J. Spherical transformer for lidar-based 3d recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 17545–17555. [Google Scholar]
  19. Qi, C.; Su, H.; Mo, K.; Guibas, L. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  20. Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud-based 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499. [Google Scholar]
  21. Li, Z.; Wang, F.; Wang, N. Lidar r-cnn: An efficient and universal 3d object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7546–7555. [Google Scholar]
  22. Liu, Y.; Wang, T.; Zhang, X.; Sun, J. Petr: Position embedding transformation for multi-view 3d object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 531–548. [Google Scholar]
  23. Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.; Han, S. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. arXiv 2022, arXiv:2205.13542. [Google Scholar]
  24. Xie, Y.; Xu, C.; Rakotosaona, M.; Rim, P.; Tombari, F.; Keutzer, K.; Tomizuka, M.; Zhan, W. Sparsefusion: Fusing multi-modal sparse representations for multi-sensor 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 17591–17602. [Google Scholar]
  25. Zou, J.; Huang, T.; Yang, G.; Guo, Z.; Luo, T.; Feng, C.; Zuo, W. Unim2AE: Multi-modal masked autoencoders with unified 3d representation for 3d perception in autonomous driving. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 296–313. [Google Scholar]
  26. Cui, L.; Li, X.; Meng, M.; Mo, X. MMFusion: A generalized multi-modal fusion detection framework. In Proceedings of the IEEE International Conference on Development and Learning, Macau, China, 9–12 November 2023; pp. 415–422. [Google Scholar]
  27. Njima, W.; Bazzi, A.; Chafii, M. DNN-Based Indoor Localization Under Limited Dataset Using GANs and Semi-Supervised Learning. IEEE Access 2022, 10, 69896–69909. [Google Scholar] [CrossRef]
  28. Yu, H.; Chan, S.; Zhou, X.; Zhang, X. SGFormer: Semantic-Geometry Fusion Transformer for Multi-modal 3D Panoptic Segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 25–27 February 2025; pp. 9616–9625. [Google Scholar]
  29. Shi, S.; Wang, Z.; Shi, J.; Wang, X.; Li, H. From points to parts: 3D object detection from point cloud with part-aware and part-aggregation network. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 48, 2647–2664. [Google Scholar] [CrossRef] [PubMed]
  30. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364. [Google Scholar] [CrossRef] [PubMed]
  31. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar] [CrossRef]
  32. Zhang, Y.; Zhou, Z.; David, P.; Yue, X.; Xi, Z.; Gong, B.; Foroosh, H. Polarnet: An improved grid representation for online lidar point clouds semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 9601–9610. [Google Scholar]
  33. Yan, X.; Gao, J.; Li, J.; Zhang, R.; Li, Z.; Huang, R.; Cui, S. Sparse single sweep lidar point cloud segmentation via learning contextual shape priors from scene completion. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 2–9 February 2021; pp. 3101–3109. [Google Scholar]
  34. Tang, H.; Liu, Z.; Zhao, S.; Lin, Y.; Lin, J.; Wang, H.; Han, S. Searching efficient 3d architectures with sparse point-voxel convolution. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 685–702. [Google Scholar]
  35. Cheng, R.; Razani, R.; Taghavi, E.; Li, E.; Liu, B. 2-s3net: Attentive feature fusion with adaptive feature selection for sparse semantic segmentation network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12547–12556. [Google Scholar]
  36. Tan, S.; Fazlali, H.; Xu, Y.; Ren, Y.; Liu, B. Uplifting range-view-based 3D semantic segmentation in real-time with multi-sensor fusion. In Proceedings of the IEEE International Conference on Robotics and Automation, Yokohama, Japan, 13–17 May 2024; pp. 16162–16169. [Google Scholar]
  37. Zhuang, Z.; Li, R.; Jia, K.; Wang, Q.; Li, Y.; Tan, M. Perception-aware multi-sensor fusion for 3d lidar semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 16280–16290. [Google Scholar]
  38. Wu, Z.; Zhang, Y.; Lan, R.; Qiu, S.; Ran, S.; Liu, Y. APPFNet: Adaptive point-pixel fusion network for 3D semantic segmentation with neighbor feature aggregation. Expert Syst. Appl. 2024, 251, 123990. [Google Scholar] [CrossRef]
  39. Tan, M.; Zhuang, Z.; Chen, S.; Li, R.; Jia, K.; Wang, Q.; Li, Y. EPMF: Efficient perception-aware multi-sensor fusion for 3D semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 8258–8273. [Google Scholar] [CrossRef] [PubMed]
  40. Li, J.; Dai, H.; Ding, Y. Self-distillation for robust LiDAR semantic segmentation in autonomous driving. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 659–676. [Google Scholar]
  41. Genova, K.; Yin, X.; Kundu, A.; Pantofaru, C.; Cole, F.; Sud, A.; Brewington, B.; Shucker, B.; Funkhouser, T. Learning 3D semantic segmentation with only 2D image supervision. In Proceedings of the International Conference on 3D Vision, London, UK, 1–3 December 2021; pp. 361–372. [Google Scholar]
Figure 1. Comparison of traditional approaches and ours for addressing issues of missing sensor modalities when running multi-modal perception models on edge computing devices. When cameras are missing, traditional methods suffer significant performance degradation (4.7% mIoU). We designed a missing feature reconstruction module that can reconstruct missing features based on intra-modal correlation cues.
Figure 2. Overall framework of our RMP. RMP is structured around three components: feature extraction, missing feature reconstruction, and multi-modal fusion. The missing feature reconstruction module reconstructs missing features based on intra-modal commonality cues. Then, the reconstructed image features and LiDAR features undergo fusion to achieve a robust joint representation.
Figure 3. Ablation studies on crop ratio and the number of missing cameras. The left figure presents the model performance under different crop ratios. The right figure compares the model performance of our work and MSeg3D as the number of missing cameras increases. * denotes that the training and inference environments are identical to those used in this paper.
Figure 4. Ablation studies on the Missing Feature Reconstruction (MFR) module, fusion strategies, and different learning rates.
Figure 5. Performance curves of the model under different learning rates on the nuScenes validation set.
Figure 6. Comparisons of inference time and performance of different approaches on the nuScenes validation set.
Figure 7. Visualization results of our work and MSeg3D on the nuScenes validation set.
Table 1. Quantitative results of different methods on the nuScenes testing set. 'L' denotes LiDAR input and 'C' denotes camera input. * denotes that the training and inference environments are identical to those used in this paper. The bold numbers indicate the best results.
Method | Input | mIoU | Barrier | Bicycle | Bus | Car | Construction | Motorcycle | Pedestrian | Traffic_cone | Trailer | Truck | Driveable | Other_flat | Sidewalk | Terrain | Manmade | Vegetation
PolarNet [32] | L | 69.4 | 72.2 | 16.8 | 77.0 | 86.5 | 51.1 | 69.7 | 64.8 | 54.1 | 69.7 | 63.5 | 96.6 | 67.1 | 77.7 | 72.1 | 87.1 | 84.5
JS3C-Net [33] | L | 73.6 | 80.1 | 26.2 | 87.8 | 84.5 | 55.2 | 72.6 | 71.3 | 66.3 | 66.8 | 71.2 | 96.8 | 64.5 | 76.9 | 74.1 | 87.5 | 86.1
SPVNAS [34] | L | 77.4 | 80.0 | 30.0 | 91.9 | 90.8 | 64.7 | 79.0 | 75.6 | 70.9 | 81.0 | 74.6 | 97.4 | 69.2 | 80.0 | 76.1 | 89.3 | 87.1
Cylinder3D [6] | L | 77.2 | 82.8 | 29.8 | 84.3 | 89.4 | 63.0 | 79.3 | 77.2 | 73.4 | 84.6 | 69.1 | 97.7 | 70.2 | 80.3 | 75.5 | 90.4 | 87.6
AF2S3Net [35] | L | 78.3 | 78.9 | 52.2 | 89.9 | 84.2 | 77.4 | 74.3 | 77.3 | 72.0 | 83.9 | 73.8 | 97.1 | 66.5 | 77.5 | 74.0 | 87.7 | 86.8
SphereFormer [18] | L | 78.1 | 81.5 | 39.7 | 93.4 | 87.5 | 66.4 | 75.7 | 77.2 | 70.6 | 85.6 | 73.6 | 97.6 | 64.8 | 79.8 | 75.0 | 92.2 | 89.0
LaCRange [36] | LC | 75.3 | 78.0 | 32.6 | 88.3 | 84.5 | 63.9 | 81.5 | 75.6 | 72.5 | 64.7 | 68.0 | 96.6 | 65.9 | 78.6 | 75.0 | 90.4 | 88.3
PMF-ResNet50 [37] | LC | 77.0 | 82.1 | 40.3 | 80.9 | 86.4 | 63.7 | 79.2 | 79.8 | 75.9 | 81.2 | 67.1 | 97.3 | 67.7 | 78.1 | 74.5 | 90.0 | 88.5
APPFNet [38] | LC | 78.1 | 77.2 | 52.2 | 90.9 | 93.6 | 54.2 | 79.2 | 80.7 | 71.2 | 64.1 | 84.2 | 97.5 | 73.9 | 77.2 | 75.2 | 91.1 | 87.9
PMF [37] | LC | 75.5 | 80.1 | 35.7 | 79.7 | 86.0 | 62.4 | 76.3 | 76.9 | 73.6 | 78.5 | 66.9 | 97.1 | 65.3 | 77.6 | 74.4 | 89.5 | 88.5
EPMF [39] | LC | 79.2 | 76.9 | 39.8 | 90.3 | 87.8 | 72.0 | 86.4 | 79.6 | 76.6 | 84.1 | 74.9 | 97.7 | 66.4 | 79.5 | 76.4 | 91.1 | 87.9
MSeg3D-H48 [15] | LC | 81.1 | 83.1 | 42.5 | 94.9 | 92.0 | 67.1 | 78.6 | 85.7 | 80.5 | 87.5 | 77.3 | 97.7 | 69.8 | 81.2 | 77.8 | 92.4 | 90.1
MSeg3D * [15] | LC | 72.6 | 71.5 | 39.7 | 92.5 | 87.8 | 45.0 | 79.5 | 76.9 | 61.2 | 51.0 | 83.0 | 95.7 | 69.5 | 69.3 | 70.3 | 85.3 | 84.1
Ours | LC | 79.8 | 82.5 | 44.9 | 92.7 | 91.0 | 71.5 | 73.9 | 82.5 | 76.1 | 85.7 | 75.5 | 97.4 | 69.1 | 79.9 | 76.3 | 90.3 | 87.4
Table 2. The results of different approaches on the nuScenes validation set. * denotes that the training and inference environments are identical to those used in this paper. The bold numbers indicate the best results.
Method | Input | mIoU | Barrier | Bicycle | Bus | Car | Construction | Motorcycle | Pedestrian | Traffic_cone | Trailer | Truck | Driveable | Other_flat | Sidewalk | Terrain | Manmade | Vegetation
AF2S3Net [35] | L | 62.2 | 60.3 | 12.6 | 82.9 | 80.0 | 20.1 | 62.0 | 59.0 | 49.0 | 42.2 | 67.4 | 94.2 | 68.0 | 64.1 | 68.6 | 82.9 | 82.4
PolarNet [32] | L | 71.0 | 74.7 | 28.2 | 85.3 | 90.9 | 35.1 | 77.5 | 71.3 | 58.8 | 57.4 | 76.1 | 96.5 | 71.1 | 74.7 | 74.0 | 87.3 | 85.7
Cylinder3D [6] | L | 76.1 | 76.4 | 40.3 | 91.2 | 93.8 | 51.3 | 78.0 | 78.9 | 64.9 | 62.1 | 84.4 | 96.8 | 71.6 | 76.4 | 75.4 | 90.5 | 87.4
2DPASS [5] | L | 76.4 | 74.4 | 44.3 | 93.6 | 92.0 | 54.0 | 79.7 | 78.9 | 57.2 | 72.5 | 85.7 | 96.2 | 72.7 | 74.1 | 74.5 | 87.5 | 85.4
SphereFormer [18] | L | 78.4 | 77.7 | 43.8 | 94.5 | 93.1 | 52.4 | 86.9 | 81.2 | 65.4 | 73.4 | 85.3 | 97.0 | 73.4 | 75.4 | 75.0 | 91.0 | 89.2
SDSeg3D [40] | L | 77.7 | 77.5 | 49.4 | 93.9 | 92.5 | 54.9 | 86.7 | 80.1 | 67.8 | 65.7 | 86.0 | 96.4 | 74.0 | 74.9 | 74.5 | 86.0 | 82.8
APPFNet [38] | LC | 78.1 | 77.2 | 52.2 | 90.9 | 93.6 | 54.2 | 79.2 | 80.7 | 71.2 | 64.1 | 84.2 | 97.5 | 73.9 | 77.2 | 75.2 | 91.1 | 87.9
PMF-ResNet50 [37] | LC | 79.0 | 74.9 | 55.4 | 91.0 | 93.0 | 60.5 | 80.3 | 83.2 | 73.6 | 67.2 | 84.5 | 95.9 | 75.1 | 74.6 | 75.5 | 90.3 | 89.0
2D3DNet [41] | LC | 79.0 | 78.3 | 55.1 | 95.4 | 87.7 | 59.4 | 79.3 | 80.7 | 70.2 | 68.2 | 86.6 | 96.1 | 74.9 | 75.7 | 75.1 | 91.4 | 89.9
MSeg3D-H48 [15] | LC | 80.0 | 79.2 | 59.8 | 96.1 | 89.4 | 54.1 | 89.3 | 82.2 | 72.8 | 70.4 | 86.0 | 96.7 | 73.6 | 76.1 | 75.6 | 89.3 | 88.3
MSeg3D * [15] | LC | 74.7 | 72.4 | 48.1 | 93.8 | 88.4 | 47.0 | 82.6 | 79.7 | 64.6 | 56.1 | 84.1 | 96.0 | 69.5 | 70.2 | 71.0 | 87.0 | 86.2
MSeg3D + MFR * [15] | LC | 79.0 | 78.4 | 55.4 | 95.3 | 89.4 | 57.3 | 88.4 | 83.4 | 68.8 | 66.5 | 86.1 | 96.4 | 74.9 | 74.3 | 73.0 | 89.0 | 87.1
Ours | LC | 80.3 | 79.2 | 56.8 | 96.1 | 89.7 | 56.7 | 89.5 | 84.4 | 72.6 | 71.1 | 87.7 | 96.9 | 75.9 | 76.5 | 76.1 | 89.3 | 87.4
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
