Article

MultiDistiller: Efficient Multimodal 3D Detection via Knowledge Distillation for Drones and Autonomous Vehicles

1 School of Electronic Information, Wuhan University, Wuhan 430079, China
2 Shanxi Road and Bridge Group Xinzhou National Highway Project Construction Management Co., Ltd., Xinzhou 034000, China
* Author to whom correspondence should be addressed.
Drones 2025, 9(5), 322; https://doi.org/10.3390/drones9050322
Submission received: 26 March 2025 / Revised: 20 April 2025 / Accepted: 21 April 2025 / Published: 22 April 2025
(This article belongs to the Special Issue Cooperative Perception for Modern Transportation)

Abstract

Real-time 3D object detection is a cornerstone for the safe operation of drones and autonomous vehicles (AVs)—drones must avoid millimeter-scale power lines in cluttered airspace, while AVs require instantaneous recognition of pedestrians and vehicles in dynamic urban environments. Although significant progress has been made in detection methods based on point clouds, cameras, and multimodal fusion, the computational complexity of existing high-precision models struggles to meet the real-time requirements of vehicular edge devices. Additionally, during the model lightweighting process, issues such as multimodal feature coupling failure and the imbalance between classification and localization performance often arise. To address these challenges, this paper proposes a knowledge distillation framework for multimodal 3D object detection, incorporating attention guidance, rank-aware learning, and interactive feature supervision to achieve efficient model compression and performance optimization. Specifically: To enhance the student model’s ability to focus on key channel and spatial features, we introduce attention-guided feature distillation, leveraging a bird’s-eye view foreground mask and a dual-attention mechanism. To mitigate the degradation of classification performance when transitioning from two-stage to single-stage detectors, we propose ranking-aware category distillation by modeling anchor-level distribution. To address the insufficient cross-modal feature extraction capability, we enhance the student network’s image features using the teacher network’s point cloud spatial priors, thereby constructing a LiDAR-image cross-modal feature alignment mechanism. Experimental results demonstrate the effectiveness of the proposed approach in multimodal 3D object detection. On the KITTI dataset, our method improves network performance by 4.89% even after reducing the number of channels by half.

1. Introduction

With smart transportation systems evolving towards an integrated air–ground–space framework, 3D detection plays a critical role in both drones and autonomous vehicles. In drone applications, 3D object detection enables accurate identification of both aerial and ground obstacles, supporting autonomous navigation, obstacle avoidance, and target tracking, thereby enhancing flight safety and mission efficiency. In the field of autonomous vehicles, this technology aids in precise localization, real-time obstacle detection, and comprehensive environmental perception in complex traffic conditions, providing crucial data support for autonomous driving decisions and fundamentally improving road safety and operational efficiency. An autonomous drone or vehicle system consists of three main components: perception, decision-making, and control. As the first step, perception plays a critical role, as its accuracy significantly impacts subsequent navigation decisions. In the field of autonomous driving, precise 3D object detection enables vehicles to identify pedestrians, other vehicles, and obstacles while predicting their movement trajectories, leading to safer driving decisions. Therefore, the accuracy and speed of 3D object detection are directly related to the safety and reliability of autonomous navigation.
Although current research remains centered primarily on ground-based single-platform perception, this cross-platform collaborative vision is driving the development of shared technologies such as multimodal fusion and lightweight modeling [1]. Furthermore, we believe that as air–ground integrated perception systems evolve, an increasing number of drone-based datasets will be collected, and our methods can be applied effectively in those settings as well. While recent studies have made progress in point cloud, camera-based, and multimodal detection, practical applications still face significant challenges [2]. The real-time fusion of heterogeneous sensor data, such as LiDAR and cameras, substantially increases computational complexity [3]. Additionally, existing lightweight methods struggle with coupled multimodal feature representation and with balancing classification and localization, leading to severe performance degradation during compression.
Knowledge distillation, an effective model compression technique [4], has been well-established in 2D vision by transferring “soft knowledge” from teacher models to enhance student models’ generalization capabilities. However, knowledge distillation in 3D scenes faces unique bottlenecks. The processing of point clouds and images through multi-view transformations such as BEV/RV and voxelization introduces complex spatial contextual relationships [5,6]. The parameter scale of 3D convolutional networks grows exponentially, while network heterogeneity exacerbates knowledge alignment difficulties. Additionally, the inherent complexity of 3D spatial data results in massive data volumes, further hindering direct migration of 2D distillation methods to 3D domains.
Existing approaches attempt to address these challenges. PointDistiller [7] improves local geometric structure learning of point clouds through dynamic graph convolution, while SparseKD [8] refines logit distillation to guide student models. However, these methods fail to enable models to adaptively learn high-quality features. Moreover, severe foreground-background imbalance in object detection is often overlooked, along with the varying importance of features across different channels and spatial positions. To mitigate this, we incorporate a bird’s-eye view (BEV) foreground mask, enabling the network to focus more on foreground learning and reduce background noise interference. Furthermore, by leveraging spatial and channel attention mechanisms, we emphasize more informative channels and spatial features, enhancing the network’s perception capabilities. While CaKDP [9] introduces student-driven class-aware knowledge distillation to mitigate classification discrepancies between heterogeneous detectors, it neglects relative confidence ranking among anchors, leading to classification errors and false positives. To address this, we propose ranking-aware category distillation, enabling the student model to learn more precise modeling relationships. Moreover, to extend our approach to multimodal 3D object detection, we introduce multimodal interactive feature supervision based on an intermediate fusion model to fully capture the correlations and potential of each modality.
The proposed cross-modal fusion knowledge distillation framework effectively reduces computational complexity while maintaining good model performance. Specifically, its key contributions are as follows: (1) We propose attention-guided feature distillation, incorporating a foreground-background mask to mitigate noise interference and leveraging attention mechanisms to enhance perception of critical channels and spatial locations. (2) We introduce rank-aware category distillation, enabling the student model to establish more accurate modeling relationships. (3) We develop multimodal interactive feature supervision, leveraging an intermediate fusion model to enhance point cloud and image information extraction.

2. Related Works

We first introduce the widely adopted methods in 3D object detection. Existing algorithms can be broadly divided into three categories based on the type of sensor used: (1) LiDAR-based object detection: point-based methods, such as PointNet++ [10], extract features using set abstraction; voxel-based methods [11,12,13] divide point clouds into voxels or pillars [14], which are then convolved to obtain features; point-voxel methods [15] combine both feature extraction techniques. (2) Image-based object detection: methods like DETR3D [16] query 3D features for detection, while LSS [17] utilizes view transformations. (3) LiDAR- and image-based object detection: methods like PointPainting [18] enhance LiDAR point clouds with images for early fusion, some works [5,6,19] fuse features in the BEV (bird's-eye view) perspective, and CLOCs [20] performs late fusion to combine proposals from different modalities.
In order to significantly reduce complexity and computational requirements while maintaining network performance, many lightweight methods based on knowledge distillation have been proposed. Hinton et al. [21] first introduced knowledge distillation, the core idea of which is to enhance the performance of small models (student networks) by transferring knowledge from large models (teacher networks). In the field of 2D vision, a relatively mature methodology has been developed: feature alignment-based FitNets [22] significantly improve image classification performance by distilling intermediate-layer features; FAKD [23] introduces feature association mechanisms to enhance the effectiveness of knowledge transfer; and IntRA-KD [24] improves segmentation accuracy by transmitting scene structure information through encoding. For 2D object detection tasks, the latest methods mainly focus on improving distillation loss and soft label strategies, designing feature distillation mechanisms, and optimizing teacher–student network architectures.
In the realm of 3D object detection, research on knowledge distillation has followed two major technical routes. On one hand, methods such as LIGA-Stereo [25] enhance geometric perception through cross-modal knowledge transfer, while CRKD [26] and SRKD [27] improve detection robustness via multi-sensor knowledge fusion and weather condition simulation, respectively. On the other hand, approaches like SparseKD [8] and PointDistiller [7] focus on model lightweighting by balancing efficiency and accuracy through channel/depth compression and local geometry distillation. It is noteworthy that CaKDP [9] proposed a student-driven class-aware distillation strategy to establish a knowledge transfer bridge among heterogeneous detectors. However, existing methods still exhibit several limitations. First, most studies focus on pure point cloud modalities, and the lightweight design of multimodal fusion networks remains insufficiently explored. Second, current feature distillation methods do not effectively address the severe foreground–background imbalance in 3D detection and lack adaptive awareness of the importance of channel and spatial features. Moreover, existing class distillation strategies fail to model the confidence ranking among anchors, thereby limiting improvements in classification performance.

3. Method

3.1. Overall Framework Description

The proposed cross-modal knowledge distillation framework is illustrated in Figure 1. First, we design an attention-guided feature distillation module, in which the foreground and background are separated by a bird's-eye view (BEV) foreground mask, and a combination of spatial and channel attention mechanisms then guides the student model to focus on the most informative feature regions. Second, we introduce a ranking-aware response distillation strategy, which aligns the classification confidence distributions of single-stage and two-stage detectors by modeling anchor-level distributions and applying Kullback–Leibler (KL) divergence constraints. Finally, we propose an interactive feature supervision mechanism that establishes a feature distillation path from LiDAR to images during the intermediate fusion stage, leveraging point cloud data to enhance the spatial awareness of image features.

3.2. Attention-Guided Feature Distillation

Current channel compression-based distillation methods suffer from dual limitations [18,28]. On one hand, reducing feature dimensions leads to the loss of multi-level spatial semantic information, making it difficult to preserve critical geometric details. On the other hand, traditional methods treat feature map channels and spatial positions indiscriminately, neglecting the extreme foreground-background imbalance in 3D scenes, which exacerbates noise interference. To address these issues, as illustrated in Figure 2, we propose a dual-attention guided hierarchical distillation framework. First, a bird’s-eye view (BEV) mask is used to separate the foreground and background. Then, a learnable attention weight mechanism is introduced in the channel dimension to dynamically calibrate and enhance the transfer of knowledge in key channels, mitigating information degradation caused by feature compression. In the spatial dimension, spatial attention is leveraged to focus on the depth feature transmission of potential target regions.
As shown in Equation (1), the bird's-eye view mask is denoted $M_{l,w}$, where $r$ denotes the ground-truth labels and $(l, w)$ is a position in the BEV grid. If the corresponding position contains an object, the mask is 1; otherwise, it is 0.
$$M_{l,w}=\begin{cases}1, & \text{if } (l,w)\in r\\ 0, & \text{otherwise}\end{cases}\tag{1}$$
Additionally, to balance the size differences between object categories and avoid detection performance degradation for small targets (e.g., pedestrians), we also use a scale-based mask $G_{l,w}$, as shown in Equation (2), where $S_r$ is the area of the ground-truth bounding box in the bird's-eye view (BEV) and $N_{bg}$ is the remaining background area, defined in Equation (3).
$$G_{l,w}=\begin{cases}\dfrac{1}{S_r}, & \text{if } (l,w)\in r\\[4pt] \dfrac{1}{N_{bg}}, & \text{otherwise}\end{cases}\tag{2}$$
$$N_{bg}=\sum_{l=1}^{L}\sum_{w=1}^{W}\left(1-M_{l,w}\right)\tag{3}$$
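To make the mask construction concrete, the following minimal sketch shows one way Equations (1)–(3) could be realized, assuming the ground-truth boxes have already been projected onto the $L \times W$ BEV grid; the box format and function name are illustrative, not the paper's actual data pipeline.

```python
import torch

def build_bev_masks(gt_boxes_bev, L, W, eps=1e-6):
    """Eqs. (1)-(3). gt_boxes_bev: list of (l0, l1, w0, w1) integer index ranges of the
    ground-truth boxes on the L x W BEV grid (an illustrative format)."""
    M = torch.zeros(L, W)                      # Eq. (1): binary foreground mask
    G = torch.zeros(L, W)                      # Eq. (2): scale-aware mask
    for (l0, l1, w0, w1) in gt_boxes_bev:
        area = max((l1 - l0) * (w1 - w0), 1)   # S_r: BEV area of this box
        M[l0:l1, w0:w1] = 1.0
        G[l0:l1, w0:w1] = 1.0 / area           # normalize foreground weight by object size
    N_bg = (1.0 - M).sum().clamp(min=eps)      # Eq. (3): remaining background area
    G[M == 0] = 1.0 / N_bg                     # spread unit weight over the background
    return M, G
```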
The scale-aware mask $G_{l,w}$ serves to normalize the contribution of each object based on its size. Larger objects inherently cover more pixels and would otherwise dominate the loss. By incorporating object area and normalized background compensation, $G_{l,w}$ ensures that both large and small objects are treated fairly during knowledge transfer. Many studies [29,30,31,32] have shown that using attention mechanisms to focus on key pixels and key channels helps CNN-based models achieve better results. To minimize the computational cost of the student network, we directly use simple computations [33] to obtain the spatial and channel attention masks. First, we calculate the absolute mean values over the channel dimension and over the spatial dimensions with Formulas (4) and (5):
$$S^{S}(F)=\frac{1}{C}\sum_{c=1}^{C}\left|F_{c}\right|\tag{4}$$
$$S^{C}(F)=\frac{1}{LW}\sum_{l=1}^{L}\sum_{w=1}^{W}\left|F_{l,w}\right|\tag{5}$$
where $L$, $W$, and $C$ denote the length, width, and number of channels of the feature map, and $S^{S}$ and $S^{C}$ are the spatial and channel attention maps. The attention masks are then expressed as Equations (6) and (7):
$$A^{S}(F)=LW\cdot\mathrm{softmax}\!\left(S^{S}(F)/T\right)\tag{6}$$
$$A^{C}(F)=C\cdot\mathrm{softmax}\!\left(S^{C}(F)/T\right)\tag{7}$$
where $T$ is a temperature hyperparameter used to adjust the distribution [21]. During the training process, we use the teacher's attention masks to guide the student. The spatial attention map $A^{S}(F)$ tends to assign higher weights to regions with strong activation responses, which often correspond to object centers or boundaries in 3D scenes. Similarly, the channel attention $A^{C}(F)$ emphasizes feature channels that are more discriminative across different object categories. This selective enhancement helps the student focus on semantically meaningful information during the distillation process. The final feature distillation loss is computed as follows:
$$\mathcal{L}_{att\_feat}=\alpha\sum_{c=1}^{C}\sum_{l=1}^{L}\sum_{w=1}^{W}M_{l,w}G_{l,w}A^{S}_{l,w}A^{C}_{c}\left(F^{t}_{c,l,w}-f\!\left(F^{s}_{c,l,w}\right)\right)^{2}+\beta\sum_{c=1}^{C}\sum_{l=1}^{L}\sum_{w=1}^{W}\left(1-M_{l,w}\right)G_{l,w}A^{S}_{l,w}A^{C}_{c}\left(F^{t}_{c,l,w}-f\!\left(F^{s}_{c,l,w}\right)\right)^{2}\tag{8}$$
where $\alpha$ and $\beta$ are hyperparameters that balance the foreground and background terms, $F^{t}$ and $F^{s}$ denote the teacher and student feature maps, respectively, and $f(\cdot)$ is the adaptation layer that maps the student features to the teacher's channel dimension.
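As a rough illustration of Equations (4)–(8), the snippet below sketches the dual-attention weights and the resulting feature distillation loss. The helper names are ours, the temperature value is illustrative, the $\alpha$/$\beta$ defaults mirror the settings reported in Section 4.1, and the student features are assumed to have already passed through the adaptation layer $f(\cdot)$ so that shapes match.

```python
import torch
import torch.nn.functional as F

def attention_maps(feat, T=0.5):
    """Spatial (L, W) and channel (C,) attention masks of Eqs. (4)-(7); feat is (C, L, W)."""
    C, L, W = feat.shape
    s_spatial = feat.abs().mean(dim=0)                                     # Eq. (4)
    s_channel = feat.abs().mean(dim=(1, 2))                                # Eq. (5)
    A_s = L * W * F.softmax(s_spatial.flatten() / T, dim=0).view(L, W)     # Eq. (6)
    A_c = C * F.softmax(s_channel / T, dim=0)                              # Eq. (7)
    return A_s, A_c

def att_feat_loss(f_t, f_s, M, G, alpha=1e-3, beta=5e-4, T=0.5):
    """Eq. (8): mask- and attention-weighted squared error between teacher and student.
    f_t, f_s: (C, L, W); f_s is assumed to be the student features after the
    adaptation layer f(.), so the shapes already match."""
    A_s, A_c = attention_maps(f_t, T)                 # teacher attention guides the student
    w = A_s.unsqueeze(0) * A_c.view(-1, 1, 1)         # combined (C, L, W) weighting
    sq_err = (f_t - f_s) ** 2
    fg = (M * G).unsqueeze(0)                         # foreground weighting
    bg = ((1.0 - M) * G).unsqueeze(0)                 # background weighting
    return alpha * (fg * w * sq_err).sum() + beta * (bg * w * sq_err).sum()
```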
To verify the practicality of our dual-attention-guided distillation in real-world settings, we designed all attention computations to rely solely on simple, highly parallelizable tensor operations (summation, normalization, and softmax) without introducing any extra convolutional layers. The learnable parameters consist only of per-channel scaling weights, which represent a vanishingly small fraction of the overall model size.
During training, the additional cost arises only from (1) computing two attention maps per feature tensor, one along the spatial dimensions and one along the channel dimension, and (2) applying the distillation loss between teacher and student features. Both steps reuse intermediate activations and leverage optimized library routines, so the wall-clock time per epoch increases only marginally on modern GPUs. Crucially, there is zero overhead at inference, since all attention masks and loss computations are disabled once training completes.
In terms of memory footprint, only two extra tensors (one of shape $L\times W$ and one of length $C$) need to be stored transiently during the forward/backward passes, which is negligible compared with the backbone feature maps. Overall, our framework imposes only a light training-time penalty while delivering consistent gains in student accuracy, and it does not affect runtime latency or deployment size.

3.3. Ranking-Aware Class Distillation

Analysis [9] reveals significant differences in prediction distributions between heterogeneous detectors (two-stage vs. single-stage). Two-stage detectors utilize RoI selection to suppress background noise and achieve higher category confidence, while single-stage detectors, with their dense anchors, tend to induce inter-class competition. To address this issue, CaKDP [9] proposes a student-driven, category-aware knowledge distillation strategy. However, directly aligning classification scores overlooks the relative confidence ranking among anchors [34]. Particularly for challenging samples, the student network's limited feature representation may lead to misaligned anchor selection, making it difficult to maintain spatial consistency with the teacher's predictions. As illustrated in Figure 3, we design a ranking-aware category distillation framework that supervises the differences in category distributions via KL divergence and aligns the confidence ranking relationships using a ranking loss. This approach achieves a deep alignment of prediction distributions between heterogeneous detectors, effectively enhancing the modeling accuracy for challenging targets.
Specifically, we first use the student network to select the Top-$N$ high-response anchors for each instance as representative samples. The spatial positions of these student anchors are then mapped onto the teacher network's first-stage feature map, and RoI pooling is used to extract the corresponding second-stage classification scores. For the $j$-th instance of each frame, the representative anchors are $a_{i}^{j}$, where $i\in\{1,\dots,N\}$ and $N$ is the number of selected anchors. The corresponding student and teacher target-class classification scores are denoted $p_{s}^{cls}(i,j)$ and $p_{t}^{cls}(i,j)$. We apply the softmax function to these anchor scores to obtain the ranking distributions $p_{s}^{rank}(i,j)$ and $p_{t}^{rank}(i,j)$, as shown in Equations (9) and (10). The temperature parameter $\tau$ is introduced to soften the distributions, enhancing the transfer of ranking information for difficult samples. The KL divergence between the teacher and student ranking distributions is then minimized in Equation (11):
$$p_{s}^{rank}(i,j)=\frac{\exp\!\left(p_{s}^{cls}(i,j)/\tau\right)}{\sum_{m=1}^{N}\exp\!\left(p_{s}^{cls}(m,j)/\tau\right)}\tag{9}$$
$$p_{t}^{rank}(i,j)=\frac{\exp\!\left(p_{t}^{cls}(i,j)/\tau\right)}{\sum_{m=1}^{N}\exp\!\left(p_{t}^{cls}(m,j)/\tau\right)}\tag{10}$$
$$\mathcal{L}_{rank}=\sum_{j=1}^{M}\sum_{i=1}^{N}p_{t}^{rank}(i,j)\log\frac{p_{t}^{rank}(i,j)}{p_{s}^{rank}(i,j)}\tag{11}$$
where $M$ is the number of instances in a frame.
We also directly constrain the class probability distributions of the teacher and student for the same anchors, and compute the per-class KL divergence $\mathcal{L}_{cate}$:
$$\mathcal{L}_{cate}=\sum_{j=1}^{M}\sum_{i=1}^{N}p_{t}^{cls}(i,j)\log\frac{p_{t}^{cls}(i,j)}{p_{s}^{cls}(i,j)}\tag{12}$$
$$\mathcal{L}_{cls}=\gamma\mathcal{L}_{rank}+\delta\mathcal{L}_{cate}\tag{13}$$
where $\gamma$ and $\delta$ are weighting coefficients that balance the two terms. The ranking-aware class distillation method aligns anchor-level ranking distributions, enhancing the student's ability to learn the key anchors in the teacher's decision logic. This ranking-distribution supervision also alleviates the false-negative problem caused by anchor position shifts. The method achieves multi-granularity supervision: the class-distribution constraint ensures global confidence accuracy, while the ranking-distribution constraint enhances local ranking consistency, thus promoting the performance of heterogeneous detectors.
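The anchor-level computation can be summarized by the following sketch of Equations (9)–(13). It assumes the per-instance Top-$N$ target-class scores have already been gathered from the student head and from the teacher's second stage via RoI pooling, and that the scores are valid probabilities; the function name, temperature default, and tensor layout are illustrative, while the $\gamma$/$\delta$ defaults mirror Section 4.1.

```python
import torch
import torch.nn.functional as F

def ranking_aware_cls_loss(p_s_cls, p_t_cls, tau=2.0, gamma=1e-3, delta=5e-3):
    """Eqs. (9)-(13). p_s_cls, p_t_cls: (M, N) target-class scores (probabilities in (0, 1])
    for M instances x N representative anchors of one frame."""
    # Eqs. (9)-(10): temperature-softened ranking distributions over each instance's anchors
    p_s_rank = F.softmax(p_s_cls / tau, dim=1)
    p_t_rank = F.softmax(p_t_cls / tau, dim=1)
    # Eq. (11): KL divergence between teacher and student ranking distributions
    l_rank = (p_t_rank * (p_t_rank.clamp_min(1e-8).log()
                          - p_s_rank.clamp_min(1e-8).log())).sum()
    # Eq. (12): per-anchor KL term on the class confidences themselves
    l_cate = (p_t_cls * (p_t_cls.clamp_min(1e-8).log()
                         - p_s_cls.clamp_min(1e-8).log())).sum()
    # Eq. (13): weighted combination
    return gamma * l_rank + delta * l_cate
```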
The overall loss function for the method can be expressed as:
$$\mathcal{L}_{loss}=\mathcal{L}_{gt}+\mathcal{L}_{cls}+\mathcal{L}_{feat}\tag{14}$$
where $\mathcal{L}_{gt}$ denotes the loss computed between the network's output and the ground truth during the student network's training phase, and $\mathcal{L}_{feat}$ is the feature distillation loss (here, $\mathcal{L}_{att\_feat}$ from Equation (8)).
To ensure that our ranking-aware class distillation is both effective and practical, we carefully designed each component to incur minimal overhead. During training, our module reuses the student’s forward-pass scores and applies a highly optimized top-k selection to identify the N most responsive anchors, incurring only marginal latency. These anchor coordinates are then mapped to the teacher’s first-stage feature map and processed with standard RoI pooling—no new network layers or learnable parameters are introduced. Next, two softmax distributions (Equations (9) and (10)) and their corresponding KL-divergence losses (Equations (11) and (12)) are computed in a fully vectorized fashion, scaling linearly with N and leveraging backend-accelerated tensor routines. Temporarily storing only two small N-length tensors per instance imposes a negligible memory footprint compared to backbone activations. Crucially, all extra operations—anchor selection, RoI pooling, softmax and loss calculation—are active only during training, and are entirely bypassed at inference, resulting in zero impact on deployment latency or model size. Overall, our ranking-aware class distillation adds only a slight, practically imperceptible training-time cost while consistently boosting student accuracy.

3.4. Interactive Feature Supervision Distillation

In LiDAR–camera 3D detection, there are three principal fusion paradigms: early fusion, intermediate fusion, and late fusion. Early fusion typically processes the images first and then applies a 3D backbone network and detection head to the augmented point clouds, so the point-cloud-based knowledge distillation method proposed above can be applied directly; the same holds for late fusion, where the previously introduced distillation losses can be reused without modification. However, such single-modality supervision either applies point-cloud distillation losses directly to fused representations or treats the entire multimodal feature map as a homogeneous block. It therefore overlooks the rich, asymmetric relationships between LiDAR and image features and cannot adapt when teacher and student channel dimensions do not match. By contrast, our Interactive Feature Supervision Distillation explicitly aligns and transfers knowledge both within and across modalities, before and after fusion.
The intermediate fusion approach is illustrated in Figure 4. In this method, images and point clouds are processed separately by their respective backbone networks to extract features. The student camera features are denoted $F_{Camera}^{s}$ and the student LiDAR features $F_{Lidar}^{s}$; the teacher LiDAR and camera features are denoted $F_{Lidar}^{t}$ and $F_{Camera}^{t}$, respectively. The fused features of the student and teacher networks are represented as $F_{CL}^{s}$ and $F_{CL}^{t}$. As in the early fusion case, the difference between $F_{CL}^{s}$ and $F_{CL}^{t}$ can be minimized using $\mathcal{L}_{feat}$. However, with intermediate fusion, the feature differences before fusion remain crucial for network performance. Therefore, we propose a multimodal interactive feature supervision mechanism to fully exploit the correlations between the multimodal features and the potential of each individual modality.
Similar to [28], we use an automatic channel encoder–decoder, consisting of an encoder that gradually reduces the channel dimension and a decoder that restores it. This structure allows the teacher and student networks, which may have mismatched channel dimensions, to be projected to the same dimension before the loss is computed. As in the attention-guided feature distillation method, we can also obtain the spatial and channel attention masks of the features. To capture global structure, the feature distillation losses for LiDAR-to-LiDAR ($\mathcal{L}_{ll}$), Camera-to-Camera ($\mathcal{L}_{cc}$), and LiDAR-to-Camera ($\mathcal{L}_{lc}$) are given by Equations (15)–(17):
$$\mathcal{L}_{ll}=\mathcal{L}_{F}\!\left(A^{Spatial}\!\left(F_{L}^{t}\right),A^{Spatial}\!\left(F_{L}^{s}\right)\right)+\mathcal{L}_{F}\!\left(A^{Channel}\!\left(F_{L}^{t}\right),A^{Channel}\!\left(F_{L}^{s}\right)\right)\tag{15}$$
$$\mathcal{L}_{cc}=\mathcal{L}_{F}\!\left(A^{Spatial}\!\left(F_{C}^{t}\right),A^{Spatial}\!\left(F_{C}^{s}\right)\right)+\mathcal{L}_{F}\!\left(A^{Channel}\!\left(F_{C}^{t}\right),A^{Channel}\!\left(F_{C}^{s}\right)\right)\tag{16}$$
$$\mathcal{L}_{lc}=\mathcal{L}_{F}\!\left(A^{Spatial}\!\left(F_{L}^{t}\right),A^{Spatial}\!\left(F_{C}^{s}\right)\right)+\mathcal{L}_{F}\!\left(A^{Channel}\!\left(F_{L}^{t}\right),A^{Channel}\!\left(F_{C}^{s}\right)\right)\tag{17}$$
$$\mathcal{L}_{intra\_feat}=\sigma_{1}\mathcal{L}_{ll}+\sigma_{2}\mathcal{L}_{cc}+\sigma_{3}\mathcal{L}_{lc}\tag{18}$$
where $F_{L}$ and $F_{C}$ abbreviate the LiDAR and camera features, and $\sigma_{1}$, $\sigma_{2}$, and $\sigma_{3}$ balance the distillation terms across the different modalities. By incorporating $\mathcal{L}_{intra\_feat}$ into Equation (14), the loss function for the multimodal object detection knowledge distillation method is obtained.
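A minimal sketch of Equations (15)–(18) is given below, assuming $\mathcal{L}_{F}$ is taken as an MSE between attention maps, that teacher and student features share the same spatial size, and that mismatched channel widths are first reconciled by a small channel encoder–decoder (reduced here to two 1×1 convolutions). The class and function names are ours, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAdapter(nn.Module):
    """Channel encoder-decoder that maps student channels to the teacher's width."""
    def __init__(self, c_in, c_out, c_mid=64):
        super().__init__()
        self.enc = nn.Conv2d(c_in, c_mid, kernel_size=1)   # gradually reduce channels
        self.dec = nn.Conv2d(c_mid, c_out, kernel_size=1)  # restore to the target width

    def forward(self, x):
        return self.dec(F.relu(self.enc(x)))

def _attn(feat, T=0.5):
    """Spatial (L, W) and channel (C,) attention maps, as in Section 3.2."""
    C, L, W = feat.shape
    a_s = L * W * F.softmax(feat.abs().mean(0).flatten() / T, dim=0).view(L, W)
    a_c = C * F.softmax(feat.abs().mean(dim=(1, 2)) / T, dim=0)
    return a_s, a_c

def pair_loss(f_t, f_s):
    """L_F between one teacher/student pair, taken here as MSE on attention maps."""
    a_s_t, a_c_t = _attn(f_t)
    a_s_s, a_c_s = _attn(f_s)
    return F.mse_loss(a_s_t, a_s_s) + F.mse_loss(a_c_t, a_c_s)

def intra_feat_loss(f_l_t, f_l_s, f_c_t, f_c_s,
                    sigma1=3e-4, sigma2=1e-4, sigma3=5e-4):
    """Eq. (18); all features are (C, L, W) with matching shapes (adapted beforehand)."""
    l_ll = pair_loss(f_l_t, f_l_s)   # Eq. (15): LiDAR teacher -> LiDAR student
    l_cc = pair_loss(f_c_t, f_c_s)   # Eq. (16): camera teacher -> camera student
    l_lc = pair_loss(f_l_t, f_c_s)   # Eq. (17): point-cloud prior supervises image features
    return sigma1 * l_ll + sigma2 * l_cc + sigma3 * l_lc
```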

4. Experimental Results

4.1. Experimental Settings

The KITTI [35] dataset is the most widely used dataset for 3D object detection in outdoor autonomous driving scenarios, comprising 7481 training samples and 7518 test samples. We adopt the same data split as SECOND [12], utilizing 3712 frames for training and 3769 frames for testing, and evaluate the mean Average Precision (mAP) across three difficulty levels (easy, moderate, and hard) using 40 recall points as per the official evaluation protocol. Additionally, we validate our approach on the nuScenes [36] dataset—a larger-scale outdoor driving dataset where each frame includes data from six surrounding cameras and a LiDAR sensor. The dataset contains over 1.4 million annotated 3D detection boxes spanning 10 classes, and the nuScenes detection score (NDS) is used as the evaluation metric.
We validate the proposed method using various distillation pairs across different datasets. For the KITTI dataset, we use several two-stage detectors, including Voxel-RCNN [11], PV-RCNN [15], and PartA2 [37], as teachers to assist in training single-stage detectors (SECOND [12] and CenterPoint [13] with different detection heads). For the multimodal input, we follow virtual-point-based methods (SFD [19], VirConv [38]): the virtual points generated by image depth completion are combined with the real LiDAR points as input, allowing us to validate the overall performance of the fusion setting. Additionally, we validate the intermediate fusion configuration using the existing fusion method SFD [19].
In the experimental setup, for the single-stage and early fusion settings with the SECOND detector, the hyperparameters are set to $\{\alpha=1\times10^{-3},\ \beta=5\times10^{-4},\ \gamma=1\times10^{-3},\ \delta=5\times10^{-3}\}$. For CenterPoint, they are set to $\{\alpha=1.6\times10^{-3},\ \beta=8\times10^{-4},\ \gamma=3\times10^{-3},\ \delta=5\times10^{-3}\}$. For the intermediate fusion detectors, the hyperparameters of the interactive feature supervision are set to $\{\sigma_{1}=3\times10^{-4},\ \sigma_{2}=1\times10^{-4},\ \sigma_{3}=5\times10^{-4}\}$. Other training and evaluation configurations from OpenPCDet are kept at their defaults. All experiments are run on a GeForce RTX 4090 GPU.
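For reference, these loss weights can be collected in a single configuration dictionary; the key names below are illustrative and do not correspond to actual OpenPCDet configuration fields.

```python
# Loss-weight settings of Section 4.1 collected in one place for reproducibility.
# The dictionary keys are illustrative and are not actual OpenPCDet config fields.
DISTILL_WEIGHTS = {
    "second":       {"alpha": 1e-3,   "beta": 5e-4, "gamma": 1e-3, "delta": 5e-3},
    "centerpoint":  {"alpha": 1.6e-3, "beta": 8e-4, "gamma": 3e-3, "delta": 5e-3},
    "intermediate_fusion_sfd": {"sigma1": 3e-4, "sigma2": 1e-4, "sigma3": 5e-4},
}
```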

4.2. Experimental Results on KITTI

We first validate the effectiveness of our method on the KITTI dataset’s validation set. As shown in Table 1, the detection results for all three categories (Car, Pedestrian, and Cyclist) are evaluated under the moderate difficulty level of the KITTI benchmark. Our fused approach enables the training of accurate and compact student detectors through knowledge distillation from multiple diverse teacher methods.
Specifically, the first column on the right-hand side of the table (Up/Down1) presents the performance comparison between the channel-reduced student networks and their original counterparts. Our method exhibits marginal performance degradation relative to the baseline only when the student network's parameters are reduced to less than 0.2× of the original configuration; all other reduced configurations achieve superior performance after knowledge distillation.
The second rightmost column (up/down2) quantifies the performance gap between teacher networks and student networks. Remarkably, our approach even surpasses the two-stage detection teacher networks when the channel reduction in student networks remains moderate (e.g., >0.5× parameter retention). This evidence suggests that our distillation framework successfully preserves critical detection knowledge while achieving substantial model compression.
To further demonstrate the effectiveness of our multi-modal knowledge distillation method, we employ the SFD network as the teacher network and utilize channel-reduced networks with the second stage removed as student networks. Comparative results with other knowledge distillation methods are presented in Table 2. Our approach substantially outperforms competing methods while maintaining baseline performance levels despite significant computational cost reduction.
Our visualization results are presented in Figure 5. The first row shows the detection results of the SECOND student network obtained through our knowledge distillation approach with Voxel-RCNN as the teacher, while the second row displays the results of the original SECOND detector. It is evident that our network substantially reduces false detections while maintaining high accuracy.

4.3. Ablation Study

This section investigates the impact of the different distillation losses through comprehensive ablation studies. We first evaluate $\mathcal{L}_{cls}$, $\mathcal{L}_{intra\_feat}$, and $\mathcal{L}_{att\_feat}$ individually, followed by pairwise combinations of these loss components. Following the configuration in Table 3, we employ the SFD method as the teacher network and a channel-reduced SFD architecture as the student network. The experimental results presented in Table 3 demonstrate that each proposed module contributes positively to the overall detection performance, with combined usage achieving the best effectiveness.
In addition to the KITTI experiments, we further conduct validation experiments using the TransFusion method on the nuScenes dataset. The experimental results are shown in Table 4. The results similarly demonstrate the effectiveness of our approach, showing that the method maintains baseline performance levels while achieving significant model size reduction.
To investigate the characteristics of distillation performance, we conducted experiments using a two-stage to two-stage knowledge distillation framework. The experimental results presented in Table 5 demonstrate that our method effectively preserves model performance while reducing network complexity through knowledge distillation. Notably, our approach achieves simultaneous channel compression and performance enhancement across all comparative experimental groups.
Furthermore, we conducted experiments with varying parameter configurations. Our analysis reveals that the proposed method demonstrates low sensitivity to specific hyperparameter settings, maintaining stable performance across different parameter selections. The channel-reduced student architecture retains critical feature representation capacity through $\mathcal{L}_{att\_feat}$, which enforces structural similarity between teacher and student feature maps. This built-in regularization reduces dependence on precise parameter tuning. The synergistic combination of $\mathcal{L}_{cls}$, $\mathcal{L}_{att\_feat}$, and $\mathcal{L}_{intra\_feat}$ creates complementary learning signals. This loss-landscape smoothing effect mitigates the risk of entrapment in local minima caused by suboptimal hyperparameters.

5. Conclusions

This paper proposes a cross-modal knowledge distillation framework for enhancing 3D object detection performance through multi-level knowledge transfer strategies. To address critical challenges in cross-modal feature learning, we present three innovative solutions: (1) Attention-Guided Feature Distillation: An attention-driven distillation mechanism incorporating bird’s-eye-view (BEV) foreground masks and dual-attention modules substantially enhances feature learning sufficiency while suppressing background noise interference. (2) Category-Aware Response Distillation: Our anchor-level distribution modeling with KL divergence optimization effectively mitigates classification performance degradation when distilling knowledge from two-stage to single-stage detectors. (3) Interactive Feature Supervision Framework: A novel cross-modal alignment strategy with LiDAR-image feature supervision significantly strengthens the representation capability of intermediate fusion approaches. Extensive experiments on KITTI and other benchmarks demonstrate that our method maintains baseline performance levels while reducing model complexity, even achieving performance improvements over certain existing approaches. These technical advances provide new solutions for real-time 3D perception in autonomous vehicles and UAV applications, while establishing reference standards for future research in multi-sensor fusion systems.
Critical Reflection: Our framework consistently outperforms or matches full-size baselines on KITTI, nuScenes, and other benchmarks—even under substantial compression—demonstrating that carefully designed KD strategies can not only recover but sometimes exceed the accuracy of uncompressed models. While our experiments validate the effectiveness of each module, future work could explore automated hyperparameter tuning for the attention and distillation weights ($\alpha$, $\beta$, $\gamma$, $\delta$) and assess robustness under varying sensor noise and domain shifts.
Applications and Future Directions: The proposed methods are immediately applicable to real-time 3D perception in autonomous vehicles, UAV navigation, and mobile robotics—domains where tight compute and memory budgets demand efficient yet accurate detectors. Looking forward, we plan to extend our framework to handle (i) dynamic sensor configurations, adapting on-the-fly to sensor failures or occlusions; (ii) continual learning scenarios, where new object classes emerge post-deployment; and (iii) other modality pairs such as radar–camera or thermal–LiDAR fusion. We also envision integrating our distillation strategies with unsupervised and self-supervised pretraining to further reduce annotation requirements.
In summary, this work lays a comprehensive foundation for lightweight, high-performance multimodal detection. By critically evaluating our contributions and outlining clear paths for future enhancement, we aim to inspire continued progress in efficient 3D perception and multi-sensor fusion research.

Author Contributions

Conceptualization, B.Y.; methodology, B.Y.; investigation, B.Y. and T.T.; data curation, J.Y.; writing—original draft preparation, B.Y.; writing—review and editing, B.Y., W.W. and T.T.; funding acquisition, Y.Z., X.M. and J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was fully supported by the Key Research and Development Program of Hubei Province, China (2022BCA035). The numerical calculations were performed at the Supercomputing Center of Wuhan University.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Authors Yongjun Zhang and Xiuyuan Meng were employed by the company Shanxi Road and Bridge Group Xinzhou National Highway Project Construction Management Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Mao, J.; Shi, S.; Wang, X.; Li, H. 3D Object Detection for Autonomous Driving: A Comprehensive Survey. Int. J. Comput. Vis. 2023, 131, 1909–1963. [Google Scholar] [CrossRef]
  2. Muzahid, A.A.M.; Han, H.; Zhang, Y.; Li, D.; Zhang, Y.; Jamshid, J.; Sohel, F. Deep learning for 3D object recognition: A survey. Neurocomputing 2024, 608, 128436. [Google Scholar] [CrossRef]
  3. Nagiub, A.S.; Fayez, M.; Khaled, H.; Ghoniemy, S. 3D Object Detection for Autonomous Driving: A Comprehensive Review. In Proceedings of the 2024 6th International Conference on Computing and Informatics (ICCI), New Cairo, Egypt, 6–7 March 2024; pp. 1–11. [Google Scholar]
  4. Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge Distillation: A Survey. Int. J. Comput. Vis. 2021, 129, 1789–1819. [Google Scholar] [CrossRef]
  5. Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.L.; Han, S. BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 2774–2781. [Google Scholar]
  6. Bai, X.; Hu, Z.; Zhu, X.; Huang, Q.; Chen, Y.; Fu, H.; Tai, C.-L. TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 21–24 June 2022; pp. 1080–1089. [Google Scholar]
  7. Zhang, L.; Dong, R.; Tai, H.-S.; Ma, K. PointDistiller: Structured Knowledge Distillation Towards Efficient and Compact 3D Detection. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; IEEE: Vancouver, BC, Canada, 2023; pp. 21791–21801. [Google Scholar]
  8. Yang, J.; Shi, S.; Ding, R.; Wang, Z.; Qi, X. Towards efficient 3D object detection with knowledge distillation. In Proceedings of the 36th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 28 November–9 December 2022; Curran Associates, Inc.: Red Hook, NY, USA, 2022; pp. 21300–21313. [Google Scholar]
  9. Zhang, H.; Liu, L.; Huang, Y.; Yang, Z.; Lei, X.; Wen, B. CaKDP: Category-Aware Knowledge Distillation and Pruning Framework for Lightweight 3D Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 15331–15341. [Google Scholar]
  10. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  11. Deng, J.; Shi, S.; Li, P.; Zhou, W.; Zhang, Y.; Li, H. Voxel R-CNN: Towards High Performance Voxel-based 3D Object Detection. Proc. AAAI Conf. Artif. Intell. 2021, 35, 1201–1209. [Google Scholar] [CrossRef]
  12. Yan, Y.; Mao, Y.; Li, B. SECOND: Sparsely Embedded Convolutional Detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef] [PubMed]
  13. Yin, T.; Zhou, X.; Krähenbühl, P. Center-based 3D Object Detection and Tracking. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 11779–11788. [Google Scholar]
  14. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast Encoders for Object Detection From Point Clouds. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12689–12697. [Google Scholar]
  15. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10526–10535. [Google Scholar]
  16. Wang, Y.; Guizilini, V.C.; Zhang, T.; Wang, Y.; Zhao, H.; Solomon, J. DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries. In Proceedings of the 5th Conference on Robot Learning, PMLR, London, UK, 8–11 November 2021. [Google Scholar]
  17. Philion, J.; Fidler, S. Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 194–210. [Google Scholar]
  18. Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. PointPainting: Sequential Fusion for 3D Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 4603–4611. [Google Scholar]
  19. Wu, X.; Peng, L.; Yang, H.; Xie, L.; Huang, C.; Deng, C.; Liu, H.; Cai, D. Sparse Fuse Dense: Towards High Quality 3D Detection with Depth Completion. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 21–24 June 2022; pp. 5408–5417. [Google Scholar]
  20. Pang, S.; Morris, D.; Radha, H. CLOCs: Camera-LiDAR Object Candidates Fusion for 3D Object Detection. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 10386–10393. [Google Scholar]
  21. Hinton, G.E.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  22. Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. FitNets: Hints for Thin Deep Nets. arXiv 2015, arXiv:1412.6550. [Google Scholar] [CrossRef]
  23. He, Z.; Dai, T.; Lu, J.; Jiang, Y.; Xia, S.-T. Fakd: Feature-Affinity Based Knowledge Distillation for Efficient Image Super-Resolution. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 518–522. [Google Scholar]
  24. Hou, Y.; Ma, Z.; Liu, C.; Hui, T.-W.; Loy, C.C. Inter-Region Affinity Distillation for Road Marking Segmentation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 12483–12492. [Google Scholar]
  25. Guo, X.; Shi, S.; Wang, X.; Li, H. LIGA-Stereo: Learning LiDAR Geometry Aware Representations for Stereo-based 3D Detector. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 3133–3143. [Google Scholar]
  26. Zhao, L.; Song, J.; Skinner, K.A. CRKD: Enhanced Camera-Radar Object Detection with Cross-Modality Knowledge Distillation. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 15470–15480. [Google Scholar]
  27. Huang, X.; Wu, H.; Li, X.; Fan, X.; Wen, C.; Wang, C. Sunshine to Rainstorm: Cross-Weather Knowledge Distillation for Robust 3D Object Detection. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2024, Vancouver, BC, Canada, 20–27 February 2024; Wooldridge, M.J., Dy, J.G., Natarajan, S., Eds.; AAAI Press: Washington, DC, USA, 2024; pp. 2409–2416. [Google Scholar]
  28. Cho, H.; Choi, J.; Baek, G.; Hwang, W. itKD: Interchange Transfer-based Knowledge Distillation for 3D Object Detection. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 13540–13549. [Google Scholar]
  29. Xiao, J.; Wu, Y.; Chen, Y.; Wang, S.; Wang, Z.; Ma, J. LSTFE-Net: Long Short-Term Feature Enhancement Network for Video Small Object Detection. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 14613–14622. [Google Scholar]
  30. Xiao, J.; Yao, Y.; Zhou, J.; Guo, H.; Yu, Q.; Wang, Y.-F. FDLR-Net: A feature decoupling and localization refinement network for object detection in remote sensing images. Expert Syst. Appl. 2023, 225, 120068. [Google Scholar] [CrossRef]
  31. Xiao, J.; Wang, S.; Zhou, J.; Zeng, Z.; Luo, M.; Chen, R. Revisiting the Learning Stage in Range View Representation for Autonomous Driving. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5702014. [Google Scholar] [CrossRef]
  32. Xiao, J.; Guo, H.; Zhou, J.; Zhao, T.; Yu, Q.; Chen, Y.; Wang, Z. Tiny object detection with context enhancement and feature purification. Expert Syst. Appl. 2023, 211, 118665. [Google Scholar] [CrossRef]
  33. Yang, Z.; Li, Z.; Jiang, X.; Gong, Y.; Yuan, Z.; Zhao, D.; Yuan, C. Focal and Global Knowledge Distillation for Detectors. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 21–24 June 2022; pp. 4633–4642. [Google Scholar]
  34. Li, G.; Li, X.; Wang, Y.; Zhang, S.; Wu, Y.; Liang, D. Knowledge Distillation for Object Detection via Rank Mimicking and Prediction-Guided Feature Imitation. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelfth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022, Virtual Event, 22 February–1 March 2022; AAAI Press: Washington, DC, USA, 2022; pp. 1306–1313. [Google Scholar]
  35. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
  36. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A Multimodal Dataset for Autonomous Driving. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11618–11628. [Google Scholar]
  37. Shi, S.; Wang, Z.; Shi, J.; Wang, X.; Li, H. From Points to Parts: 3D Object Detection From Point Cloud With Part-Aware and Part-Aggregation Network. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 2647–2664. [Google Scholar] [CrossRef] [PubMed]
  38. Wu, H.; Wen, C.; Shi, S.; Li, X.; Wang, C. Virtual Sparse Convolution for Multimodal 3D Object Detection. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 21653–21662. [Google Scholar]
  39. Dai, X.; Jiang, Z.; Wu, Z.; Bao, Y.; Wang, Z.; Liu, S.; Zhou, E. General Instance Distillation for Object Detection. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 7838–7847. [Google Scholar]
Figure 1. Overview of the proposed cross-modal knowledge distillation framework: the pink parts represent the student network, the blue parts represent the teacher network, and the green parts indicate the added knowledge distillation modules. The green dashed lines represent the backpropagation pathway.
Figure 2. Illustration of the attention-guided feature distillation.
Figure 3. Illustration of the ranking-aware class distillation.
Figure 4. Illustration of the interactive feature supervision method.
Figure 5. Visualization results.
Table 1. Detection results (mAP, %, moderate difficulty) on the KITTI validation set for different teacher–student distillation pairs and student model sizes. Up/Down1 compares each distilled student with its original (non-distilled) counterpart; Up/Down2 compares it with the teacher network.
Role | Model | Model Size | KD | Car | Pedestrian | Cyclist | mAP | Up/Down1 | Up/Down2
Student | SECOND [12] | base | × | 82.49 | 53.98 | 67.41 | 67.96 | - | -
Student | CenterPoint [13] | base | × | 79.97 | 52.93 | 69.16 | 67.35 | - | -
Teacher | Voxel-RCNN [11] | base | × | 85.56 | 59.20 | 74.60 | 73.45 | - | -
Student | SECOND [12] | base | ✓ | 84.48 | 61.78 | 76.29 | 74.18 | |
Student | SECOND [12] | S | ✓ | 84.16 | 64.03 | 74.52 | 74.24 | |
Student | SECOND [12] | XS | ✓ | 83.67 | 61.64 | 73.24 | 72.85 | |
Student | SECOND [12] | XXS | ✓ | 80.33 | 54.76 | 67.41 | 67.50 | |
Student | CenterPoint [13] | base | ✓ | 84.4 | 62.27 | 75.41 | 74.03 | |
Student | CenterPoint [13] | S | ✓ | 83.72 | 60.39 | 74.76 | 72.96 | |
Student | CenterPoint [13] | XS | ✓ | 81.71 | 62.04 | 73.08 | 72.28 | |
Student | CenterPoint [13] | XXS | ✓ | 74.84 | 60.15 | 67.85 | 67.61 | |
Teacher | PV-RCNN [15] | base | × | 85.50 | 59.21 | 73.13 | 72.61 | - | -
Student | SECOND [12] | base | ✓ | 84.59 | 60.28 | 75.41 | 73.43 | |
Student | SECOND [12] | S | ✓ | 83.88 | 60.76 | 72.9 | 72.51 | |
Student | SECOND [12] | XS | ✓ | 83.33 | 61.57 | 72.29 | 72.40 | |
Student | SECOND [12] | XXS | ✓ | 80.62 | 53.77 | 67.39 | 67.26 | |
Student | CenterPoint [13] | base | ✓ | 84.68 | 58.66 | 71.72 | 71.69 | |
Student | CenterPoint [13] | S | ✓ | 84.39 | 59.7 | 74.16 | 72.75 | |
Student | CenterPoint [13] | XS | ✓ | 81.93 | 60.2 | 71.7 | 71.28 | |
Student | CenterPoint [13] | XXS | ✓ | 75.35 | 58.92 | 68.26 | 67.51 | |
Teacher | PartA2 [37] | base | × | 83.65 | 61.78 | 74.08 | 73.17 | - | -
Student | SECOND [12] | base | ✓ | 83.89 | 61.51 | 75.12 | 73.51 | |
Student | SECOND [12] | S | ✓ | 84.39 | 61.7 | 74.94 | 73.68 | |
Student | SECOND [12] | XS | ✓ | 84.28 | 59.25 | 73.34 | 72.29 | |
Student | SECOND [12] | XXS | ✓ | 78.82 | 54.42 | 67.36 | 66.87 | |
Student | CenterPoint [13] | base | ✓ | 84.06 | 59.61 | 73.24 | 72.30 | |
Student | CenterPoint [13] | S | ✓ | 83.85 | 59.62 | 74.46 | 72.64 | |
Student | CenterPoint [13] | XS | ✓ | 81.18 | 60.86 | 73.67 | 71.90 | |
Student | CenterPoint [13] | XXS | ✓ | 75.76 | 59.09 | 66.40 | 67.08 | |
Table 2. Comparison with other knowledge distillation methods on the KITTI validation set (moderate difficulty), using SFD as the teacher and a channel-reduced SFD with the second stage removed as the student.
Role | Model | Size | Car | Pedestrians | Cyclist
Tea | SFD [19] | base | 88.27 | 66.69 | 72.95
Stu | Reduced SFD | S | 86.16 | 63.47 | 70.14
 | +Vanilla KD [21] | S | 86.12 | 63.72 | 71.06
 | +GID [39] | S | 85.79 | 62.46 | 69.27
 | +SparseKD [8] | S | 87.11 | 65.29 | 71.15
 | +PointDistiller [7] | S | 86.84 | 64.12 | 70.74
 | +CaKDP [9] | S | 87.09 | 65.48 | 71.57
 | +Ours | S | 88.34 | 66.41 | 72.73
Table 3. Ablation studies on the SFD method.
$\mathcal{L}_{intra\_feat}$ | $\mathcal{L}_{att\_feat}$ | $\mathcal{L}_{cls}$ | Car | Pedestrians | Cyclist
× | × | × | 86.16 | 63.47 | 70.14
✓ | × | × | 87.14 | 64.51 | 71.07
× | ✓ | × | 86.87 | 64.37 | 70.86
× | × | ✓ | 86.94 | 65.14 | 71.24
✓ | ✓ | × | 87.49 | 65.25 | 70.97
✓ | × | ✓ | 87.74 | 65.75 | 71.94
× | ✓ | ✓ | 87.19 | 65.83 | 71.48
✓ | ✓ | ✓ | 88.34 | 66.41 | 72.73
Table 4. Validation results with TransFusion on the nuScenes dataset.
Role | Model | Size | mAP | mATE | mASE | mAOE | mAVE | mAAE | NDS
Tea | TransFusion [6] | base | 0.6388 | 0.2867 | 0.2555 | 0.2725 | 0.2624 | 0.1895 | 0.6927
Stu | Reduced TransFusion | S | 0.6230 | 0.3290 | 0.2569 | 0.3124 | 0.2683 | 0.1921 | 0.6756
Stu | Reduced TransFusion + Ours | S | 0.6387 | 0.2885 | 0.2564 | 0.2870 | 0.2638 | 0.1904 | 0.6907
Table 5. Results of homogeneous (two-stage to two-stage) distillation on the KITTI validation set.
Role | Model | Size | Car | Pedestrians | Cyclist
Tea | SFD | base | 88.27 | 66.69 | 72.95
Stu | SFD | S | 88.53 | 66.72 | 73.10
Tea | Voxel-RCNN | base | 85.56 | 59.20 | 74.60
Stu | Voxel-RCNN | S | 85.76 | 60.14 | 74.87
Tea | PV-RCNN | base | 85.50 | 59.21 | 73.13
Stu | PV-RCNN | S | 85.61 | 59.33 | 73.48
Tea | PartA2 | base | 83.65 | 61.78 | 74.08
Stu | PartA2 | S | 83.72 | 61.94 | 74.31