Article

UDF-3D: Uncertainty-Driven Decision-Level Fusion for Camera–LiDAR 3D Object Detection

1
School of Electrical and Control Engineering, Shaanxi University of Science and Technology, Xi’an 710021, China
2
School of Mechanical and Precision Instrument Engineering, Xi’an University of Technology, Xi’an 710048, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(12), 5983; https://doi.org/10.3390/app16125983 (registering DOI)
Submission received: 12 May 2026 / Revised: 4 June 2026 / Accepted: 9 June 2026 / Published: 12 June 2026
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Camera and LiDAR provide highly complementary information, and effective fusion of both modalities is desirable for 3D object detection. However, existing decision-level fusion methods mainly rely on the confidence of objects while neglecting the object uncertainty. To address this, we propose UDF-3D, an uncertainty-driven camera–LiDAR decision-level fusion method based on Dempster–Shafer evidence theory. First, object uncertainty is quantified by introducing the theory of subjective logic, where subjective opinions incorporate category belief masses and an uncertainty mass. Second, a cost matrix is designed for object matching, where each element is a weighted combination of geometric and semantic information from both sensors, and the weights are determined by the uncertainty parameters. Third, we construct a view-frustum constraint to re-evaluate unmatched objects, thereby reducing the false-negative rate. Finally, we design a novel evidence discounting factor within the Dempster–Shafer framework for matched objects, thereby mitigating cross-modal object conflicts during fusion and improving detection accuracy. Experiments on the KITTI dataset demonstrate that the proposed method outperforms existing decision-level fusion approaches, yielding improved detection accuracy.

1. Introduction

The two most widely used sensing modalities for object detection are RGB cameras and Light Detection and Ranging (LiDAR) [1]. These sensors are highly complementary in terms of information representation and perceptual capability. Cameras provide high-resolution images with rich texture and appearance cues, which are thus well-suited for object classification, 2D detection, and semantic segmentation. However, due to the lack of explicit depth information, it is difficult to obtain precise spatial location information from images. LiDAR directly measures sparse and irregular 3D point clouds, providing reliable geometric and range information and obtaining accurate characterization of the 3D scene structure [2]. Nevertheless, single-modality 3D detection methods still suffer from notable limitations. Image-based methods struggle to achieve high-precision 3D bounding box regression because depth and structural cues are missing. While LiDAR-based methods perform well for nearby and large objects (e.g., cars and trucks), they often miss distant or small objects (e.g., pedestrians and cyclists) due to extremely sparse points and may produce false positives when background structures exhibit geometries similar to target objects [3,4,5,6,7,8]. In addition, early approaches that project point clouds into view images or voxel grids inevitably lose fine-grained 3D details during quantization and projection, which constrains the upper bound of detection accuracy. Therefore, it is necessary to develop a fusion method to fully exploit the complementary information of camera images and LiDAR point clouds for robust object detection.
According to the fusion stage, existing fusion methods can be broadly categorized into data-level fusion, feature-level fusion, and decision-level fusion. Data-level and feature-level fusion align and combine multi-modal information at the perception front-end or in intermediate network layers. In principle, they can exploit cross-modal complementarity well, but they typically rely on accurate spatial–temporal calibration, are sensitive to misalignment errors, and often introduce complex network designs with challenging training procedures [9]. Decision-level fusion associates and merges candidate boxes across modalities at the detection-output level. It leverages mature single-modality detectors directly, resulting in simple system implementation and high engineering flexibility [10]. Consequently, a key advantage of decision-level fusion lies in its ability to fully exploit useful information from each modality, leading to more reliable final decisions.
Current camera–LiDAR decision-level fusion for 3D object detection still faces two key challenges. First, most decision fusion methods treat the class confidence scores produced by detectors as reliable probabilities, which lacks interpretability and makes it difficult to determine which modality should be trusted more under conflicts [11]. Second, the fusion of detection results itself often remains at a relatively shallow level. A typical practice is to perform target matching first, concatenate a few scalar features such as modality-specific scores and IoU, and then feed them into a lightweight scoring module or apply simple weighting to obtain a fused confidence score, while directly taking the matched 3D boxes as the final outputs. The core contribution of this work is UDF-3D, an uncertainty-driven decision-level fusion framework. The main innovations are summarized as follows.
(1) An uncertainty-driven object matching method is proposed for camera–LiDAR decision-level fusion. Unlike conventional decision-level matching strategies that rely only on bounding-box geometric overlap or classification confidence, this paper represents camera and LiDAR detection outputs as subjective opinions containing class belief masses and object uncertainty masses and uses the uncertainty to adaptively adjust the weights of geometric consistency and semantic consistency in the matching cost, thereby reducing invalid candidate correspondences and improving cross-modal matching accuracy.
(2) We propose a new camera–LiDAR decision-level fusion method, whose core consists of an improved evidence discounting fusion strategy based on Dempster–Shafer evidence theory and a frustum re-evaluation module. The improved evidence discounting fusion is used to alleviate cross-modal conflicts during the fusion process, thereby improving the accuracy of fusion-based detection. The frustum re-evaluation module targets the missed-detection problem by using unmatched 2D detections as semantic and geometric anchors to recover true objects that may have been suppressed by NMS from the pre-NMS LiDAR candidate set. This method combines the frustum constraint with the proposed evidence discounting fusion, forming a complete missed-object re-evaluation and conflict suppression process, thereby improving object recall and the reliability of fusion-based detection.
The remainder of this paper is organized as follows. Section 2 reviews related work on camera–LiDAR 3D object detection. Section 3 presents the proposed uncertainty-driven decision-level fusion framework, including object uncertainty quantification, uncertainty-weighted object matching, frustum re-evaluation, and uncertainty-discounted decision-level fusion. Section 4 reports experimental results and ablation studies on the public KITTI dataset to evaluate the effectiveness of the proposed method under the standard KITTI evaluation protocol. Section 5 concludes the paper and discusses future research directions.

2. Related Work

Camera–LiDAR fusion methods are commonly categorized into three paradigms according to the fusion stage: data-level, feature-level, and decision-level fusion.
Data-level fusion focuses on directly integrating multi-modal information at the raw-data stage to preserve observation fidelity as much as possible [12]. For example, PointPainting proposed by Vora et al. [13] projects LiDAR points onto the output of an image semantic segmentation network and assigns a class score to each point, thereby significantly increasing the information density of the original point cloud without altering the LiDAR backbone architecture. Similarly, Wen et al. [14] map RGB image features onto corresponding point clouds via a dedicated point-feature fusion module, achieving lightweight data-level fusion outside the main backbone network. Subsequent works such as PointAugmenting [15] and FusionPainting [16] further improve multi-modal detection performance by introducing cross-modal data augmentation or adaptive point painting strategies while keeping the LiDAR detector largely unchanged. Although these methods are advantageous in retaining information completeness, they are often highly sensitive to extrinsic calibration and temporal synchronization accuracy, adapt poorly to differences in sensor resolution and field of view (FoV), and may incur non-negligible overhead in computational complexity and system deployment.
Compared with data-level methods, feature-level fusion emphasizes deep interaction between multi-modal features at intermediate layers of neural networks. One representative early work is MV3D proposed by Chen et al. [17], which fuses multi-view features through dedicated subnetworks after multi-view data encoding, serving as a typical example of early feature-level fusion. EPNet++ [18] further introduces cascade bi-directional fusion between image and point-cloud features to strengthen multi-modal representation. DeepFusion [19] aligns and fuses camera features with deep LiDAR backbone features rather than performing “painting” only at the input point level, substantially improving the representation capacity and robustness of high-level multi-modal features. This indicates that cross-modal alignment and fusion in a more abstract feature space can yield better detection performance. Building on this line of work, BEVFusion [20] encodes LiDAR point clouds and multi-view images separately and projects them into a shared bird’s-eye-view (BEV) space, where multi-modal feature fusion is performed. This design preserves geometric structure while enhancing semantic density, achieving strong performance in multi-task and multi-sensor 3D perception settings. Chen et al. [21] further construct a dual-feature interaction module using Transformers, adopting a soft-fusion strategy for bidirectional deep interaction between LiDAR and camera features, and introduce an uncertainty-based 3D IoU metric to model the uncertainty induced by coupling among 3D attributes, thereby alleviating error propagation and improving accuracy and stability. Zhang et al. [22] propose the Adaptive Fusion Transformer (AFTR), which explicitly models spatial and temporal relationships among multi-sensor BEV features within the detection backbone. 
By leveraging adaptive spatial cross-attention and spatio-temporal self-attention, AFTR performs unified cross-modal and cross-time feature modeling and demonstrates stronger robustness in complex traffic scenarios. More recently, uncertainty-aware BEV fusion has also been explored. Xu et al. [23] propose UncertainBEV, an uncertainty-aware BEV fusion framework for roadside 3D object detection. By modeling the uncertainty of camera and LiDAR BEV features, UncertainBEV dynamically adjusts the fusion weights to alleviate cross-modal feature misalignment and improve BEV feature representation. While feature-level fusion methods can exploit the complementary strengths of different sensors through deep fusion, they are typically parameter-heavy and structurally complex, sensitive to data scale and training strategy, and may suffer from high training cost and overfitting risk.
Decision-level fusion combines independently generated outputs from modality-specific detectors to enhance overall robustness and flexibility. A typical pipeline runs LiDAR and camera detectors in parallel and then performs cross-modal matching and fusion at the candidate-box level. The CLOCs framework proposed by Pang et al. [24] jointly models candidate objects from cameras and LiDAR after any given 2D and 3D detectors and before non-maximum suppression (NMS). By learning geometric and semantic consistency, it produces fusion scores that significantly improve final 3D detection results without modifying the original detector architectures. Its subsequent version, Fast-CLOCs [25], further simplifies the candidate-level network to achieve near real-time, high-accuracy decision-level fusion. Shen et al. [26] propose a generalized optimization-based multi-modal fusion framework that formulates a late-fusion optimization problem using a bidirectional nearest-neighbor probabilistic model and 3D object tracking, improving accuracy and robustness without additional training. These studies suggest that decision-level fusion has inherent advantages in handling multi-modal heterogeneity, reusing existing detector architectures, and smoothly integrating new modalities, making it particularly suitable for incremental upgrades and extensions in engineering systems.
However, the performance ceiling of decision-level fusion is still largely constrained by the underlying single-modality detectors. A common issue is the overconfidence of single-modality outputs. For both LiDAR-based and image-based detectors, the final class confidence is often trained as a highly “polarized” point estimate: predictions deemed correct receive scores close to 1, whereas incorrect predictions receive very low scores. However, such scores do not effectively distinguish between two fundamentally different cases: well-evidenced certainty and miscalibrated overconfidence induced by model bias. In typical scenarios where cross-modal information conflicts (e.g., distant objects with extremely sparse point clouds, adverse illumination, or partial occlusion), two mutually contradictory yet highly confident detections can cause traditional fusion strategies to fail. Simple score weighting or purely geometry-based matching is often insufficient to produce reliable fusion results. This motivates decision-level fusion strategies that explicitly model uncertainty to improve the accuracy of subsequent fusion decisions.

3. Method

We propose UDF-3D, an uncertainty-driven decision-level fusion framework for camera–LiDAR 3D object detection, as illustrated in Figure 1. The framework involves three stages: (1) object uncertainty quantification, (2) uncertainty-weighted cross-modal object matching, and (3) frustum re-evaluation and uncertainty-discounted decision-level fusion.

3.1. Object Uncertainty Quantification

Inspired by evidential deep learning (EDL) [27] and its application to object-detection uncertainty quantification [28], we design an uncertainty quantification scheme that models the class probability vector with a Dirichlet distribution. The concentration parameters are constructed by mapping the pre-activation logits to non-negative evidence, and the belief mass and overall uncertainty mass are then derived under subjective logic [29]. For decision-level fusion, uncertainty provides additional reliability information beyond the original detection outputs. The fusion module operates at the detection-output level, where the inputs mainly include candidate boxes and class scores. These outputs indicate what each detector predicts, but they do not sufficiently describe whether the prediction itself is reliable. Therefore, we adopt a representation that can simultaneously describe class-level evidence support and overall detection uncertainty. The class belief masses provide a structured representation for measuring semantic consistency between camera and LiDAR detections, while the uncertainty mass serves as an explicit reliability indicator for adaptive matching and subsequent evidence fusion. Compared with probabilistic uncertainty estimation methods such as MC Dropout, Bayesian neural networks, or ensemble-based approaches, which usually require multiple forward passes, sampling, or model ensembles, the adopted representation can be obtained from a single forward pass of each detector with minimal additional computational cost. This makes it suitable for a lightweight decision-level fusion framework. Based on this representation, the detection outputs are converted into belief masses and uncertainty masses for subsequent matching and fusion. The detailed procedure is as follows.
Let $K$ denote the number of categories, and let $z = [z_1, z_2, \ldots, z_K] \in \mathbb{R}^K$ be the network’s pre-activation outputs. The non-negative evidence is obtained through a softplus mapping
$$e = \operatorname{softplus}(z) = [e_1, e_2, \ldots, e_K], \quad e_i \ge 0, \quad i = 1, \ldots, K. \tag{1}$$
Assume that the class-probability vector $p = [p_1, \ldots, p_K]$ lies on the K-simplex, i.e., $p_i \ge 0$ and $\sum_{i=1}^{K} p_i = 1$, and that $p$ follows a Dirichlet distribution:
$$p \sim \operatorname{Dir}(\alpha), \quad \alpha = [\alpha_1, \ldots, \alpha_K]. \tag{2}$$
The Dirichlet density is given by
$$\operatorname{Dir}(p \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} p_i^{\alpha_i - 1}. \tag{3}$$
Since a Dirichlet distribution is uniquely determined by its concentration parameters $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_K]$, it suffices to construct $\alpha$. In particular, we incorporate a base-rate prior $a$ and a prior weight $W$ to form
$$\alpha_i = e_i + a_i W, \quad i = 1, \ldots, K. \tag{4}$$
In practice, we set the base rate to a uniform distribution, $a_i = 1/K$, to avoid introducing any prior preference among classes, and choose the prior weight as $W = K$ so that the prior term $a_i W$ equals 1 for all classes. This simplifies the Dirichlet parameters to
$$\alpha = e + 1. \tag{5}$$
Let $S$ denote the Dirichlet strength, $S = \sum_{j=1}^{K} \alpha_j$. Following subjective logic, the belief mass assigned to class $i$ is defined as
$$b_i = \frac{e_i}{S}, \quad i = 1, \ldots, K. \tag{6}$$
Accordingly, under the Dirichlet distribution, the expected probability that the target belongs to class $i$ and the overall uncertainty mass are given by [29]
$$\mathbb{E}[p_i] = \frac{\alpha_i}{S}, \quad u = \frac{K}{S}. \tag{7}$$
Here, the belief mass $b_i$ represents the degree of evidence support assigned to class $i$, and its value is determined by the proportion of the evidence $e_i$ for this class in the total Dirichlet strength $S$. The uncertainty mass $u$ represents the overall uncertainty that is not assigned to any specific class. When the total evidence is limited, $S$ is smaller and $u = K / S$ becomes larger, indicating that the detection result lacks sufficient evidential support. The expected probability $\mathbb{E}[p_i]$ denotes the mean probability of class $i$ under the Dirichlet distribution, which combines the belief mass of this class with the uncertainty distributed according to the base rate. It should be noted that $u + \sum_{i=1}^{K} b_i = 1$, indicating the complete allocation of belief mass and uncertainty mass.
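The mapping from logits to a subjective opinion can be sketched in a few lines of Python. This is a minimal illustration under the uniform base rate and prior weight W = K stated above; the function names and the example logits are hypothetical and are not taken from the authors' implementation.

```python
import math

def softplus(x):
    # Numerically stable softplus: log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def subjective_opinion(logits):
    """Return (belief masses, uncertainty mass, expected probabilities)."""
    k = len(logits)
    evidence = [softplus(z) for z in logits]   # non-negative evidence e
    alpha = [e + 1.0 for e in evidence]        # alpha = e + 1
    strength = sum(alpha)                      # Dirichlet strength S
    belief = [e / strength for e in evidence]  # b_i = e_i / S
    uncertainty = k / strength                 # u = K / S
    expected = [a / strength for a in alpha]   # E[p_i] = alpha_i / S
    return belief, uncertainty, expected

# Hypothetical logits for a three-class detector.
belief, u, expected = subjective_opinion([4.0, -1.0, 0.5])
# Weak logits carry little evidence, so the uncertainty mass is larger.
_, u_weak, _ = subjective_opinion([-5.0, -5.0, -5.0])
```

By construction the belief masses and the uncertainty mass sum to one, and a detection with little evidence on every class ends up with an uncertainty mass close to one.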

3.2. Uncertainty-Weighted Cross-Modal Object Matching

For decision-level fusion, given image and LiDAR data captured at the same instant, each modality predicts multiple distinct detection boxes. Consequently, it is necessary to establish a one-to-one correspondence between detection boxes across the two modalities. The detailed workflow of the proposed uncertainty-weighted cross-modal matching module is summarized in Algorithm 1. Considering the inherent correlation between target geometry and category across different modalities, we construct a similarity matrix based on geometric and semantic consistency and incorporate uncertainty information to assign weights to these two factors. The specific details are as follows.
We denote the set of camera detections and each camera detection as
$$B^C = \{B_1^C, B_2^C, \ldots, B_M^C\}, \quad B_i^C = \{[x_i^C, y_i^C, h_i^C, w_i^C], b_i^C, u_i^C\}, \tag{8}$$
where $M$ is the number of camera detections, $(x_i^C, y_i^C)$ is the center of the 2D box, $h_i^C$ and $w_i^C$ are its height and width, $b_i^C$ is the belief-mass vector, and $u_i^C$ is the uncertainty of the $i$-th camera detection. Similarly, the set of LiDAR detections and each LiDAR detection are denoted as
$$B^L = \{B_1^L, B_2^L, \ldots, B_N^L\}, \quad B_j^L = \{[x_j^L, y_j^L, z_j^L, l_j^L, w_j^L, h_j^L, \theta_j^L], b_j^L, u_j^L\}. \tag{9}$$
Here, $N$ is the number of LiDAR detections, $(x_j^L, y_j^L, z_j^L)$ is the 3D box center in the LiDAR coordinate system, $l_j^L$, $w_j^L$, and $h_j^L$ denote the 3D box dimensions, $\theta_j^L$ is the heading angle, $b_j^L$ is the belief-mass vector, and $u_j^L$ is the uncertainty of the $j$-th LiDAR detection. For geometric matching, the LiDAR 3D box is projected onto the image plane, and $(\tilde{x}_j^L, \tilde{y}_j^L)$, $\tilde{h}_j^L$, and $\tilde{w}_j^L$ denote the center, height, and width of its projected 2D box.
Algorithm 1 Uncertainty-weighted cross-modal matching
Input: Camera detections $B^C$, LiDAR detections $B^L$.
Output: Matching result.
 1: function UWeightedMatching($B^C$, $B^L$)
 2:     Project each LiDAR box $B_j^L$ onto the image plane
 3:     for each pair $(B_i^C, B_j^L)$ do
 4:         Compute normalized center distance $d_{ij}^{\mathrm{norm}}$
 5:         if $d_{ij}^{\mathrm{norm}} \le \tau_d$ then
 6:             $S_{ij} \leftarrow w_g \, \mathrm{IoU}_{ij} + w_s \, L_{\mathrm{sem},ij}$
 7:             $C_{ij} \leftarrow 1 - S_{ij}$
 8:         else
 9:             $C_{ij} \leftarrow T$
10:         end if
11:     end for
12:     Matching result ← HungarianAssignment($C$)
13:     return Matching result
14: end function
Normalized center-distance gating. To reduce computational burden, we first introduce normalized center-distance gating to filter out candidate pairs with a low matching probability. The normalized center distance is given by
$$d_{ij}^{\mathrm{norm}} = \sqrt{\left(\frac{x_i^C - \tilde{x}_j^L}{\bar{w}_{ij}}\right)^2 + \left(\frac{y_i^C - \tilde{y}_j^L}{\bar{h}_{ij}}\right)^2}. \tag{10}$$
Here, $(x_i^C, y_i^C)$ and $(\tilde{x}_j^L, \tilde{y}_j^L)$ are the center coordinates of the $i$-th camera detection box and the projected 2D box of the $j$-th LiDAR detection, respectively. Their average height and width are calculated as $\bar{h}_{ij} = (h_i^C + \tilde{h}_j^L) / 2$ and $\bar{w}_{ij} = (w_i^C + \tilde{w}_j^L) / 2$, respectively. We use image detection boxes as the reference and form candidate pairs with neighboring LiDAR detection boxes. Based on the normalized center distance, the candidate pair set is defined as
$$G = \{(i, j) \mid d_{ij}^{\mathrm{norm}} \le \tau_d\}, \tag{11}$$
where $\tau_d$ is the normalized center-distance threshold. If the normalized center distance between two detection boxes is excessively large, the pair is deemed to lack matching potential and is excluded from the subsequent Hungarian matching algorithm. Only candidate pairs belonging to $G$ are retained.
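The gating step can be illustrated with a short Python sketch. It assumes the normalized center distance is the Euclidean norm of the width- and height-normalized center offsets, with boxes given as (x center, y center, height, width) in pixels. The threshold value and the example boxes are hypothetical.

```python
import math

def normalized_center_distance(cam_box, proj_box):
    """Boxes are (x_center, y_center, height, width) in pixels."""
    xc, yc, hc, wc = cam_box
    xl, yl, hl, wl = proj_box
    w_bar = (wc + wl) / 2.0  # average width of the two boxes
    h_bar = (hc + hl) / 2.0  # average height of the two boxes
    return math.hypot((xc - xl) / w_bar, (yc - yl) / h_bar)

def gate_pairs(cam_boxes, proj_boxes, tau_d):
    """Return the candidate set G of index pairs that pass the gate."""
    return {
        (i, j)
        for i, cam in enumerate(cam_boxes)
        for j, proj in enumerate(proj_boxes)
        if normalized_center_distance(cam, proj) <= tau_d
    }

# Hypothetical detections: only the first camera box has a nearby LiDAR box.
cam_boxes = [(100.0, 100.0, 40.0, 20.0), (300.0, 120.0, 60.0, 30.0)]
proj_boxes = [(104.0, 102.0, 44.0, 22.0), (500.0, 100.0, 40.0, 20.0)]
candidates = gate_pairs(cam_boxes, proj_boxes, tau_d=1.0)  # tau_d is assumed
```

Normalizing by the average box size makes the gate scale-aware: a fixed pixel offset is penalized more for a small distant box than for a large nearby one.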
Geometric consistency. In practice, the bounding boxes of the same object in the image and point cloud have consistent geometric boundaries after coordinate alignment. Therefore, we use the Intersection over Union (IoU) metric to measure geometric consistency, defined as
$$\mathrm{IoU}_{ij} = \frac{|o_j^L \cap o_i^C|}{|o_j^L \cup o_i^C|}, \tag{12}$$
where $o_j^L$ and $o_i^C$ denote the 2D boxes corresponding to the $j$-th LiDAR detection (after 3D-to-2D projection) and the $i$-th camera detection, respectively.
However, geometric overlap alone may be insufficient for reliable matching because projected boxes, especially small boxes at long range, can be sensitive to localization and calibration errors.
Semantic consistency. To complement the geometric relationship, we incorporate semantic consistency into the matching process. In this paper, we construct a semantic similarity measure using belief masses to reflect whether the two sensors provide consistent opinions about the same object. The belief masses derived from the evidence vector represent the opinion associated with each detection box, and the Bhattacharyya coefficient is adopted to compute the similarity between the belief-mass vectors of the two modalities. Accordingly, the semantic similarity is defined as
$$L_{\mathrm{sem},ij} = \sum_{k=1}^{K} \sqrt{b_{j,k}^L \, b_{i,k}^C}, \tag{13}$$
where $b_{j,k}^L$ and $b_{i,k}^C$ denote the belief masses of the $j$-th LiDAR detection and the $i$-th camera detection for class $k$, respectively, and $K$ is the number of classes.
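A minimal Python sketch of the Bhattacharyya-style semantic similarity follows. The belief vectors are hypothetical; each sums to 0.85, which corresponds to an uncertainty mass of 0.15.

```python
import math

def semantic_similarity(belief_lidar, belief_camera):
    """Bhattacharyya coefficient between two belief-mass vectors."""
    return sum(
        math.sqrt(b_l * b_c) for b_l, b_c in zip(belief_lidar, belief_camera)
    )

# Hypothetical three-class belief masses.
lidar = [0.70, 0.10, 0.05]
camera_agree = [0.70, 0.10, 0.05]
camera_conflict = [0.05, 0.10, 0.70]

agree = semantic_similarity(lidar, camera_agree)
conflict = semantic_similarity(lidar, camera_conflict)
```

Because belief masses sum to 1 − u rather than to 1, two identical opinions score 1 − u instead of 1. Uncertain detections therefore receive a lower semantic similarity even when their class opinions agree.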
Uncertainty-Weighted Matching (U-weighted Matching). The proposed U-weighted Matching assigns adaptive weights to geometric and semantic consistency according to the uncertainty of each modality. Specifically, we construct the similarity matrix as
$$S = \begin{bmatrix} S_{11} & S_{12} & \cdots & S_{1N} \\ S_{21} & S_{22} & \cdots & S_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ S_{M1} & S_{M2} & \cdots & S_{MN} \end{bmatrix}, \tag{14}$$
where each entry is computed by a weighted combination of geometric and semantic consistency:
$$S_{ij} = w_g \, \mathrm{IoU}_{ij} + w_s \, L_{\mathrm{sem},ij}. \tag{15}$$
Here, $w_g$ and $w_s$ denote the adaptive weights for the geometric term $\mathrm{IoU}_{ij}$ and the semantic term $L_{\mathrm{sem},ij}$, respectively. They are defined as
$$w_g = \frac{(1 - u_j^L) \exp\left(-\gamma \left(\frac{d}{d_{\max}}\right)^2\right)}{(1 - u_i^C) + (1 - u_j^L) \exp\left(-\gamma \left(\frac{d}{d_{\max}}\right)^2\right)}, \quad w_s = 1 - w_g, \tag{16}$$
where $u_i^C$ and $u_j^L$ are the uncertainties of the $i$-th image detection box and the $j$-th point-cloud detection box, respectively, $d_{\max}$ denotes the maximum sensing distance of the LiDAR, $d$ is the Euclidean distance from the center of the point-cloud detection box to the LiDAR origin, and $\gamma$ is a decay factor that controls the attenuation rate.
The weight design in Equation (16) reflects the fact that geometric and semantic consistency have different levels of reliability under different conditions. For nearby LiDAR detections with low uncertainty, the projected boxes usually provide more reliable geometric relationships, and thus geometric consistency is assigned a larger weight. In contrast, for distant objects or detections with high uncertainty, the projected boxes are more likely to be affected by sparse point clouds, localization errors, and calibration errors. In such cases, relying only on geometric overlap may lead to incorrect matching, while the semantic consistency represented by class belief masses provides an additional class-consistency constraint. Therefore, the adaptive weights in Equation (16) allow the matching cost to dynamically adjust the contributions of geometric and semantic consistency according to detection reliability.
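The behavior of the adaptive weights can be checked with a small Python sketch. It reads the distance term as a decaying exponential, exp(−γ (d / d_max)²), which is an interpretation of the weight definition rather than a confirmed detail. The values of `d_max` and `gamma` are hypothetical.

```python
import math

def adaptive_weights(u_cam, u_lidar, d, d_max, gamma):
    """Return (w_g, w_s) for one camera-LiDAR candidate pair."""
    # LiDAR reliability decays with range and with its own uncertainty.
    lidar_term = (1.0 - u_lidar) * math.exp(-gamma * (d / d_max) ** 2)
    cam_term = 1.0 - u_cam
    w_g = lidar_term / (cam_term + lidar_term)
    return w_g, 1.0 - w_g

# Hypothetical settings: 70 m sensing range, decay factor 2.
near = adaptive_weights(u_cam=0.2, u_lidar=0.2, d=5.0, d_max=70.0, gamma=2.0)
far = adaptive_weights(u_cam=0.2, u_lidar=0.2, d=60.0, d_max=70.0, gamma=2.0)
unsure = adaptive_weights(u_cam=0.2, u_lidar=0.8, d=5.0, d_max=70.0, gamma=2.0)
```

With equal uncertainties, the geometric weight starts near 0.5 for close objects and shrinks with range, so semantic consistency dominates for distant or uncertain LiDAR detections.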
We then convert the similarity matrix into a cost matrix for Hungarian assignment. For the candidate pairs that pass the normalized center-distance gating, the matching cost is defined as $1 - S_{ij}$. For the remaining invalid pairs, a large penalty is assigned:
$$C_{ij} = \begin{cases} 1 - S_{ij}, & (i, j) \in G, \\ T, & (i, j) \notin G, \end{cases} \tag{17}$$
where $G$ is the candidate pair set defined by normalized center-distance gating, $T$ is a sufficiently large constant representing the maximum assignment cost, and $S_{ij}$ is given by Equation (15). Thus, the cost matrix is
$$C = \begin{bmatrix} C_{11} & C_{12} & \cdots & C_{1N} \\ C_{21} & C_{22} & \cdots & C_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ C_{M1} & C_{M2} & \cdots & C_{MN} \end{bmatrix}. \tag{18}$$
Finally, the Hungarian algorithm is applied to C to obtain the one-to-one matching results. Assignments with the penalty cost T are regarded as invalid and are treated as unmatched detections.
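The cost construction and the rejection of penalty assignments can be sketched as follows. To keep the example free of dependencies, an exhaustive search over permutations stands in for the Hungarian algorithm; it returns the same minimum-cost assignment but is only practical for a handful of boxes. In practice a polynomial-time solver such as SciPy's `linear_sum_assignment` would take its place. The penalty value and the similarity matrix are hypothetical.

```python
from itertools import permutations

PENALTY = 1e6  # hypothetical value for the large constant T

def build_cost(similarity, gated, penalty=PENALTY):
    """C_ij = 1 - S_ij for gated pairs, penalty otherwise."""
    return [
        [1.0 - s if (i, j) in gated else penalty for j, s in enumerate(row)]
        for i, row in enumerate(similarity)
    ]

def assign(cost, penalty=PENALTY):
    """Exhaustive minimum-cost one-to-one assignment (tiny inputs only)."""
    m, n = len(cost), len(cost[0])
    if m <= n:
        # Choose one distinct column for every row.
        best = min(
            permutations(range(n), m),
            key=lambda cols: sum(cost[i][j] for i, j in enumerate(cols)),
        )
        pairs = list(enumerate(best))
    else:
        # Choose one distinct row for every column.
        best = min(
            permutations(range(m), n),
            key=lambda rows: sum(cost[i][j] for j, i in enumerate(rows)),
        )
        pairs = [(i, j) for j, i in enumerate(best)]
    # Assignments that carry the penalty cost are treated as unmatched.
    return sorted((i, j) for i, j in pairs if cost[i][j] < penalty)

similarity = [[0.9, 0.2, 0.1], [0.3, 0.8, 0.4]]  # 2 camera x 3 LiDAR boxes
gated = {(0, 0), (1, 1), (1, 2)}
matches = assign(build_cost(similarity, gated))
```

Here both camera boxes are matched and the third LiDAR box stays unmatched, so it would move on to the unmatched-detection handling.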

3.3. Frustum Re-Evaluation for Unmatched Detections

After applying the uncertainty-weighted object matching described in Section 3.2, the detections are divided into matched pairs and unmatched detections. In practical applications, due to the distinct sensing characteristics of image and point-cloud detectors, the two modalities may produce inconsistent predictions. Therefore, enforcing one-to-one matching for all detection boxes is not always feasible. A subset of detections may remain unmatched, including 2D boxes without corresponding 3D counterparts and 3D boxes without corresponding 2D observations.
To handle these unmatched detections, we design a filtering criterion that jointly considers the maximum expected class probability and the corresponding uncertainty:
$$\max_i \mathbb{E}[p_i] \ge \tau_p, \quad u \le \tau_u, \tag{19}$$
where $\tau_p$ and $\tau_u$ are the thresholds for the expected probability and the uncertainty, respectively. For unmatched LiDAR detections, this criterion is used to decide whether the 3D box should be retained in the final output. For unmatched camera detections, the retained 2D boxes are not directly output as 3D detections. Instead, they are used as anchors for the subsequent frustum re-evaluation module.
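As a minimal sketch, the filtering criterion amounts to a conjunction of two threshold tests, read here as "expected probability high enough and uncertainty low enough". The threshold values below are hypothetical.

```python
def keep_unmatched(expected_probs, uncertainty, tau_p, tau_u):
    """Retain a detection only if it is both confident and well supported."""
    return max(expected_probs) >= tau_p and uncertainty <= tau_u

TAU_P, TAU_U = 0.6, 0.4  # assumed thresholds, for illustration only

confident = keep_unmatched([0.8, 0.1, 0.1], 0.2, TAU_P, TAU_U)
ambiguous = keep_unmatched([0.5, 0.3, 0.2], 0.2, TAU_P, TAU_U)
unsupported = keep_unmatched([0.8, 0.1, 0.1], 0.5, TAU_P, TAU_U)
```

The third case shows why the uncertainty test matters: a detection with a high expected class probability is still rejected when its total evidence is low.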
For unmatched camera detections, the main difficulty is that they provide reliable 2D image boxes but do not contain explicit 3D box parameters. Directly discarding these detections may increase false negatives. This problem is more serious for distant pedestrians and cyclists. In these cases, LiDAR points are usually sparse, and the corresponding 3D candidates may receive low confidence scores or be suppressed by NMS. To address this problem, we design a frustum re-evaluation module. This module uses the retained 2D detection box as a semantic and geometric anchor to recover potential 3D candidates from the LiDAR candidate set.
Specifically, for each retained unmatched 2D box, we use the camera intrinsic parameters and the camera–LiDAR calibration matrix to back-project its image-plane region into 3D space and construct the corresponding view frustum. As shown in Figure 2, for a given 2D detection box, the camera optical center, i.e., the origin of the camera coordinate system, is connected to the four corners of the box on the image plane. In this way, four boundary rays are formed in 3D space. LiDAR points whose projections fall inside the 2D detection box are located within the spatial region enclosed by these four boundary rays. Since the unique depth of the object cannot be determined from the 2D box alone, the depth range is determined by the valid spatial range of the LiDAR point cloud. Therefore, the four projection rays and the valid depth range jointly form a 3D frustum region. This region represents the possible spatial range of the object corresponding to the 2D box and provides a geometric constraint for searching or filtering 3D candidates.
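The frustum constraint is equivalent to a projection test: a 3D point lies inside the frustum when its image projection falls inside the 2D box and its depth lies in the valid range. The Python sketch below assumes a single 3x4 matrix `P` that already combines the camera intrinsics with the camera–LiDAR calibration, and a box given by its corners. The matrix entries, box, and depth range are hypothetical.

```python
def in_frustum(P, point, box, depth_range):
    """Check whether a 3D point lies inside the frustum of a 2D box.

    P is a 3x4 projection matrix, box is (x_min, y_min, x_max, y_max)
    in pixels, and depth_range is (near, far) along the optical axis.
    """
    x, y, z = point
    hu, hv, depth = (r[0] * x + r[1] * y + r[2] * z + r[3] for r in P)
    near, far = depth_range
    if depth <= 0.0 or not (near <= depth <= far):
        return False  # behind the camera or outside the valid range
    u, v = hu / depth, hv / depth  # perspective division to pixels
    x_min, y_min, x_max, y_max = box
    return x_min <= u <= x_max and y_min <= v <= y_max

# Hypothetical pinhole model: focal length 700 px, principal point (600, 180).
P = [[700.0, 0.0, 600.0, 0.0], [0.0, 700.0, 180.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
box = (550.0, 150.0, 650.0, 250.0)
depth_range = (1.0, 70.0)

inside = in_frustum(P, (0.0, 0.0, 20.0), box, depth_range)
off_axis = in_frustum(P, (5.0, 0.0, 20.0), box, depth_range)
too_far = in_frustum(P, (0.0, 0.0, 100.0), box, depth_range)
```

Testing candidates by projection avoids building the four boundary planes explicitly, while describing the same region as the four rays and the depth range.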
After the frustum is constructed, it is used to constrain the pre-NMS LiDAR candidate boxes, as illustrated in Figure 3. Instead of searching all LiDAR detections after NMS, we use the pre-NMS LiDAR candidate set as the recovery source. The reason is that NMS may suppress true objects when their confidence scores are low or when they overlap with higher-scoring but less accurate boxes. This situation is common for small, distant, or partially occluded objects. By revisiting the pre-NMS candidates under the frustum constraint, the proposed module can recover valid 3D boxes that have been removed before the final LiDAR outputs are generated.
To avoid introducing too many false positives from the large pre-NMS candidate set, we apply strict frustum-based geometric filtering. A LiDAR candidate is retained only when its projected 3D box is spatially consistent with the frustum generated by the 2D anchor. In practice, this step removes candidates that are geometrically incompatible with the image detection and keeps only a small number of possible 3D hypotheses. For the remaining candidates, we recompute the cross-modal similarity using the U-weighted Matching strategy described in Section 3.2. If the matching cost satisfies the acceptance threshold, the candidate with the lowest matching cost is selected as the recovered 3D counterpart of the 2D detection. This avoids forcibly matching a 2D detection with an unreliable 3D candidate when no valid candidate exists in the frustum.
Following the complete workflow in Figure 4, after re-association, the recovered 2D–3D pair is not directly added to the final detection set. Instead, it is further evaluated by the uncertainty-discounted fusion strategy described in Section 3.4. The fused belief mass and uncertainty are then used to decide whether the recovered object should be retained. Therefore, the frustum re-evaluation module does not improve recall by simply adding more proposals. It first performs constrained candidate recovery and then applies uncertainty-aware evidence verification. This design helps recover true objects suppressed by NMS while limiting the introduction of false positives.

3.4. Uncertainty-Discounted Decision-Level Fusion

After uncertainty-weighted matching and frustum-based re-evaluation, the associated camera–LiDAR detection pairs are fused to obtain the final detection decisions. Given an associated pair $(B_i^C, B_j^L)$, the two detectors can be regarded as providing two independent opinions about the same object. We aim to combine these opinions via Dempster–Shafer evidence theory [30] to obtain a more reliable and interpretable fused result. However, directly applying the classical Dempster–Shafer combination rule may yield abnormal fusion results under conflicts, such as an extreme bias toward a single source. Therefore, we adopt a discounting-based approach [31] to mitigate the impact of conflicts while preserving the desirable properties of the Dempster–Shafer rule.
To handle potential conflicts during fusion, we adopt an adaptive discounting scheme. Specifically, we use the Jensen–Shannon (JS) divergence to measure the conflict degree between two detections:
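$$D_{\mathrm{JS}}\left(p^C \,\|\, p^L\right) = \frac{1}{2} D_{\mathrm{KL}}\left(p^C \,\|\, m\right) + \frac{1}{2} D_{\mathrm{KL}}\left(p^L \,\|\, m\right),$$
where $p^s = \left[ \frac{\alpha_1^s}{S^s}, \frac{\alpha_2^s}{S^s}, \ldots, \frac{\alpha_K^s}{S^s} \right]$, $s \in \{C, L\}$, represents the expected probability vector, $m = \frac{1}{2}\left(p^C + p^L\right)$, and the KL divergence is defined as
$$D_{\mathrm{KL}}\left(p^s \,\|\, m\right) = \sum_{k=1}^{K} p_k^s \log \frac{p_k^s}{m_k}.$$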
We normalize the JS divergence to obtain the conflict degree:
$$\mathrm{conf} = \frac{D_{\mathrm{JS}}\left(p^C \,\|\, p^L\right)}{\log 2}.$$
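The conflict degree can be checked numerically with a short Python sketch. It assumes that the expected probabilities are obtained from the Dirichlet parameters as $p_k = \alpha_k / S$ with $S = \sum_k \alpha_k$, and that the natural logarithm is used, so that dividing by $\log 2$ maps the result to $[0, 1]$. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def expected_prob(alpha: Sequence[float]) -> List[float]:
    """Expected class probabilities p_k = alpha_k / S, with S = sum(alpha)."""
    s = sum(alpha)
    return [a / s for a in alpha]


def kl_divergence(p: Sequence[float], m: Sequence[float]) -> float:
    """KL(p || m) with the natural log; terms with p_k = 0 contribute 0."""
    return sum(pk * math.log(pk / mk) for pk, mk in zip(p, m) if pk > 0.0)


def conflict_degree(alpha_c: Sequence[float], alpha_l: Sequence[float]) -> float:
    """Normalized Jensen-Shannon divergence between camera and LiDAR opinions."""
    p_c, p_l = expected_prob(alpha_c), expected_prob(alpha_l)
    m = [0.5 * (a + b) for a, b in zip(p_c, p_l)]  # mixture distribution
    d_js = 0.5 * kl_divergence(p_c, m) + 0.5 * kl_divergence(p_l, m)
    return d_js / math.log(2.0)
```

Identical class opinions give a conflict degree of 0, and opinions with disjoint support approach 1, which is why the value can be used directly as an entry of a correlation-style matrix.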
Then, we construct the correlation matrix:
$$R = \begin{bmatrix} 1 & 1 - \mathrm{conf} & 1 - u^C \\ 1 - \mathrm{conf} & 1 & 1 - u^L \\ 1 - u^C & 1 - u^L & 1 \end{bmatrix},$$
where $u^C$ and $u^L$ are the uncertainties of the camera and LiDAR detections, respectively. The correlation matrix jointly considers the conflict between the two class opinions and the reliability of each modality.
We compute the eigenvector corresponding to the maximum eigenvalue:
$$R \beta = \lambda_{\max} \beta,$$
where $\beta = (\beta_C, \beta_L, \beta_0)$. The eigenvector corresponding to the maximum eigenvalue is used to derive the relative evidence weights from the correlation matrix. If one modality has low uncertainty and agrees well with the other modality, its corresponding eigenvector component becomes larger, and more of its evidence is retained. Conversely, if one modality has high uncertainty or strong conflict, its component becomes smaller, and its evidence is discounted before Dempster–Shafer fusion. Based on these relative weights, the discounting coefficients are constructed as
$$\bar{\beta}_s = \frac{\beta_s}{\max(\beta)}, \quad s \in \{C, L\}.$$
Thus, the discounted evidence is given by
$$\hat{e}_i^s = \bar{\beta}_s \, e_i^s.$$
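The construction of the discounting coefficients can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' code: the principal eigenvector is approximated by power iteration (a standard technique that is adequate here because the matrix is symmetric with non-negative entries), and $\max(\beta)$ is taken over all three components. The conflict degree and the two uncertainties are supplied as inputs; the function names are hypothetical.

```python
from typing import List, Tuple


def correlation_matrix(conf: float, u_c: float, u_l: float) -> List[List[float]]:
    """3x3 symmetric matrix built from the conflict degree and uncertainties."""
    return [
        [1.0, 1.0 - conf, 1.0 - u_c],
        [1.0 - conf, 1.0, 1.0 - u_l],
        [1.0 - u_c, 1.0 - u_l, 1.0],
    ]


def principal_eigenvector(r: List[List[float]], iters: int = 200,
                          tol: float = 1e-12) -> List[float]:
    """Power iteration started from a positive vector. The result is scaled
    so that its largest component equals 1."""
    v = [1.0] * len(r)
    for _ in range(iters):
        w = [sum(rij * vj for rij, vj in zip(row, v)) for row in r]
        top = max(w)
        w = [x / top for x in w]
        if max(abs(a - b) for a, b in zip(w, v)) < tol:
            return w
        v = w
    return v


def discount_coefficients(conf: float, u_c: float, u_l: float) -> Tuple[float, float]:
    """Return (beta_bar_C, beta_bar_L) = (beta_C, beta_L) / max(beta)."""
    beta_c, beta_l, beta_0 = principal_eigenvector(correlation_matrix(conf, u_c, u_l))
    top = max(beta_c, beta_l, beta_0)
    return beta_c / top, beta_l / top
```

With a highly uncertain camera opinion and a confident LiDAR opinion, for instance `discount_coefficients(0.3, 0.8, 0.1)`, the camera coefficient comes out smaller than the LiDAR coefficient, so camera evidence is scaled down more strongly before fusion.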
After evidence discounting, the corresponding discounted belief masses and uncertainty are recomputed. Then, the discounted opinions are combined under the Dempster–Shafer rule to obtain the fused belief masses and uncertainty:
$$\hat{b}_i = \frac{\hat{b}_i^C \hat{b}_i^L + \hat{b}_i^C \hat{u}^L + \hat{b}_i^L \hat{u}^C}{1 - \kappa}, \qquad \hat{u} = \frac{\hat{u}^L \hat{u}^C}{1 - \kappa},$$
where $\kappa = \sum_{i=1}^{K} \sum_{j=1,\, j \neq i}^{K} \hat{b}_i^C \hat{b}_j^L$ denotes the conflict mass between mutually exclusive class hypotheses. We take the maximum expected class probability after fusion as the final confidence score. In this way, the proposed uncertainty-discounted fusion strategy mitigates cross-modal object conflicts during fusion and improves the reliability of final detection decisions. Figure 5 illustrates a typical category-conflict case, where inconsistent camera and LiDAR class evidence is corrected after applying the discounting strategy before final Dempster–Shafer fusion.
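The discount-then-combine procedure can be illustrated with the following Python sketch. It assumes the usual subjective-logic mapping from evidence to opinions ($\alpha_k = e_k + 1$, $S = \sum_k \alpha_k$, $b_k = e_k / S$, $u = K / S$) and an expected class probability of $b_k + u / K$ under a uniform base rate. These conventions and the function names are assumptions of the sketch, and the discounting coefficients are passed in as given values.

```python
from typing import List, Sequence, Tuple

Opinion = Tuple[List[float], float]  # (belief masses, uncertainty mass)


def opinion_from_evidence(evidence: Sequence[float]) -> Opinion:
    """Assumed mapping: alpha_k = e_k + 1, S = sum(alpha), b_k = e_k / S, u = K / S."""
    k = len(evidence)
    s = sum(evidence) + k
    return [e / s for e in evidence], k / s


def ds_combine(cam: Opinion, lidar: Opinion) -> Opinion:
    """Reduced Dempster-Shafer rule for two opinions over the same K classes."""
    (b_c, u_c), (b_l, u_l) = cam, lidar
    k = len(b_c)
    # Conflict mass: belief assigned to different classes by the two sources.
    kappa = sum(b_c[i] * b_l[j] for i in range(k) for j in range(k) if i != j)
    norm = 1.0 - kappa
    fused_b = [(b_c[i] * b_l[i] + b_c[i] * u_l + b_l[i] * u_c) / norm
               for i in range(k)]
    return fused_b, u_c * u_l / norm


def fuse_pair(e_c: Sequence[float], e_l: Sequence[float],
              beta_c: float, beta_l: float) -> Tuple[List[float], float, float]:
    """Discount each modality's evidence, rebuild the opinions, and fuse them.
    Returns (fused beliefs, fused uncertainty, confidence score)."""
    cam = opinion_from_evidence([beta_c * e for e in e_c])
    lidar = opinion_from_evidence([beta_l * e for e in e_l])
    b, u = ds_combine(cam, lidar)
    score = max(bk + u / len(b) for bk in b)  # max expected class probability
    return b, u, score
```

Because the fused belief masses and the fused uncertainty always sum to one, the output remains a valid opinion. Two agreeing sources yield a higher belief and lower uncertainty than either source alone, whereas a strongly discounted source has little influence on a conflicting one.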

4. Experiments

We conducted a comprehensive experimental evaluation of the proposed method on the KITTI [32] dataset. Following the standard KITTI evaluation protocol, we reported results on the three main categories, i.e., Car, Pedestrian, and Cyclist.

4.1. Dataset and Evaluation Metrics

KITTI dataset. We performed experiments on the KITTI dataset, which is widely used in autonomous driving. KITTI consists of urban driving scenes and provides high-quality annotations for multiple perception tasks, including 3D object detection, optical flow estimation, and depth estimation. The data collection platform is equipped with a pair of high-resolution color cameras, a Velodyne 64-beam LiDAR, and a GPS/IMU system. The annotations are derived primarily from LiDAR point clouds, yielding accurate 3D bounding-box labels. For 3D object detection, KITTI contains 7481 labeled images. Following the standard split, 3712 images are used for training and 3769 images are used for validation. In addition, a test set whose ground-truth labels are withheld is reserved for online benchmark evaluation. The annotations cover multiple categories, including Car, Pedestrian, and Cyclist. For each category, the dataset further defines three difficulty levels (Easy, Moderate, and Hard) based on object size, occlusion, and truncation, enabling thorough evaluation under varying challenges.
Evaluation metrics. We followed the official KITTI evaluation protocol for 3D object detection. The primary metric is Average Precision (AP). Specifically, for each object category, AP is computed under the three predefined difficulty levels (Easy, Moderate, and Hard). The evaluation is based on the 3D bounding-box Intersection over Union (IoU), with IoU thresholds set to 0.7 for Car and 0.5 for Pedestrian and Cyclist. We report AP computed with 40 recall positions ($\mathrm{AP}_{R40}$) as the main result, and AP under the Moderate level is used as the key criterion for method ranking.
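For readers unfamiliar with the 40-point metric, the following Python sketch shows interpolated AP averaged over the recall positions 1/40, 2/40, ..., 1. It is a simplified illustration: the official KITTI evaluation code additionally handles detection matching, difficulty filtering, and score-threshold sampling, and may differ in detail. The function name and the input format (paired recall and precision values) are assumptions.

```python
from typing import Sequence


def ap_r40(recalls: Sequence[float], precisions: Sequence[float]) -> float:
    """Interpolated AP over 40 recall positions.

    For each position r = n / 40 (n = 1..40), take the highest precision
    achieved at any recall >= r, or 0 if that recall is never reached.
    """
    total = 0.0
    for n in range(1, 41):
        r = n / 40.0
        reachable = [p for rc, p in zip(recalls, precisions) if rc >= r]
        total += max(reachable) if reachable else 0.0
    return total / 40.0
```

A detector that never exceeds 50% recall is therefore capped at an AP of 0.5 even if every detection it returns is correct, which is one reason why recovering missed objects matters for this metric.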

4.2. Detector Settings

To assess the generality and effectiveness of the proposed framework, we instantiated it on a set of widely adopted detection baselines that represent different design paradigms. We adopted YOLOv8 [33] and RT-DETR [34] as 2D detectors because of their favorable accuracy–speed trade-offs and their broad use as high-performance detection models. We used SECOND [3], PointPillars [4], and PartA2 [5] as the LiDAR 3D detection baselines in the proposed fusion framework. All three LiDAR detectors use models trained with the OpenPCDet framework and were evaluated under the same KITTI data split and evaluation protocol. When they were integrated with the proposed UDF-3D framework, their detector configurations were kept unchanged to ensure fair internal comparisons and ablation studies. These baselines are standard and representative open-source architectures in the community. Although they exhibit different performance levels on the KITTI leaderboard, they provide a solid testbed for evaluating the robustness and general applicability of our fusion framework. Experimental results show that the proposed method consistently improves the performance of these baseline 3D detectors.

4.3. Results and Analysis

This subsection provides a comprehensive evaluation of the proposed LiDAR–image decision-level fusion method on the KITTI validation set. We followed the official evaluation protocol and reported both 3D $\mathrm{AP}_{R40}$ and BEV $\mathrm{AP}_{R40}$ as the core metrics. Our analysis focuses on (1) comparisons with LiDAR-only baselines to verify the effectiveness of the fusion strategy and (2) comparisons with existing multi-modal fusion methods, especially the classical decision-level fusion approach CLOCs, to position the contribution and competitiveness of our method.
As shown in Table 1 and Table 2, the proposed fusion framework achieves clear improvements over the original baselines across different LiDAR detectors, including SECOND, PointPillars, and PartA2, as well as different 2D detector settings. This indicates that the performance gain of the proposed method does not depend on a specific detector combination. In these two tables, green values indicate the best result within each detector group, and blue values indicate the best result in each column across the whole table. The category-wise results show that the improvements are more evident for Pedestrian and Cyclist. These categories usually have sparse point-cloud observations, smaller object scales, and more frequent occlusions, and therefore, camera semantic cues can provide stronger complementary information. For the Car category, although the LiDAR baselines already achieve relatively high detection accuracy, the proposed method still brings stable improvements. The qualitative results in Figure 6 further show that the proposed fusion strategy can reduce typical false positives and missed detections produced by the LiDAR-only baseline.
To position the proposed method among existing camera–LiDAR fusion frameworks, we compared UDF-3D with representative fusion methods, including data-level fusion methods, feature-level fusion methods, and the classical decision-level fusion method CLOCs. The reported CLOCs results correspond to its commonly used SECOND + Cascade-RCNN detector setting, where the SECOND branch also adopts the official SECOND pretrained weights provided by OpenPCDet. Thus, CLOCs is included as a reference under the same decision-level fusion paradigm. The results are shown in Table 3 and Table 4. Overall, UDF-3D (PA+RD) achieves the highest Moderate mAP in both 3D and BEV detection, indicating that the proposed method has competitive overall performance.
It should be noted that some existing methods still show strong performance for the Car category. For example, in BEV detection, CLOCs obtains slightly higher AP than UDF-3D for Car under the Easy and Moderate settings. This suggests that existing decision-level fusion methods can still achieve good detection performance for large objects with relatively sufficient LiDAR observations. However, UDF-3D shows clearer advantages for the Pedestrian and Cyclist categories. These two categories usually have sparse point-cloud observations, smaller object scales, and stronger cross-modal complementarity, and therefore rely more on the effective cooperation between camera semantic information and LiDAR geometric information. These results indicate that the proposed method is more adaptable to challenging object categories.
In terms of overall Moderate mAP, UDF-3D outperforms CLOCs in both 3D and BEV detection, indicating that the proposed method does not merely achieve local improvements in a single category but provides more stable overall detection performance. Unlike CLOCs, UDF-3D does not require an additional trainable fusion network for candidate box pairs. Instead, it performs decision-level fusion through uncertainty modeling, candidate association, and uncertainty-discounted fusion. Therefore, the proposed framework can be conveniently incorporated into different combinations of 2D and 3D detectors, reducing the dependence on a specific fusion network structure and retraining process. Compared with other existing fusion methods, such as EPNet++ and CAT-Det, UDF-3D also achieves better overall Moderate mAP, further demonstrating that the proposed method improves overall detection performance while preserving the flexibility of decision-level fusion.
Although UDF-3D improves the overall detection performance, some challenging cases still deserve further discussion. As shown in Figure 7, distant objects with severe occlusion or truncation usually correspond to relatively sparse LiDAR observations. In such cases, the 2D detector may also fail to provide sufficient image cues, while the LiDAR branch provides weak geometric information, which can weaken the fusion effect of UDF-3D to some extent. Since the frustum re-evaluation module still relies on valid 2D observations and pre-NMS 3D candidates, its ability to further recover such objects may be limited when both cues are insufficient.
In summary, the proposed Dempster–Shafer evidence-theory-based fusion framework explicitly models and fuses the uncertainties of LiDAR and image detections, thereby significantly improving the overall performance of multi-modal 3D object detection. The experiments show that our method not only significantly surpasses the LiDAR-only baselines, but also achieves superior performance over the representative decision-level fusion method CLOCs under the same fusion paradigm, providing an effective solution for fusion perception in autonomous driving.

4.4. Ablation Study

To evaluate the contribution of each module to the proposed fusion framework, we conducted an ablation study on the KITTI validation set. PointPillars+YOLOv8 is used as the representative detector combination. Table 5 reports the class-averaged 3D $\mathrm{AP}_{R40}$ and BEV $\mathrm{AP}_{R40}$ over Car, Pedestrian, and Cyclist under the Easy, Moderate, and Hard difficulty levels. The first row denotes the PointPillars single-modal baseline without any fusion module. The remaining rows show the results obtained by adding different combinations of the proposed modules based on the LiDAR and RGB branches.
The ablation results indicate that U-weighted Matching effectively improves the detection performance. This module establishes cross-modal associations between camera boxes and LiDAR boxes. It uses the richer semantic information from the RGB branch to constrain LiDAR candidates. In this way, false positives without image semantic support can be reduced. Compared with the LiDAR-only baseline, the overall AP is improved after this module is introduced. This indicates that the RGB branch can provide useful semantic information for LiDAR detection. The frustum re-evaluation module is mainly used to recover objects missed by the LiDAR branch. Pedestrian and Cyclist objects, as well as distant objects, usually have sparse LiDAR points. Therefore, LiDAR detectors may assign low confidence scores to them, or they may be suppressed during NMS. This module constructs frustums from unmatched camera detections. It then searches for potential 3D boxes from the pre-NMS LiDAR candidate set. Therefore, it can help recover some missed objects. The uncertainty-discounted fusion module further improves the reliability of the fusion decision. Simple score fusion can be affected by high-confidence but unreliable predictions. In contrast, the proposed uncertainty-discounted fusion module is based on Dempster–Shafer evidence theory. It adjusts the evidence according to uncertainty and cross-modal conflict. When the two modalities are consistent and have low uncertainty, the belief of the correct class is strengthened. When one modality has high uncertainty or semantic conflict, its influence is suppressed. Therefore, this module can reduce errors caused by conflicting or weak evidence. It also makes the fusion results more stable and contributes more to the improvement of accuracy.
To further illustrate the effect of uncertainty on detection accuracy, we evaluated two variants within the proposed framework: one that does not model object uncertainty (N-U) and one that adopts the proposed uncertainty-based fusion method (U). Figure 8 compares the two settings using AP curves, where the x-axis represents the 3D IoU threshold used for evaluation and the y-axis represents the corresponding 3D $\mathrm{AP}_{R40}$. The two variants are not entirely identical in their matching and fusion formulas, because uncertainty information primarily serves the Dempster–Shafer fusion rule; however, they follow the same candidate association process: candidate filtering, similarity construction from geometric and semantic consistency, and one-to-one assignment using the Hungarian algorithm.
The AP curves indicate that, for most IoU thresholds and difficulty levels, U outperforms N-U in the detection of Car, Pedestrian, and Cyclist, and that this advantage becomes more pronounced as the evaluation IoU threshold increases. Uncertainty-aware fusion therefore not only enhances detection performance under lenient evaluation thresholds but also maintains superior performance under stricter localization requirements. For Car, the LiDAR baseline already achieves relatively high detection accuracy, so the gap between U and N-U is limited at low IoU thresholds. At higher IoU thresholds, however, inaccurate matches and localization errors are penalized more severely, and U suppresses unreliable fusion results more effectively, which yields a larger performance advantage. For Pedestrian and Cyclist, U's contribution is more consistent. These two categories typically feature smaller object sizes and sparser point-cloud observations and are more susceptible to occlusion, distance, and pose variations. Consequently, single-modal detectors are more likely to produce uncertain or erroneous predictions. Cameras provide textural and semantic information, while LiDAR provides spatial and geometric information. Uncertainty-aware fusion amplifies reliable evidence when both modalities are trustworthy and suppresses unreliable evidence when one modality is uncertain. As a result, U shows larger improvements over N-U for Pedestrian and Cyclist, which further verifies the effectiveness of explicitly modeling uncertainty in decision-level fusion.
Overall, the combination of the three modules significantly improves detection performance. U-weighted Matching mainly improves cross-modal association and reduces unreliable matches. The frustum re-evaluation module mainly recovers missed detections from the pre-NMS LiDAR candidate set. The uncertainty-discounted fusion module mainly suppresses unreliable and conflicting detection results during the final decision fusion stage. The AP curves further show that the uncertainty-driven fusion method can maintain better performance under different IoU thresholds. When all three modules are enabled, the proposed method achieves the best performance. This verifies the effectiveness of the proposed uncertainty-driven decision-level fusion framework.

4.5. Sensitivity Analysis of the Decay Factor

To evaluate the sensitivity of the proposed method to the decay factor $\gamma$ in Equation (16), we conducted experiments using PointPillars+YOLOv8 on the KITTI validation set. As shown in Table 6, the detection performance first improves and then slightly decreases as $\gamma$ increases. The best results are obtained when $\gamma = 2.5$. A small $\gamma$ provides insufficient distance attenuation, whereas an excessively large $\gamma$ may over-reduce the contribution of geometric consistency for distant objects. The relatively small performance variation across the tested settings indicates that the proposed method is not highly sensitive to $\gamma$. Therefore, $\gamma = 2.5$ is adopted as the default setting in all experiments.

4.6. Computational Complexity

To evaluate the efficiency and deployment potential of the proposed framework, we further report the computational cost of UDF-3D. The measurements of our method were conducted on 200 samples from the KITTI validation set under Ubuntu 20.04 with an NVIDIA RTX 3060 GPU, PyTorch 2.1, and CUDA 11.8. As shown in Table 7, the additional costs introduced by uncertainty quantification and uncertainty-discounted fusion are negligible, while the frustum re-evaluation module introduces only a limited overhead for recovering suppressed candidates. Since UDF-3D performs fusion at the decision level, the LiDAR and RGB detection branches can run in parallel. Therefore, the final latency is computed as the latency of the slower detection branch plus the time required for the subsequent uncertainty quantification, frustum re-evaluation, and uncertainty-discounted fusion. For UDF-3D, the values are reported in the format YOLOv8/RT-DETR; specifically, 35.76 ms corresponds to the YOLOv8 setting, whereas 75.56 ms corresponds to the RT-DETR setting.
The results show that UDF-3D maintains a lightweight and deployment-friendly fusion pipeline. This is mainly because it avoids dense feature-level interaction between image and point-cloud backbones and only performs matching and fusion on detection candidates. Although the RT-DETR-based setting increases the computation of the RGB branch, the overall latency remains moderate compared with the listed multi-modal fusion methods. The uncertainty quantification and uncertainty-discounted fusion modules together take less than 1 ms, indicating that the performance gain is mainly obtained from more reliable matching, conflict discounting, and frustum re-evaluation rather than from heavy computational modules.
The GPU memory results further show that the additional memory overhead of UDF-3D is limited under the current setting. Since the proposed framework does not maintain additional dense cross-modal feature maps, its extra memory mainly comes from candidate-level data structures, including the matching matrix, frustum candidates, and belief/uncertainty vectors. Therefore, the memory cost is more dependent on the number of retained candidates than on the spatial size of image, voxel, or BEV feature maps. For parallel execution, although the LiDAR and RGB branches can be executed independently, the decision-level fusion stage requires the outputs from both branches. As a result, the overall parallel efficiency is mainly bounded by the slower detection branch before the subsequent fusion process can be performed.
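The latency accounting used in this subsection can be written as a one-line model. The sketch below is only an illustration of that accounting: the function and parameter names are hypothetical, and any numbers passed to it are placeholders rather than measurements from Table 7.

```python
def pipeline_latency_ms(lidar_ms: float, rgb_ms: float, uq_ms: float,
                        frustum_ms: float, fusion_ms: float) -> float:
    """End-to-end latency when the two detection branches run in parallel.

    The branches overlap, so only the slower one counts; the decision-level
    stages (uncertainty quantification, frustum re-evaluation, and
    uncertainty-discounted fusion) run afterwards and add up sequentially.
    """
    return max(lidar_ms, rgb_ms) + uq_ms + frustum_ms + fusion_ms
```

Under this model, replacing a fast 2D detector with a slower one changes the total only once the RGB branch becomes the slower of the two branches.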

5. Conclusions

In this paper, we propose UDF-3D, an uncertainty-driven decision-level fusion framework for camera–LiDAR 3D object detection. The proposed framework represents detector outputs as subjective-logic opinions, where class-wise belief masses and object-level uncertainty are explicitly quantified from detector outputs. Based on the quantified uncertainty, an uncertainty-weighted matching strategy is developed to improve cross-modal object association by jointly considering geometric overlap and semantic consistency. For unmatched detections, a frustum re-evaluation mechanism is designed to recover valid 3D candidates that may be suppressed by NMS. For matched detections, an uncertainty-discounted Dempster–Shafer fusion rule is introduced to reduce the influence of conflicting evidence from different modalities on the fusion results. Ablation results further show that the three modules contribute to the framework from different aspects: uncertainty-weighted matching improves cross-modal association, frustum re-evaluation reduces missed detections, and uncertainty-discounted fusion reduces the influence of unreliable and conflicting detection results on the final decision. Experimental results on the KITTI validation set show that UDF-3D consistently improves different LiDAR-only baselines and achieves competitive performance compared with existing fusion methods. The improvements are especially evident for Pedestrian and Cyclist, indicating that the proposed framework can effectively exploit camera–LiDAR complementarity for challenging objects with sparse point-cloud observations. In addition, the computational complexity analysis shows that the proposed fusion modules introduce only limited additional overhead, which preserves the practical efficiency and real-time potential of the decision-level fusion framework. 
In future work, we will further validate the proposed framework on larger-scale and more complex datasets, such as nuScenes and Waymo, to more comprehensively evaluate its generalizability under different sensor configurations, data formats, and evaluation protocols.

Author Contributions

Conceptualization, C.H. and C.D.; methodology, C.D.; software, C.D.; validation, C.D. and C.H.; formal analysis, C.D.; investigation, C.D.; resources, C.H. and Y.L.; data curation, C.D.; writing—original draft preparation, C.D.; writing—review and editing, C.H. and Y.L.; visualization, C.D.; supervision, C.H. and Y.L.; project administration, C.H. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (NSFC) under Grant No. 62403295.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The KITTI dataset used in this study is publicly available. The generated experimental results and implementation details are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, L.; Zhang, X.; Song, Z.; Bi, J.; Zhang, G.; Wei, H.; Tang, L.; Yang, L.; Li, J.; Jia, C.; et al. Multi-modal 3D object detection in autonomous driving: A survey and taxonomy. IEEE Trans. Intell. Veh. 2023, 8, 3781–3798.
  2. Qian, R.; Lai, X.; Li, X. 3D object detection for autonomous driving: A survey. arXiv 2021, arXiv:2106.10823.
  3. Yan, Y.; Mao, Y.; Li, B. SECOND: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337.
  4. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 12689–12697.
  5. Shi, S.; Wang, Z.; Shi, J.; Wang, X.; Li, H. From points to parts: 3D object detection from point cloud with part-aware and part-aggregation network. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 2647–2664.
  6. Shi, S.; Wang, X.; Li, H. PointRCNN: 3D object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 770–779.
  7. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. PV-RCNN: Point-voxel feature set abstraction for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 10526–10535.
  8. Yin, T.; Zhou, X.; Krähenbühl, P. Center-based 3D object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2021; pp. 11779–11788.
  9. Wu, Y.; Liu, J.; Gong, M.; Miao, Q.; Ma, W.; Xu, C. Joint semantic segmentation using representations of LiDAR point clouds and camera images. Inf. Fusion 2024, 108, 102370.
  10. Yu, K.; Tao, T.; Xie, H.; Lin, Z.; Liang, T.; Wang, B.; Chen, P.; Hao, D.; Wang, Y.; Liang, X. Benchmarking the robustness of LiDAR-camera fusion for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 3188–3198.
  11. Smets, P. The combination of evidence in the transferable belief model. IEEE Trans. Pattern Anal. Mach. Intell. 1990, 12, 447–458.
  12. Xu, D.; Anguelov, D.; Jain, A. PointFusion: Deep sensor fusion for 3D bounding box estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2018; pp. 244–253.
  13. Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. PointPainting: Sequential fusion for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–18 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 4603–4611.
  14. Wen, L.; Jo, K.-H. Fast and accurate 3D object detection for LiDAR-camera-based autonomous vehicles using one shared voxel-based backbone. IEEE Access 2021, 9, 22080–22089.
  15. Wang, C.; Ma, C.; Zhu, M.; Yang, X. PointAugmenting: Cross-modal augmentation for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 11789–11798.
  16. Xu, S.; Zhou, D.; Fang, J.; Yin, J.; Bin, Z.; Zhang, L. FusionPainting: Multimodal fusion with adaptive attention for 3D object detection. In Proceedings of the IEEE Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 3047–3054.
  17. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2017; pp. 1907–1915.
  18. Liu, Z.; Huang, T.; Li, B.; Chen, X.; Wang, X.; Bai, X. EPNet++: Cascade bi-directional fusion for multi-modal 3D object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 8324–8341.
  19. Li, Y.; Yu, A.W.; Meng, T.; Caine, B.; Ngiam, J.; Peng, D.; Shen, J.; Lu, Y.; Zhou, D.; Le, Q.V.; et al. DeepFusion: LiDAR-camera deep fusion for multi-modal 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 17161–17170.
  20. Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.L.; Han, S. BEVFusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 2774–2781.
  21. Chen, M.; Liu, P.; Zhao, H. LiDAR-camera fusion: Dual transformer enhancement for 3D object detection. Eng. Appl. Artif. Intell. 2023, 120, 105815.
  22. Zhang, Y.; Liu, K.; Bao, H.; Qian, X.; Wang, Z.; Ye, S.; Wang, W. AFTR: A robustness multi-sensor fusion model for 3D object detection based on adaptive fusion transformer. Sensors 2023, 23, 8400.
  23. Xu, J.; Song, C.; Shi, C.; Liu, H.; Wang, Q. UncertainBEV: Uncertainty-aware BEV fusion for roadside 3D object detection. Image Vis. Comput. 2025, 159, 105567.
  24. Pang, S.; Morris, D.; Radha, H. CLOCs: Camera-LiDAR object candidates fusion for 3D object detection. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 10386–10393.
  25. Pang, S.; Morris, D.; Radha, H. Fast-CLOCs: Fast camera-LiDAR object candidates fusion for 3D object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 3747–3756.
  26. Shen, B.; Dai, S.; Chen, Y.; Xiong, R.; Wang, Y.; Jiao, Y. GOOD: General optimization-based fusion for 3D object detection via LiDAR-camera object candidates. arXiv 2023, arXiv:2303.09800.
  27. Sensoy, M.; Kaplan, L.; Kandemir, M. Evidential deep learning to quantify classification uncertainty. In Advances in Neural Information Processing Systems (NeurIPS); Curran Associates, Inc.: Red Hook, NY, USA, 2018; Volume 31.
  28. Nallapareddy, M.R.; Sirohi, K.; Drews, P.L.J.; Burgard, W.; Cheng, C.-H.; Valada, A. EvCenterNet: Uncertainty estimation for object detection using evidential learning. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 5699–5706.
  29. Jøsang, A. Subjective Logic: A Formalism for Reasoning Under Uncertainty; Springer: Cham, Switzerland, 2016.
  30. Dempster, A.P. Upper and lower probabilities induced by a multivalued mapping. Ann. Math. Stat. 1967, 38, 325–339.
  31. Mercier, D.; Lefèvre, E.; Delmotte, F. Belief functions contextual discounting and canonical decompositions. Int. J. Approx. Reason. 2012, 53, 146–158.
  32. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 3354–3361.
  33. Varghese, R.; Sambath, M. YOLOv8: A novel object detection algorithm with enhanced performance and robustness. In Proceedings of the International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 17–18 April 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6.
  34. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 16965–16974.
  35. Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum PointNets for 3D object detection from RGB-D data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 918–927.
  36. Paigwar, A.; Sierra-Gonzalez, D.; Erkent, Ö.; Laugier, C. Frustum-PointPillars: A multi-stage approach for 3D object detection using RGB camera and LiDAR. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, QC, Canada, 11–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 2926–2933.
  37. Yoo, J.H.; Kim, Y.; Kim, J.; Choi, J.W. 3D-CVF: Generating joint camera and LiDAR features using cross-view spatial feature fusion for 3D object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 720–736.
  38. Zhang, Y.; Chen, J.; Huang, D. CAT-Det: Contrastively augmented transformer for multi-modal 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 908–917.
Figure 1. Overview of UDF-3D, the proposed uncertainty-driven decision-level fusion framework for camera–LiDAR 3D object detection.
Figure 2. Construction principle of the view frustum from a 2D detection box.
Figure 3. Visualization of the pre-NMS candidate boxes that fall within the view frustum.
Figure 4. Complete workflow of the proposed frustum re-evaluation module for recalling and re-associating suppressed 3D candidates.
Figure 5. Illustration of uncertainty-discounted fusion for correcting category conflicts between camera and LiDAR detections.
Figure 6. Comparison visualization between the baseline LiDAR detector (SECOND) and the proposed decision-level fusion method on the validation set. Red boxes indicate false positives, blue boxes indicate missed detections, and both types of errors are corrected by the proposed fusion method.
Figure 7. Representative challenging case of the proposed UDF-3D framework. The left side shows the detection result of the LiDAR baseline, and the right side shows the detection result of UDF-3D.
Figure 8. KITTI validation 3D AP_R40 curves of PointPillars+YOLOv8 under different 3D IoU evaluation thresholds, comparing uncertainty-aware fusion (U) with non-uncertainty fusion (N-U).
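The uncertainty-discounted fusion illustrated in Figure 5 builds on subjective opinions (per-class belief masses plus one uncertainty mass) combined within the Dempster–Shafer framework. The following is a minimal Python sketch of that general idea, not the discounting factor proposed in this article: it uses classical Shafer discounting and Dempster's rule restricted to single-class beliefs plus an uncertainty mass. The function names, the class order (Car, Pedestrian, Cyclist), the example opinions, and the reliability value `alpha=0.5` are all hypothetical choices made for illustration.

```python
def discount(beliefs, alpha):
    """Classical Shafer discounting (assumed here, not the paper's factor).

    Keeps a fraction `alpha` of every belief mass and moves the remainder
    to the uncertainty mass. Assumes the input opinion is normalized.
    """
    scaled = [alpha * b for b in beliefs]
    return scaled, 1.0 - sum(scaled)


def combine(b1, u1, b2, u2):
    """Dempster's rule for two opinions whose only focal sets are the
    single classes and the whole frame (the uncertainty mass)."""
    agreement = sum(x * y for x, y in zip(b1, b2))
    # Mass assigned to contradictory class pairs (class i vs. class j, i != j).
    conflict = sum(b1) * sum(b2) - agreement
    if conflict >= 1.0:
        raise ValueError("total conflict: the opinions cannot be combined")
    norm = 1.0 - conflict
    fused = [(x * y + x * u2 + y * u1) / norm for x, y in zip(b1, b2)]
    return fused, u1 * u2 / norm


# Hypothetical opinions over (Car, Pedestrian, Cyclist) that disagree on the class.
camera_b, camera_u = [0.05, 0.80, 0.05], 0.10
lidar_b, lidar_u = [0.60, 0.10, 0.10], 0.20

# Treat the LiDAR class opinion as less reliable before combining.
lidar_b_disc, lidar_u_disc = discount(lidar_b, alpha=0.5)
fused_b, fused_u = combine(camera_b, camera_u, lidar_b_disc, lidar_u_disc)
print(fused_b, fused_u)
```

Discounting one source before combination shifts part of its belief into uncertainty, so a confident but conflicting class vote carries less weight in the fused opinion. Which source to discount, and by how much, is exactly what an uncertainty-driven factor would decide; the fixed value above is only a stand-in.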
Table 1. Comparison of 3D object detection between baseline and fusion methods on the KITTI validation set (3D AP_R40).

| Method | Modality | mAP (Mod.) | Car Easy | Car Mod. | Car Hard | Pedestrian Easy | Pedestrian Mod. | Pedestrian Hard | Cyclist Easy | Cyclist Mod. | Cyclist Hard |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SECOND (Baseline) [3] | L | 66.50 | 90.55 | 81.61 | 78.60 | 55.94 | 51.15 | 46.17 | 80.96 | 66.74 | 62.78 |
| SECOND + YOLOv8 | L & I | 71.87 | 92.15 | 83.32 | 80.41 | 66.42 | 60.19 | 55.65 | 87.14 | 72.11 | 65.49 |
| SECOND + RT-DETR | L & I | 72.29 | 91.59 | 83.06 | 80.17 | 65.20 | 61.16 | 55.18 | 82.66 | 72.64 | 68.33 |
| PointPillars (Baseline) [4] | L | 63.93 | 87.76 | 77.40 | 75.19 | 57.31 | 51.45 | 46.87 | 81.57 | 62.94 | 58.96 |
| PointPillars + YOLOv8 | L & I | 71.11 | 88.83 | 80.01 | 77.11 | 67.36 | 61.20 | 56.73 | 86.15 | 72.11 | 67.63 |
| PointPillars + RT-DETR | L & I | 71.53 | 88.00 | 79.64 | 76.70 | 66.20 | 62.28 | 58.06 | 86.21 | 72.68 | 68.24 |
| PartA2 (Baseline) [5] | L | 70.89 | 92.45 | 82.88 | 80.64 | 66.84 | 59.69 | 54.58 | 90.35 | 70.10 | 66.97 |
| PartA2 + YOLOv8 | L & I | 75.75 | 93.17 | 83.81 | 82.63 | 77.07 | 70.31 | 63.94 | 90.90 | 73.14 | 70.14 |
| PartA2 + RT-DETR | L & I | 76.69 | 92.06 | 83.07 | 82.62 | 76.34 | 70.45 | 65.69 | 91.92 | 76.55 | 72.32 |
Note: Green values indicate the highest relative to the baseline, and blue values indicate the overall best result.
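The mAP column of Table 1 is labeled "Mod." and its values are consistent, to two decimals, with the plain average of the Moderate AP over Car, Pedestrian, and Cyclist. The short Python check below illustrates this reading for three rows; the averaging rule is inferred from the numbers rather than stated in the table, and the names `moderate_ap`, `reported_map`, and `class_mean` are introduced here only for illustration.

```python
from statistics import mean

# Moderate-difficulty 3D AP_R40 copied from Table 1: (Car, Pedestrian, Cyclist).
moderate_ap = {
    "SECOND (Baseline)": (81.61, 51.15, 66.74),
    "SECOND + YOLOv8": (83.32, 60.19, 72.11),
    "PartA2 + RT-DETR": (83.07, 70.45, 76.55),
}

# mAP values as reported in the same rows of Table 1.
reported_map = {
    "SECOND (Baseline)": 66.50,
    "SECOND + YOLOv8": 71.87,
    "PartA2 + RT-DETR": 76.69,
}


def class_mean(aps):
    """Unweighted mean over the three classes (assumed definition of mAP)."""
    return mean(aps)


for method, aps in moderate_ap.items():
    gap = abs(class_mean(aps) - reported_map[method])
    print(f"{method}: computed {class_mean(aps):.2f}, reported {reported_map[method]:.2f}, gap {gap:.3f}")
```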
Table 2. Comparison of BEV object detection between baseline and fusion methods on the KITTI validation set (BEV AP_R40).

| Method | Modality | mAP (Mod.) | Car Easy | Car Mod. | Car Hard | Pedestrian Easy | Pedestrian Mod. | Pedestrian Hard | Cyclist Easy | Cyclist Mod. | Cyclist Hard |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SECOND (Baseline) | L | 71.42 | 92.42 | 88.55 | 87.65 | 60.74 | 56.57 | 52.13 | 86.04 | 69.15 | 66.90 |
| SECOND + YOLOv8 | L & I | 78.19 | 93.41 | 90.04 | 89.68 | 72.01 | 65.91 | 61.49 | 89.78 | 78.61 | 74.04 |
| SECOND + RT-DETR | L & I | 78.93 | 93.22 | 90.16 | 89.88 | 71.34 | 67.44 | 63.28 | 88.19 | 79.18 | 74.66 |
| PointPillars (Baseline) | L | 70.14 | 92.04 | 88.06 | 86.67 | 61.58 | 56.01 | 52.01 | 85.27 | 66.34 | 62.35 |
| PointPillars + YOLOv8 | L & I | 78.23 | 93.34 | 90.05 | 87.44 | 73.69 | 67.44 | 63.02 | 91.46 | 77.21 | 72.52 |
| PointPillars + RT-DETR | L & I | 78.93 | 93.01 | 89.99 | 87.40 | 72.91 | 68.94 | 64.61 | 92.03 | 77.87 | 75.57 |
| PartA2 (Baseline) | L | 76.05 | 93.55 | 89.38 | 87.13 | 70.50 | 64.10 | 59.17 | 91.93 | 74.66 | 70.61 |
| PartA2 + YOLOv8 | L & I | 81.39 | 94.76 | 91.12 | 89.14 | 81.39 | 74.60 | 69.60 | 92.45 | 78.45 | 74.00 |
| PartA2 + RT-DETR | L & I | 82.87 | 94.88 | 91.55 | 89.74 | 80.64 | 76.05 | 71.34 | 94.62 | 81.02 | 76.74 |
Note: Green values indicate the highest relative to the baseline, and blue values indicate the overall best result.
Table 3. Comparison of 3D object detection between existing fusion methods and the proposed fusion methods on the KITTI validation set (3D AP_R40). The abbreviations correspond to SECOND (SD), YOLOv8 (YL), PartA2 (PA), and RT-DETR (RD).

| Method | Modality | mAP (Mod.) | Car Easy | Car Mod. | Car Hard | Pedestrian Easy | Pedestrian Mod. | Pedestrian Hard | Cyclist Easy | Cyclist Mod. | Cyclist Hard |
|---|---|---|---|---|---|---|---|---|---|---|---|
| F-PointNet [35] | L & I | 62.91 | 83.76 | 70.92 | 63.65 | 70.00 | 61.32 | 53.59 | 77.15 | 56.49 | 53.37 |
| F-PointPillars [36] | L & I | 71.32 | 88.90 | 79.28 | 78.07 | 66.11 | 61.89 | 56.91 | 87.54 | 72.78 | 66.07 |
| PointFusion [12] | L & I | 40.15 | 77.92 | 63.00 | 53.27 | 33.36 | 28.04 | 23.38 | 49.34 | 29.42 | 26.98 |
| PointPainting [13] | L & I | 70.34 | 88.38 | 77.74 | 76.76 | 69.38 | 61.67 | 54.58 | 85.21 | 71.62 | 66.98 |
| EPNet++ [18] | L & I | 70.80 | 92.51 | 83.17 | 82.27 | 73.77 | 65.42 | 59.13 | 86.23 | 63.82 | 60.02 |
| 3D-CVF [37] | L & I | - | 89.67 | 79.88 | 78.47 | - | - | - | - | - | - |
| CAT-Det [38] | L & I | 73.54 | 90.12 | 81.46 | 79.15 | 74.08 | 66.35 | 58.92 | 87.64 | 72.82 | 68.20 |
| CLOCs [24] | L & I | 69.45 | 92.89 | 83.07 | 78.65 | 64.68 | 57.37 | 51.19 | 87.57 | 67.92 | 63.67 |
| UDF-3D (SD+YL) | L & I | 71.87 | 92.15 | 83.32 | 80.41 | 66.42 | 60.19 | 55.65 | 87.14 | 72.11 | 65.49 |
| UDF-3D (PA+YL) | L & I | 75.75 | 93.17 | 83.81 | 82.63 | 77.07 | 70.31 | 63.94 | 90.90 | 73.14 | 70.14 |
| UDF-3D (PA+RD) | L & I | 76.69 | 92.06 | 83.07 | 82.62 | 76.34 | 70.45 | 65.69 | 91.92 | 76.55 | 72.32 |
Note: Blue values indicate the overall best result.
Table 4. Comparison of BEV object detection between existing fusion methods and the proposed fusion methods on the KITTI validation set (BEV AP_R40). The abbreviations correspond to SECOND (SD), YOLOv8 (YL), PartA2 (PA), and RT-DETR (RD).

| Method | Modality | mAP (Mod.) | Car Easy | Car Mod. | Car Hard | Pedestrian Easy | Pedestrian Mod. | Pedestrian Hard | Cyclist Easy | Cyclist Mod. | Cyclist Hard |
|---|---|---|---|---|---|---|---|---|---|---|---|
| F-PointNet [35] | L & I | 70.15 | 88.16 | 84.02 | 76.44 | 72.38 | 66.39 | 59.57 | 81.82 | 60.03 | 56.32 |
| F-PointPillars [36] | L & I | 78.04 | 90.20 | 89.43 | 88.77 | 72.17 | 67.89 | 63.46 | 88.58 | 76.79 | 74.80 |
| PointFusion [12] | L & I | 47.08 | 87.45 | 76.13 | 65.32 | 37.91 | 32.35 | 27.35 | 54.02 | 32.77 | 30.19 |
| PointPainting [13] | L & I | 75.80 | 90.19 | 87.64 | 86.71 | 72.65 | 66.06 | 61.24 | 86.33 | 73.69 | 70.17 |
| EPNet++ [18] | L & I | 75.36 | 95.98 | 89.08 | 88.86 | 78.23 | 72.09 | 66.17 | 86.25 | 64.91 | 61.30 |
| 3D-CVF [37] | L & I | - | 93.52 | 89.56 | 82.45 | - | - | - | - | - | - |
| CAT-Det [38] | L & I | 70.45 | 92.59 | 90.07 | 85.82 | 57.13 | 48.78 | 45.56 | 85.35 | 72.51 | 65.55 |
| CLOCs [24] | L & I | 75.65 | 96.18 | 92.53 | 89.43 | 69.26 | 63.13 | 56.79 | 91.35 | 71.30 | 67.65 |
| UDF-3D (SD+YL) | L & I | 78.19 | 93.41 | 90.04 | 89.68 | 72.01 | 65.91 | 61.49 | 89.78 | 78.61 | 74.04 |
| UDF-3D (PA+YL) | L & I | 81.39 | 94.76 | 91.12 | 89.14 | 81.39 | 74.60 | 69.60 | 92.45 | 78.45 | 74.00 |
| UDF-3D (PA+RD) | L & I | 82.87 | 94.88 | 91.55 | 89.74 | 80.64 | 76.05 | 71.34 | 94.62 | 81.02 | 76.74 |
Note: Blue values indicate the overall best result.
Table 5. Ablation study of the proposed fusion modules using PointPillars+YOLOv8 on the KITTI validation set.

| U-Weighted Matching | Discounted Fusion | Frustum Re-Evaluation | 3D AP_R40 Easy | 3D AP_R40 Mod. | 3D AP_R40 Hard | BEV AP_R40 Easy | BEV AP_R40 Mod. | BEV AP_R40 Hard |
|---|---|---|---|---|---|---|---|---|
|  |  |  | 75.55 | 63.93 | 60.34 | 79.63 | 70.14 | 67.01 |
|  |  |  | 76.99 | 66.18 | 62.50 | 82.01 | 73.65 | 70.18 |
|  |  |  | 78.12 | 68.54 | 65.18 | 83.67 | 75.95 | 73.03 |
|  |  |  | 80.42 | 70.72 | 66.74 | 85.86 | 77.86 | 74.02 |
|  |  |  | 78.55 | 69.20 | 65.36 | 84.28 | 76.54 | 73.68 |
|  |  |  | 80.78 | 71.11 | 67.16 | 86.16 | 78.23 | 74.33 |
Table 6. Sensitivity analysis of the decay factor γ using PointPillars+YOLOv8 on the KITTI validation set. The reported values are class-averaged 3D AP_R40 over Car, Pedestrian, and Cyclist.

| γ | 3D Easy | 3D Mod. | 3D Hard |
|---|---|---|---|
| 1.0 | 80.42 | 70.73 | 66.84 |
| 1.5 | 80.61 | 70.96 | 67.02 |
| 2.5 | 80.78 | 71.11 | 67.16 |
| 3.5 | 80.66 | 71.03 | 67.08 |
| 4.0 | 80.51 | 70.88 | 66.95 |
Table 7. Computational complexity comparison of UDF-3D and representative fusion methods on the KITTI validation set.

| Group | Method/Module | Speed (ms) | GFLOPS | GPU Memory (MB) | FPS |
|---|---|---|---|---|---|
| UDF-3D modules | LiDAR branch (PointPillars) | 24.44 | 62.82 | 248.07 |  |
| UDF-3D modules | RGB branch (YOLOv8/RT-DETR) | 19.30/64.24 | 79.07/251.75 | 150.90/203.40 |  |
| UDF-3D modules | Uncertainty Quantification | 0.59 | 0.00 | 0.00 |  |
| UDF-3D modules | Frustum Re-evaluation | 10.57 | 0.00 | 29.42 |  |
| UDF-3D modules | Uncertainty-Discounted Fusion | 0.16 | 0.00 | 0.00 |  |
| End-to-end | UDF-3D | 35.76/75.56 | 141.89/314.57 | 428.39/480.89 | 27.96/13.23 |
| End-to-end | EPNet++ | 169.79 | 357.85 | 5059.64 | 5.89 |
| End-to-end | PointPainting | 271.77 | 1782.02 | 742.06 | 3.68 |
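The end-to-end UDF-3D figures in Table 7 can be reproduced from the per-module rows under one assumption that the table itself does not state: the LiDAR and RGB branches run in parallel, so the slower branch sets the detector latency, and the three fusion stages then run sequentially. The Python sketch below checks that reading against the reported latency and FPS; the variable and function names are hypothetical.

```python
# Per-module latencies in milliseconds, copied from Table 7.
lidar_ms = 24.44                               # PointPillars branch
rgb_ms = {"YOLOv8": 19.30, "RT-DETR": 64.24}   # RGB branch, per 2D detector
fusion_ms = 0.59 + 10.57 + 0.16                # quantification + frustum + fusion


def end_to_end_ms(rgb_detector):
    """Latency assuming parallel branches followed by sequential fusion stages."""
    return max(lidar_ms, rgb_ms[rgb_detector]) + fusion_ms


def fps(latency_ms):
    """Frames per second corresponding to a per-frame latency in milliseconds."""
    return 1000.0 / latency_ms


for detector in rgb_ms:
    total = end_to_end_ms(detector)
    print(f"{detector}: {total:.2f} ms, {fps(total):.2f} FPS")
```

Under this assumption the totals agree with the reported 35.76/75.56 ms and 27.96/13.23 FPS, whereas adding both branch latencies would give larger values. The GFLOPS and GPU memory columns, by contrast, match a plain sum over the modules.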
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Hu, C.; Di, C.; Liu, Y. UDF-3D: Uncertainty-Driven Decision-Level Fusion for Camera–LiDAR 3D Object Detection. Appl. Sci. 2026, 16, 5983. https://doi.org/10.3390/app16125983

AMA Style

Hu C, Di C, Liu Y. UDF-3D: Uncertainty-Driven Decision-Level Fusion for Camera–LiDAR 3D Object Detection. Applied Sciences. 2026; 16(12):5983. https://doi.org/10.3390/app16125983

Chicago/Turabian Style

Hu, Chongyang, Chuangye Di, and Yanwei Liu. 2026. "UDF-3D: Uncertainty-Driven Decision-Level Fusion for Camera–LiDAR 3D Object Detection" Applied Sciences 16, no. 12: 5983. https://doi.org/10.3390/app16125983

APA Style

Hu, C., Di, C., & Liu, Y. (2026). UDF-3D: Uncertainty-Driven Decision-Level Fusion for Camera–LiDAR 3D Object Detection. Applied Sciences, 16(12), 5983. https://doi.org/10.3390/app16125983
