Article

Boosting 3D Object Detection with Adversarial Adaptive Data Augmentation Strategy

Key Laboratory of Advanced Manufacturing Technology for Automotive Parts of Ministry of Education, School of Automotive Engineering, Chongqing University of Technology, Chongqing 401320, China
*
Author to whom correspondence should be addressed.
Sensors 2025, 25(11), 3493; https://doi.org/10.3390/s25113493
Submission received: 16 March 2025 / Revised: 18 May 2025 / Accepted: 19 May 2025 / Published: 31 May 2025

Abstract

In real-world applications, autonomous driving systems need to handle a variety of complex scenarios, such as object occlusion and lighting changes. In these scenarios, accurately identifying various objects is crucial for perceiving the surrounding environment and making reliable decisions. In this context, the fusion of Lidar and cameras is vital for the accuracy of object detection. To this end, we propose an adversarial adaptive data augmentation strategy that introduces virtual adversarial perturbations during the image feature extraction process, effectively enhancing the robustness of 3D object detection methods and enabling them to maintain stable performance when facing environmental changes and data perturbations. Experimental results on the nuScenes-mini and KITTI datasets show that, compared with previous 3D object detection methods, our method not only improves detection accuracy but also demonstrates stronger stability.

1. Introduction

Reliable 3D object detection plays a crucial role in autonomous driving, underpinning the safety and dependability of the self-driving system. Existing 3D object detection methods can be categorized into two groups: unimodal approaches and multimodal approaches. Unimodal approaches rely on a single type of sensor, which is easy to implement but suffers from inherent limitations. For example, while cameras excel at capturing rich visual details, they are susceptible to variations in ambient light and weather conditions. In contrast, Lidar provides precise three-dimensional information about objects and is more robust to illumination changes, but it lacks the ability to capture fine-grained appearance details. Unlike unimodal approaches, multimodal approaches integrate both sensors to exploit their complementary information, achieving higher accuracy and robustness. Consequently, these methods outperform unimodal ones and are widely studied in the field of 3D object detection.
In the field of multimodal 3D object detection, the key to success lies in the fusion of multimodal data. In recent years, the increasing diversity of sensor data has made the integration of complementary information from different modalities more important. Methods based on the Bird’s Eye View (BEV) [1] perspective, such as BEVDet [2] and BEVFusion [3], can effectively address issues of scale and occlusion caused by different viewpoints, enabling more efficient information aggregation and thereby enhancing the accuracy and reliability of 3D object detection. Therefore, we adopt the BEV method to transform data from different sensors into a unified top–down perspective. Despite the significant progress made by BEV methods, there are still performance bottlenecks when dealing with issues such as object occlusion.
Motivated by the great success of adversarial training, in this paper we propose an adaptive data augmentation method to improve the robustness of BEV-based methods. Traditional static perturbations (such as CutMix [4] and Mixup [5]) increase sample diversity by mixing image regions and labels, but they rely on simple linear interpolation or region mixing, which limits the robustness gains they can provide. In contrast, our method applies small perturbations to the input images along the direction that most disrupts the model's predictions. These perturbations are adjusted dynamically during training, changing their intensity and direction in real time as the model evolves. Such adaptive perturbations continuously expose the model's vulnerabilities and enable targeted training, effectively preventing premature convergence. In this way, the model learns more robust feature representations across different environments and data distributions and maintains consistent feature learning, allowing it to remain stable when confronted with complex changes in real-world application scenarios. Experimental results show that our method achieves improvements of 1.5% in mean Average Precision (mAP) and 0.6% in nuScenes Detection Score (NDS) over the baseline on the nuScenes-mini dataset, along with a 0.8% mAP improvement on the KITTI dataset, which validates its effectiveness.
Our main contributions are as follows:
  • We propose an adversarial adaptive data augmentation strategy (AADA) to improve the robustness of 3D object detection. To the best of our knowledge, this is the first attempt to employ virtual adversarial training for adaptive data augmentation in 3D object detection.
  • Extensive experiments show that our method significantly improves the performance of 3D object detection on the nuScenes-mini and KITTI datasets, demonstrating the effectiveness of our method.

2. Related Work

In this section, we first briefly review Lidar-based and camera-based 3D perception methods. Then, we present recent advances in multi-sensor fusion approaches.

2.1. Lidar-Based 3D Perception

Due to the disordered and uneven characteristics of point clouds, existing Lidar-based 3D object detection methods commonly project point clouds into Bird’s Eye View (BEV) or Range View (RV) representations to leverage well-studied 2D convolutional neural networks (CNNs). In early studies, Yang et al. [6] introduced an efficient anchor-free single-stage detector. Graham et al. [7] flattened point cloud features and performed detection in BEV space. Liang et al. [8] employed a 2D CNN to learn spatial features from the RV and then obtained 3D bounding boxes through a Region-CNN (R-CNN). VoxelNeXt [9] leveraged the sparsity of point clouds to directly extract sparse features and predict 3D detection boxes from them, without converting sparse features into dense feature maps; it relies entirely on sparse 3D CNNs, eliminating both the sparse-to-dense conversion step and NMS post-processing. Ref. [10] proposed a method that enhances the accuracy and efficiency of 3D object detection by shifting point cloud features between clusters. It performs well in complex scenarios such as occlusion and sparse point clouds, improving the model’s discriminative ability while reducing computational cost, and demonstrates strong potential for autonomous driving applications. SAFDNet [11] designed an adaptive feature diffusion strategy that addresses the common issue of center feature loss in sparse feature detectors and reduces the computational cost of long-range detection, improving the efficiency of high-performance 3D object detectors. Despite this progress, existing Lidar-based methods still suffer from limited accuracy for distant objects due to the loss of fine details.

2.2. Camera-Based 3D Perception

As 2D object detection methods have been widely investigated and continue to improve, image-based 3D object detection methods have been developed to achieve satisfactory performance at a relatively low cost. Chu et al. [12] first performed monocular depth estimation and lifted 2D pixels to pseudo-3D points, and then designed a novel neighbor-voting method that incorporates neighbor predictions to improve object detection from severely deformed pseudo-Lidar point clouds. Chen et al. [13] focused on generating 3D proposals by encoding object size priors, a ground-plane prior, and depth information into an energy function. Li et al. [14] introduced additional branches after the stereo Region Proposal Network (RPN) to predict sparse keypoints, viewpoints, and object dimensions, which are then used to obtain coarse 3D object bounding boxes. Guo et al. [15] leveraged high-level geometric representations from LiDAR point clouds to guide stereo image detection and introduced an auxiliary 2D detection head to provide direct 2D semantic supervision. MonoCD [16] boosted detection accuracy and robustness by exploiting the complementarity of global depth cues and geometric relationships, without enhancing the precision of individual detection branches. OPEN [17] incorporated object depth information through an object-wise position embedding, thereby improving 3D detection accuracy. Although camera-based 3D object detection methods achieve good results under favorable weather conditions, their accuracy drops severely under significant weather variations.

2.3. Multi-Sensor Fusion

To address the limitations of the aforementioned methods using only a single type of data, multi-sensor fusion has been studied to combine the best of both worlds. Early research [18] focused on sensor feature fusion. For example, a multi-camera-based joint 3D detection and segmentation method was developed using a unified BEV representation to fuse multi-camera data [19,20]. Then, subsequent research continued to improve multi-sensor fusion detection from different perspectives [21,22]. Specifically, a BEV-based multimodal fusion 3D detection method [23] was proposed to aggregate Lidar, camera, and millimeter-wave radar data with BEV for 3D object detection. Overall, it has been widely demonstrated that combining BEV with multi-sensor fusion 3D detection can improve detection accuracy and robustness.

3. Methodology

In this section, we propose an adversarial adaptive data augmentation strategy to enhance the robustness of multimodal 3D object detection methods. As shown in Figure 1, our network takes multiview images and Lidar point clouds as inputs. In the image feature extraction branch, the generalization ability of the model is enhanced by applying adaptive pixel-level perturbations to the images. Following this, specific encoders are employed to extract features from both images and point clouds. Then, these multimodal features are merged into a unified Bird’s Eye View (BEV) representation to reduce occlusion and eliminate viewpoint differences, facilitating the fusion process. Finally, a customized task head executes the object detection task to produce the final results.

3.1. Feature Extraction

3.1.1. Image Feature Encoding

The input image $I_{img}$ is first split into patches and fed to a patch embedding layer to obtain feature vectors. The resulting embeddings are then passed through a stack of Transformer layers whose self-attention mechanism captures long-range dependencies within the image. By aggregating the hierarchical features produced by the Transformer layers, multi-scale image information is fused, which strengthens the network's representational ability. In addition, a hierarchical window mechanism is employed so that features at different levels attend within windows of different sizes, which significantly improves the efficiency of the network in processing multi-scale information.
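To make the windowing step concrete, the following is a minimal PyTorch-style sketch of how a feature map might be partitioned into non-overlapping local windows before self-attention; the function name and tensor layout are illustrative assumptions, not the paper's implementation.

```python
import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into non-overlapping (window_size x window_size) windows.

    Self-attention is then computed independently inside each window, so its cost
    grows linearly with image size instead of quadratically.
    """
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return windows.view(-1, window_size * window_size, C)  # (num_windows * B, tokens, C)

# Example: a 64x64 feature map with 96 channels split into 8x8 windows.
feat = torch.randn(2, 64, 64, 96)
tokens = window_partition(feat, window_size=8)   # shape: (128, 64, 96)
```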

3.1.2. Point Cloud Feature Encoding

The input point cloud is first voxelized and then fed into the sparse point cloud encoder to obtain point cloud embeddings. Within the point cloud encoder, sparse 3D convolutional networks are employed for efficient point cloud feature extraction. Afterwards, the resultant sparse voxelized features are passed through the convolutional and encoder layers to generate the final features.
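As a rough illustration of the voxelization step, the sketch below groups raw points into voxels by quantizing their coordinates and averaging point features per voxel; the mean pooling, voxel size, point-cloud range, and function name are our own illustrative assumptions rather than the paper's actual configuration.

```python
import torch

def voxelize(points: torch.Tensor, voxel_size: float, pc_range: torch.Tensor):
    """Quantize (N, D) points into voxels and mean-pool the per-point features.

    points:    (N, D) tensor, first three columns are x, y, z.
    pc_range:  (3,) tensor giving the minimum x, y, z of the detection range.
    Returns (voxel_coords, voxel_features) suitable for a sparse 3D convolution backbone.
    """
    coords = torch.floor((points[:, :3] - pc_range) / voxel_size).long()   # (N, 3) integer voxel indices
    unique_coords, inverse = torch.unique(coords, dim=0, return_inverse=True)

    # Mean-pool point features within each voxel.
    feats = torch.zeros(unique_coords.shape[0], points.shape[1], device=points.device)
    feats.index_add_(0, inverse, points)
    counts = torch.bincount(inverse, minlength=unique_coords.shape[0]).clamp(min=1)
    feats = feats / counts.unsqueeze(1)
    return unique_coords, feats

# Example: 10,000 random points with (x, y, z, intensity), 0.1 m voxels.
pts = torch.rand(10_000, 4) * 50.0
coords, feats = voxelize(pts, voxel_size=0.1, pc_range=torch.zeros(3))
```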

3.2. BEV Feature Transformation

To project the features of camera images and Lidar point clouds into a unified coordinate system, we construct a shared BEV plane, which effectively fuses multimodal features. This method helps to preserve geometric and semantic information while reducing efficiency bottlenecks in view transformation.

3.2.1. Image-to-BEV Transformation

As shown in Figure 2, after projecting the Lidar point cloud onto the image, we extract the depth information of each pixel to create a depth map. Subsequently, we employ a three-layer network to extract depth features from the depth map. The resulting depth features are then combined with the original image features and fed to another three-layer network for feature refinement. Afterwards, a softmax layer is applied to generate the depth distribution. Next, we multiply the depth distribution with the image features to aggregate image and depth information. After this series of processing steps, we map the pixel coordinates $(u, v)$ of the image depth features to the Bird’s Eye View (BEV) plane using the camera intrinsic and extrinsic parameters, producing the corresponding coordinates $(x_{BEV}, y_{BEV})$ for each pixel, as shown in Equation (1).
$$\begin{bmatrix} x_{BEV} \\ y_{BEV} \end{bmatrix} = T \cdot K^{-1} \cdot \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \qquad (1)$$
where $K$ is the camera intrinsic parameter matrix and $T$ is the transformation matrix to the BEV coordinate system.
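A minimal sketch of this back-projection is given below, assuming that the per-pixel depth predicted from the depth distribution scales the ray $K^{-1}[u, v, 1]^\top$ before the transformation $T$ is applied (the depth scaling is implicit in Equation (1)); all names and the homogeneous form of $T$ are illustrative.

```python
import torch

def pixels_to_bev(u: torch.Tensor, v: torch.Tensor, depth: torch.Tensor,
                  K: torch.Tensor, T: torch.Tensor) -> torch.Tensor:
    """Back-project pixels (u, v) with estimated depth onto the BEV plane.

    K: (3, 3) camera intrinsic matrix.
    T: (4, 4) homogeneous transform from the camera frame to the BEV/ego frame.
    Returns the (N, 2) BEV coordinates (x_BEV, y_BEV).
    """
    uv1 = torch.stack([u, v, torch.ones_like(u)], dim=-1)            # (N, 3) homogeneous pixels
    rays = (torch.linalg.inv(K) @ uv1.unsqueeze(-1)).squeeze(-1)     # (N, 3) viewing rays
    cam_pts = depth.unsqueeze(-1) * rays                             # scale rays by per-pixel depth
    cam_h = torch.cat([cam_pts, torch.ones_like(depth).unsqueeze(-1)], dim=-1)  # (N, 4)
    ego_pts = (T @ cam_h.unsqueeze(-1)).squeeze(-1)[:, :3]           # transform to the ego frame
    return ego_pts[:, :2]                                            # keep the (x, y) BEV coordinates

# Example with dummy calibration: a pixel at the principal point, 10 m ahead.
K = torch.tensor([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
T = torch.eye(4)
u, v, d = torch.tensor([640.0]), torch.tensor([360.0]), torch.tensor([10.0])
print(pixels_to_bev(u, v, d, K, T))
```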

3.2.2. Point Cloud-to-BEV Transformation

After point cloud feature encoding, each feature is associated with specific coordinates $(x, y, z)$. To align these point cloud features with the BEV plane, a flattening transformation is conducted along the z-axis. This ensures accurate mapping of the point cloud features to their corresponding locations on the BEV plane.
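One simple way to realize this flattening, shown below as an illustrative sketch, is to scatter voxel features into a dense BEV grid indexed by their (x, y) cells while summing over the z dimension; the grid size, summation over z, and function name are assumptions.

```python
import torch

def flatten_to_bev(voxel_feats: torch.Tensor, coords: torch.Tensor,
                   bev_h: int, bev_w: int) -> torch.Tensor:
    """Collapse sparse voxel features onto the BEV plane.

    voxel_feats: (N, C) features of non-empty voxels.
    coords:      (N, 3) integer voxel indices (x, y, z).
    Returns a dense (C, bev_h, bev_w) BEV feature map where features sharing
    the same (x, y) cell are summed over z.
    """
    C = voxel_feats.shape[1]
    bev = torch.zeros(bev_h * bev_w, C, device=voxel_feats.device)
    flat_idx = coords[:, 1] * bev_w + coords[:, 0]          # row-major (y, x) index; z is ignored
    bev.index_add_(0, flat_idx, voxel_feats)                # sum features that fall in the same cell
    return bev.view(bev_h, bev_w, C).permute(2, 0, 1)       # (C, H, W) for 2D convolutions

# Example: 5 voxels, 16-dim features, a 4x4 BEV grid.
coords = torch.tensor([[0, 0, 0], [0, 0, 3], [1, 2, 1], [3, 3, 0], [3, 3, 2]])
feats = torch.randn(5, 16)
print(flatten_to_bev(feats, coords, bev_h=4, bev_w=4).shape)   # torch.Size([16, 4, 4])
```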

3.3. Convolutional BEV Fusion

After converting the point cloud and image features extracted from the backbone into a unified BEV space, feature misalignment between the two modalities still hinders their fusion. To address this issue, we employ a Convolutional BEV Encoder. This encoder consists of a single-layer convolutional fusion module and the SECOND [24] network. The single-layer 3 × 3 convolutional fusion module is capable of extracting classification targets from bird’s-eye view images, effectively alleviating the local misalignment problem of multimodal features. Meanwhile, the SECOND [24] network can further extract deeper and more semantic features, enhancing the representation ability and highlighting key target information.
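As an illustrative sketch (not the authors' exact module), the single-layer convolutional fusion step can be realized by concatenating the camera and Lidar BEV maps along the channel dimension and applying one 3 × 3 convolution; the channel sizes below are assumptions.

```python
import torch
import torch.nn as nn

class ConvBEVFusion(nn.Module):
    """Minimal single-layer 3x3 convolutional fusion of camera and Lidar BEV features.

    A deeper BEV backbone (e.g., SECOND-style) would normally follow this block
    to extract higher-level semantic features.
    """

    def __init__(self, cam_channels: int = 80, lidar_channels: int = 256, out_channels: int = 256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(cam_channels + lidar_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, cam_bev: torch.Tensor, lidar_bev: torch.Tensor) -> torch.Tensor:
        # Both inputs share the same BEV grid, so fusion reduces to channel concatenation + conv.
        return self.fuse(torch.cat([cam_bev, lidar_bev], dim=1))

# Example on a 180x180 BEV grid.
fusion = ConvBEVFusion()
fused = fusion(torch.randn(1, 80, 180, 180), torch.randn(1, 256, 180, 180))
print(fused.shape)   # torch.Size([1, 256, 180, 180])
```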

3.4. Adversarial Adaptive Data Augmentation

To enhance the generalization capability of 3D object detection methods, we propose an adversarial adaptive data augmentation strategy. By introducing minor perturbations to input images to hinder model convergence—perturbations that dynamically adapt during training—the model is trained to maintain stable predictions under these “adversarial” interferences, thereby achieving superior generalization performance.
The adversarial adaptive augmentation strategy is detailed in Algorithm 1. First, the input image $I_{img}$ is passed through the image feature extraction network $f(\cdot;\theta)$ to obtain the features $F_{img}$, and a softmax operation is applied to obtain $F_{imgs}$:
$$F_{img} = f(I_{img}; \theta) \qquad (2)$$
$$F_{imgs} = \mathrm{softmax}(F_{img}) \qquad (3)$$
Next, we randomly generate a perturbation $d_{torch}$ with the same shape as $I_{img}$ and normalize it to obtain $d_{norm}$. The normalized perturbation is scaled by the hyperparameter $\xi$, which controls the perturbation intensity, added to $I_{img}$, and fed into the image feature extraction module to obtain $F_{imgd}$:
$$F_{imgd} = \mathrm{log\_softmax}\big(f(I_{img} + \xi\, d_{norm}; \theta)\big) \qquad (4)$$
  Algorithm 1: Adversarial Adaptive Augmentation Strategy
  Input: $I_{img}$: original image
       $model$: image feature extraction network
       $\xi$: scale of the virtual adversarial perturbation
       $\epsilon$: maximum magnitude of the adversarial perturbation
  Output: $L_{Ada}$: local adversarial loss
  1   $F_{imgs} \leftarrow \mathrm{softmax}(model(I_{img}))$
  2   $d_{torch} \leftarrow \mathrm{random\_tensor}(I_{img}.\mathrm{shape})$    // create a random tensor with the input's shape
  3   $d_{norm} \leftarrow \mathrm{normalize}(d_{torch})$
  4   $F_{imgd} \leftarrow \mathrm{log\_softmax}(model(I_{img} + \xi\, d_{norm}))$
  5   $adv_{dis} \leftarrow \mathrm{KL}(F_{imgd}, F_{imgs})$
  6   $adv_{dis}.\mathrm{backward}()$
  7   $d_{grad} \leftarrow \mathrm{normalize}(d_{norm}.\mathrm{grad})$
  8   $r_{dis} \leftarrow d_{grad} \cdot \epsilon$
  9   $F_{dis} \leftarrow \mathrm{log\_softmax}(model(I_{img} + r_{dis}))$
  10  $L_{Ada} \leftarrow \mathrm{KL}(F_{dis}, F_{imgs})$
Afterwards, we evaluate the difference between $F_{imgs}$ and $F_{imgd}$ by computing their KL divergence $adv_{dis}$:
$$adv_{dis} = \mathrm{KL}(F_{imgd}, F_{imgs}) \qquad (5)$$
Subsequently, we compute the gradient of $adv_{dis}$ with respect to the perturbation $d_{norm}$ and normalize it to obtain $d_{grad}$. Then, $d_{grad}$ is scaled by the hyperparameter $\epsilon$, which controls the maximum perturbation magnitude, to form the adaptive perturbation $r_{dis}$. After that, $r_{dis}$ is added to the original image $I_{img}$, which is fed into the image feature extraction module to obtain the perturbed image feature $F_{dis}$.
$$r_{dis} = d_{grad} \cdot \epsilon \qquad (6)$$
$$F_{dis} = \mathrm{log\_softmax}\big(f(I_{img} + r_{dis}; \theta)\big) \qquad (7)$$
Finally, we compute the KL divergence between $F_{imgs}$ and $F_{dis}$ as the adaptive perturbation loss $L_{Ada}$:
$$L_{Ada} = \mathrm{KL}(F_{dis}, F_{imgs}) \qquad (8)$$
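For readers who prefer code, below is a minimal PyTorch-style sketch of Algorithm 1 under the assumption that the image branch produces class-like logits over which softmax and KL are taken along the channel dimension; the function name, per-sample normalization, and reduction mode are our own illustrative choices rather than the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def adversarial_adaptive_loss(model, images, xi=5.0, eps=2.0):
    """Sketch of the AADA loss (Algorithm 1): find the perturbation that most changes
    the prediction, then penalize the KL divergence under that perturbation."""
    with torch.no_grad():
        p_clean = F.softmax(model(images), dim=1)                    # F_imgs

    # Random perturbation with the input's shape, normalized per sample (d_norm).
    d = torch.randn_like(images)
    d = d / (d.flatten(1).norm(dim=1).view(-1, *([1] * (images.dim() - 1))) + 1e-8)
    d.requires_grad_(True)

    # Probe pass: how far does a small random perturbation move the prediction?
    logp_rand = F.log_softmax(model(images + xi * d), dim=1)         # F_imgd
    adv_dis = F.kl_div(logp_rand, p_clean, reduction="batchmean")    # adv_dis
    grad_d = torch.autograd.grad(adv_dis, d)[0]                      # mirrors the adv_dis.backward() step

    # Normalize the gradient direction and scale it by eps to get the adaptive perturbation r_dis.
    grad_d = grad_d / (grad_d.flatten(1).norm(dim=1).view(-1, *([1] * (images.dim() - 1))) + 1e-8)
    r_dis = eps * grad_d.detach()

    # Final pass: L_Ada encourages stable predictions under the adaptive perturbation.
    logp_adv = F.log_softmax(model(images + r_dis), dim=1)           # F_dis
    return F.kl_div(logp_adv, p_clean, reduction="batchmean")        # L_Ada
```

During training, $L_{Ada}$ is simply added to the detection losses (see Section 3.5), so the detector is optimized jointly for accuracy on clean inputs and consistency under adaptive perturbations.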

3.5. Loss Function

In our model, the total loss function $L_{total}$, as shown in Equation (13), consists of the following components: the classification loss $L_{cls}$ evaluates the model's classification accuracy by reflecting the differences between the predicted and ground-truth classes; the IoU loss $L_{IoU}$ measures the overlap between the predicted and ground-truth bounding boxes, and optimizing it makes the predicted boxes cover the targets more accurately; the bounding box loss $L_{bbox}$ penalizes the coordinate deviations between the predicted and ground-truth bounding boxes, helping to precisely adjust the position and size of the boxes; the keypoint detection regression loss $L_{heat}$ ensures the accurate positioning of keypoints; and the adaptive augmentation loss $L_{Ada}$ enhances the model's robustness in different environments. The combined effect of these loss functions drives the model to achieve high accuracy and stability in the object detection task.
$$L_{cls} = \alpha_t (1 - p_t)^{\gamma} \qquad (9)$$
$$L_{IoU} = \log(p_t) + \log\big(\sigma(p_t)\big) \qquad (10)$$
$$L_{bbox} = \sum_{i=1}^{4} \left| t_{pred,i} - t_{gt,i} \right| \qquad (11)$$
$$L_{heat} = \sum_{i} \alpha_t (1 - p_t)^{\gamma}\, e^{-\frac{(d_i - \mu_i)^2}{2\sigma^2}} \qquad (12)$$
$$L_{total} = \alpha_1 \cdot L_{cls} + \alpha_2 \cdot L_{IoU} + \alpha_3 \cdot L_{bbox} + \alpha_4 \cdot L_{heat} + \alpha_5 \cdot L_{Ada} \qquad (13)$$
where $p_t$ is the predicted probability, $\alpha_t$ is a weighting factor, $\gamma$ is a modulation factor, and $\sigma(p_t)$ is used to adjust the weights in the loss function. $t_{pred,i}$ and $t_{gt,i}$ are the parameters of the predicted and ground-truth bounding boxes, and $\alpha_1$ through $\alpha_5$ denote the weights of the corresponding loss terms.
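A sketch of how the total loss might be assembled is shown below; of the weights, only $\alpha_1 = 1$ and $\alpha_2 = 0.1$ are reported in Section 4.2, so the remaining values here are placeholders.

```python
import torch

def total_loss(l_cls, l_iou, l_bbox, l_heat, l_ada,
               alpha=(1.0, 0.1, 1.0, 1.0, 1.0)):
    """Weighted sum of the detection losses and the AADA loss (Equation (13)).

    Only alpha_1 = 1 and alpha_2 = 0.1 come from the paper; the rest are placeholders.
    """
    a1, a2, a3, a4, a5 = alpha
    return a1 * l_cls + a2 * l_iou + a3 * l_bbox + a4 * l_heat + a5 * l_ada

# Example with dummy scalar losses.
losses = [torch.tensor(v) for v in (0.8, 0.3, 0.5, 0.2, 0.1)]
print(total_loss(*losses))
```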

4. Experiments

4.1. Datasets

To evaluate the effectiveness of our method, we conducted experiments on the nuScenes-mini and KITTI [25] datasets. The nuScenes-mini dataset is a subset of the nuScenes [26] dataset, containing ten scenes and 1000 samples. The mean average precision (mAP) and nuScenes detection score (NDS) are used to evaluate and compare the performance of different methods. The KITTI dataset provides 14,999 images and corresponding point clouds for the detection task, of which 7481 groups are used for training and 7518 groups are used for testing. Each annotation is assigned to one of three difficulty levels, easy, moderate, and hard, according to the degree of occlusion, truncation, and the height of the annotation box. The mAP is used to evaluate the different methods on KITTI.

4.2. Implementation Details

On the nuScenes-mini dataset, we set $\xi$ to 5, $\epsilon$ to 2, the number of training epochs to 12, and $\alpha_1$ and $\alpha_2$ to 1 and 0.1, respectively. The batch size was set to 4 and the minimum learning rate to 0.0001. The same training parameters were used on the KITTI dataset. All experiments were conducted on a machine equipped with four NVIDIA V100 GPUs.

4.3. Evaluation Results

4.3.1. Results on nuScenes-Mini Validation Set

A series of state-of-the-art 3D object detection methods were re-implemented for evaluation, including PointPillars [27], PGD [28], CenterPoint [29], DAL [30], UVTR [31], SparseFusion [32], and BEVFusion [3]. The number of training iterations was adjusted during re-implementation to ensure a fair comparison. Detection results on the nuScenes-mini validation set are reported in Table 1: our method achieves a 1.5% improvement in mAP and a 0.4% improvement in NDS over the baseline detector BEVFusion and obtains the highest mAP among the compared approaches, confirming its effectiveness. The additional overhead is modest: as indicated in Table 2, GPU memory usage increases by about 14% relative to BEVFusion, and training time per epoch is extended by only 47 s.

4.3.2. Results on KITTI Validation Set

To further verify the effectiveness of the proposed method, additional experiments were conducted on the KITTI dataset. MVXNet [33] was selected as the baseline model, with the original image feature extraction network (ResNet [34]) being replaced by Swin Transformer [35]. The proposed AADA strategy was subsequently introduced to enhance model robustness. Experimental results are presented in Table 3.
As the results indicate, the proposed method achieves a minimum improvement of 0.8% in mean average precision (mAP). The gains are most pronounced for cyclist detection, where our method outperforms the baseline detector across all difficulty levels, with improvements of 3.4%, 2.6%, and 1.7% on the easy, moderate, and hard settings, respectively. These findings not only validate the method's effectiveness but also show that it consistently improves detection performance under different conditions, particularly in the more demanding cyclist category, providing substantial evidence of its practical applicability.

4.4. Ablation Study

This section analyzes the effects of the hyperparameters $\xi$ and $\epsilon$ on model performance. We then conduct ablation experiments that systematically compare the proposed adversarial perturbation with other augmentation strategies, including Gaussian noise, viewpoint transformation, and PGD.

4.4.1. Impact of ϵ

The hyperparameter ϵ determines the maximum perturbation magnitude during adversarial sample generation. Smaller ϵ values result in generated samples being closer to the original inputs. However, excessively small ϵ may compromise model robustness against real-world perturbations. Experimental investigation of ϵ values (Table 4) reveals optimal performance with ϵ = 2, where the proposed method achieves peak mAP and NDS scores. This configuration is consequently adopted as the default experimental setting.

4.4.2. Impact of ξ

Unlike $\epsilon$, $\xi$ directly controls the magnitude of the perturbation added to the input during training, which is proportional to $\xi$. Appropriately increasing this perturbation tends to make the model more robust to input perturbations; however, if $\xi$ is too large, excessive perturbation may degrade performance. We therefore conduct experiments to investigate the effect of $\xi$ and present the results in Table 5. Our model achieves the best mAP, together with a competitive NDS, when $\xi$ is set to 5, suggesting that careful tuning of $\xi$ enhances the model's ability to adapt to input perturbations while maintaining its accuracy.

4.4.3. Comparison with Other Data Augmentation Methods

To highlight the superiority of our method, we conducted comparative experiments against the baseline model and several perturbation strategies under the same experimental setting. Three perturbation types were considered: Gaussian noise, viewpoint transformation, and the dynamic data augmentation strategy PGD [36]. Gaussian noise was chosen as a common perturbation due to its broad applicability, with its mean and standard deviation set to 0 and 25, respectively. The viewpoint transformation method simulates image variations under different perspectives by adjusting angles and scales, with the rotation angle ranging from −5.4 to 5.4 degrees and the height and width scaling ratios ranging from 0.38 to 0.55, mimicking viewpoint changes encountered during real-world vehicle operation. For the PGD strategy, the perturbation norm is 0.015, the step size is 0.01, and the number of iterations is 10, ensuring controllable perturbations and reproducible experiments. The experimental results are shown in Table 6.
As shown in Table 6, our method demonstrates clear advantages over the static data augmentation approaches (Gaussian noise and viewpoint transformation). Compared with the dynamic strategy PGD, our method yields a slightly lower mAP, but PGD's training time is approximately twice that of ours. Our approach thus achieves high performance while substantially reducing training cost, demonstrating superior efficiency and practical applicability.
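For concreteness, the static Gaussian-noise baseline above can be implemented as in the following sketch; the assumption that pixel intensities lie in the 0-255 range is ours.

```python
import torch

def gaussian_noise_augment(images: torch.Tensor, mean: float = 0.0, std: float = 25.0) -> torch.Tensor:
    """Static Gaussian-noise perturbation used as a baseline in Table 6.

    Assumes images are in the 0-255 intensity range; the noise is fixed rather than
    adapted to the model, which is the key difference from AADA.
    """
    noise = torch.randn_like(images) * std + mean
    return (images + noise).clamp(0.0, 255.0)
```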

4.5. Visualization Results

In Figure 3, we present 3D object detection results on the nuScenes-mini dataset. Our method effectively identifies objects even in challenging scenes, such as those containing occluded or closely spaced objects. As shown in Figure 3a, it captures details that may be overlooked in natural scenes, and as shown in Figure 3b, it is also able to recognize objects without explicit labels. In Figure 4, our method produces better recognition results than MVXNet on the KITTI dataset, which further demonstrates its effectiveness.

5. Limitations and Future Research

5.1. Limitations

Despite its superior performance, our approach still faces some challenges.

5.1.1. Poor Recognition of Small Objects

As shown in Figure 5, when objects are distant and small, their recognition becomes more challenging. This is primarily because these small, far-away objects occupy only a few pixels in the image. Our feature extraction network partitions the image into several non-overlapping local windows for self-attention computation. The division of local windows makes feature extraction in each window relatively independent, with limited information interaction across windows. Small targets may be scattered across multiple windows, making it difficult for the model to capture their global features and contextual information, which affects the accurate detection of small targets.

5.1.2. Limitations of Manual Hyperparameter Selection for AADA

The two key hyperparameters of AADA (the perturbation scale $\xi$ and the perturbation magnitude $\epsilon$) significantly affect the algorithm's performance. Setting these parameters manually has limitations: they are difficult to adapt to different datasets and task requirements, and tuning them requires extensive experiments, which is a cumbersome and time-consuming process. Moreover, manually chosen parameters may no longer be applicable when the data distribution or environment changes, leading to unstable algorithm performance.

5.2. Future Research Trends

In response to the limitations of the current algorithm, our future work will focus on the following aspects:
1. Enhancing small-object extraction capability: We will consider dynamically adjusting the size and position of the local windows in the image feature extraction network based on image content, focusing feature extraction on small-object regions. Meanwhile, we will design feature extraction modules tailored to small objects to increase the model's sensitivity to and extraction efficiency for small-object features.
2. Diverse environment evaluation: We will conduct comprehensive environmental adaptability assessments of the proposed method. Its performance will be tested under various environmental conditions, including night driving and adverse weather, to analyze its limitations and optimize the algorithm in a targeted manner.
3. Automated hyperparameter tuning: We will develop automated hyperparameter adjustment methods to efficiently explore the hyperparameter space and identify better parameter combinations, thereby improving the overall performance and stability of the algorithm.

6. Conclusions

This paper proposes an adversarial adaptive data augmentation strategy to enhance the robustness of 3D object detection. Unlike traditional data augmentation methods that rely on simple linear interpolation or image region mixing, our AADA method introduces adaptive perturbations during training that can dynamically adjust their intensity and direction in real time to adapt to model changes and prevent premature convergence. These perturbations can identify model weaknesses and conduct targeted training, prompting the model to generate more consistent outputs, thereby achieving superior performance. Experimental results on the nuScenes-mini and KITTI datasets demonstrate the significant effectiveness and practicality of our method in 3D object detection.

Author Contributions

Conceptualization, S.L. and Q.C.; software, methodology, formal analysis, validation, investigation, S.L.; supervision, project administration, J.L. and J.F.; writing—original draft preparation, S.L.; writing—review and editing, J.L. and J.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, H.; Sima, C.; Dai, J.; Wang, W.; Lu, L.; Wang, H.; Zeng, J.; Li, Z.; Yang, J.; Deng, H.; et al. Delving Into the Devils of Bird’s-Eye-View Perception: A Review, Evaluation and Recipe. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 2023, 46, 2151–2170. [Google Scholar] [CrossRef] [PubMed]
  2. Huang, J.; Huang, G.; Zhu, Z.; Ye, Y.; Du, D. BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View. arXiv 2021, arXiv:2112.11790. [Google Scholar] [CrossRef]
  3. Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.; Han, S. BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation. arXiv 2022. [Google Scholar] [CrossRef]
  4. Yun, S.; Han, D.; Oh, S.; Chun, S.; Choe, J.; Yoo, Y. CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. In Proceedings of the International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  5. Zhang, H.; Cisse, M.; Oh, S.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond Empirical Risk Minimization. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar] [CrossRef]
  6. Yang, B.; Luo, W.; Urtasun, R. PIXOR: Real-time 3D Object Detection from Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  7. Graham, B.; Engelcke, M.; Laurens, V.D.M. 3D Semantic Segmentation with Submanifold Sparse Convolutional Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 9224–9232. [Google Scholar] [CrossRef]
  8. Liang, Z.; Zhang, M.; Zhang, Z.; Zhao, X.; Pu, S. RangeRCNN: Towards Fast and Accurate 3D Object Detection with Range Image Representation. arXiv 2020, arXiv:2009.00206. [Google Scholar] [CrossRef]
  9. Chen, Y.; Liu, J.; Zhang, X.; Qi, X.; Jia, J. VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21674–21683. [Google Scholar] [CrossRef]
  10. Chen, Z.; Pham, K.T.; Ye, M.; Shen, Z.; Chen, Q. Cross-Cluster Shifting for Efficient and Effective 3D Object Detection in Autonomous Driving. IEEE Int. Conf. Robot. Autom. 2024, 4273–4280. [Google Scholar] [CrossRef]
  11. Zhang, G.; Chen, J.; Gao, G.; Li, J.; Liu, S.; Hu, X. SAFDNet: A Simple and Effective Network for Fully Sparse 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024. [Google Scholar] [CrossRef]
  12. Chu, X.; Deng, J.; Li, Y.; Yuan, Z.; Zhang, Y.; Ji, J.; Zhang, Y. Neighbor-Vote: Improving Monocular 3D Object Detection through Neighbor Distance Voting. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 5239–5247. [Google Scholar]
  13. Chen, X.; Kundu, K.; Zhu, Y.; Ma, H.; Fidler, S.; Urtasun, R. 3D Object Proposals Using Stereo Imagery for Accurate Object Class Detection. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 2018, 40, 1259–1272. [Google Scholar] [CrossRef] [PubMed]
  14. Li, P.; Chen, X.; Shen, S. Stereo R-CNN Based 3D Object Detection for Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7636–7644. [Google Scholar]
  15. Guo, X.; Shi, S.; Wang, X.; Li, H. Learning LiDAR Geometry Aware Representations for Stereo-based 3D Detector. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3133–3143. [Google Scholar]
  16. Yan, L.; Yan, P.; Xiong, S.; Xiang, X.; Tan, Y. MonoCD: Monocular 3D Object Detection with Complementary Depths. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 10248–10257. [Google Scholar] [CrossRef]
  17. Hou, J.; Wang, T.; Ye, X.; Liu, Z.; Gong, S.; Tan, X.; Ding, E.; Wang, J.; Bai, X. OPEN: Object-Wise Position Embedding for Multi-view 3D Object Detection. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2024; pp. 146–162. [Google Scholar]
  18. Mao, J.; Shi, S.; Wang, X.; Li, H. 3D Object Detection for Autonomous Driving: A Comprehensive Survey. Int. J. Comput. Vis. 2022, 131, 1909–1963. [Google Scholar] [CrossRef]
  19. Wang, S.; Zhao, X.; Xu, H.; Chen, Z.; Yu, D.; Chang, J.; Yang, Z.; Zhao, F. Towards Domain Generalization for Multi-view 3D Object Detection in Bird-Eye-View. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 13333–13342. [Google Scholar] [CrossRef]
  20. Xie, E.; Yu, Z.; Zhou, D.; Philion, J.; Anandkumar, A.; Fidler, S.; Luo, P.; Alvarez, J.M. M2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation. arXiv 2022, arXiv:2204.05088. [Google Scholar] [CrossRef]
  21. Liang, M.; Yang, B.; Chen, Y.; Hu, R.; Urtasun, R. Multi-Task Multi-Sensor Fusion for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 7337–7345. [Google Scholar] [CrossRef]
  22. Zhao, X.; Sun, P.; Xu, Z.; Min, H.; Yu, H. Fusion of 3D LIDAR and Camera Data for Object Detection in Autonomous Vehicle Applications. IEEE Sens. J. 2020, 20, 4901–4913. [Google Scholar] [CrossRef]
  23. Tang, Y.; He, H.; Wang, Y.; Wang, H.; Mao, Z. Multi-modality 3D object detection in autonomous driving: A review. Neurocomputing 2023, 553, 101–118. [Google Scholar] [CrossRef]
  24. Yan, Y.; Mao, Y.; Li, B. SECOND: Sparsely Embedded Convolutional Detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef] [PubMed]
  25. Urtasun, R.; Lenz, P.; Geiger, A. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
  26. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A Multimodal Dataset for Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11618–11628. [Google Scholar] [CrossRef]
  27. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast Encoders for Object Detection from Point Clouds. arXiv 2019. [Google Scholar] [CrossRef]
  28. Wang, T.; Zhu, X.; Pang, J.; Lin, D. Probabilistic and Geometric Depth: Detecting Objects in Perspective. In Proceedings of the Conference on Robot Learning, London, UK, 8–11 November 2021. [Google Scholar]
  29. Yin, T.; Zhou, X.; Krhenbühl, P. Center-based 3D Object Detection and Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11779–11788. [Google Scholar] [CrossRef]
  30. Huang, J.; Ye, Y.; Liang, Z.; Shan, Y.; Du, D. Detecting As Labeling: Rethinking LiDAR-camera Fusion in 3D Object Detection. arXiv 2023. [Google Scholar] [CrossRef]
  31. Li, Y.; Chen, Y.; Qi, X.; Li, Z.; Sun, J.; Jia, J. Unifying Voxel-based Representation with Transformer for 3D Object Detection. Adv. Neural Inf. Process. Syst. 2022, 35, 18442–18455. [Google Scholar]
  32. Xie, Y.; Xu, C.; Rakotosaona, M.; Rim, P.; Tombari, F.; Keutzer, K.; Tomizuka, M.; Zhan, W. SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023. [Google Scholar]
  33. Vishwanath, A.S.; Yin, Z.; Oncel, T. MVX-Net: Multimodal VoxelNet for 3D Object Detection. In Proceedings of the International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 7276–7282. [Google Scholar]
  34. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the International Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  35. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin Transformer V2: Scaling Up Capacity and Resolution. In Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  36. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv 2019. [Google Scholar] [CrossRef]
Figure 1. An overview of our method. Our method introduces an adversarial adaptive data augmentation strategy in the image feature extraction branch, which is denoted by the dotted box.
Figure 2. Illustration of the image depth feature transformation. The point cloud is projected onto the image to obtain the depth information of the corresponding objects.
Figure 3. Object detection results for our method on the nuScenes-mini dataset. (a,b) demonstrate that our method produces superior results when dealing with occlusions and unlabeled objects.
Figure 4. Object detection results of our method on the KITTI dataset.
Figure 5. Object detection results for our method on the nuScenes-mini dataset.
Table 1. Performance comparison on the nuScenes-mini validation set.

Method | mAP | NDS | Car AP | Truck AP | Bus AP | Ped. AP | Motor AP | Bicycle AP
PointPillars | 29.6 | 40.8 | 83.5 | 31.2 | 75.1 | 77.5 | 7.0 | 21.4
PGD | 30.7 | 32.3 | 52.0 | 44.2 | 52.8 | 45.3 | 35.1 | 19.3
CenterPoint | 37.3 | 46.6 | 80.3 | 60.7 | 86.6 | 87.8 | 17.2 | -
DAL | 44.9 | 53.4 | 85.7 | 59.4 | 64.9 | 91.4 | 38.8 | 12.3
UVTR | 42.9 | 50.0 | 86.7 | 63.5 | 94.7 | 85.4 | 41.1 | 4.0
SparseFusion | 49.3 | 47.0 | 85.4 | 72.5 | 72.9 | 91.8 | 54.0 | 34.4
BEVFusion | 48.1 | 52.9 | 89.2 | 64.1 | 98.2 | 91.8 | 56.0 | 27.7
Ours | 49.6 | 53.3 | 88.3 | 69.1 | 99.3 | 92.1 | 58.8 | 33.4
Table 2. Comparison of training efficiency on the nuScenes-mini dataset.

Method | Training Time per Epoch (s) | GPU Memory (MiB)
BEVFusion | 216 | 17,847
Ours | 263 | 20,410
Table 3. Performance comparison on the KITTI validation set.

Method | mAP (Mod.) | Car Easy | Car Mod. | Car Hard | Pedestrian Easy | Pedestrian Mod. | Pedestrian Hard | Cyclist Easy | Cyclist Mod. | Cyclist Hard
MvxNet-ResNet | 63.5 | 88.4 | 78.8 | 74.7 | 62.9 | 58.4 | 55.2 | 70.9 | 53.2 | 49.6
MvxNet-SwinT | 64.1 | 88.2 | 78.2 | 75.5 | 64.1 | 58.5 | 54.1 | 71.9 | 55.6 | 52.7
Ours | 64.9 | 89.5 | 78.3 | 75.7 | 64.7 | 58.0 | 53.6 | 75.3 | 58.2 | 54.4
Table 4. Comparison of 3D object detection performance on the nuScenes-mini dataset with different ϵ values.

ξ | ϵ | mAP | NDS | Car AP | Truck AP | Bus AP | Ped. AP | Motor AP | Bicycle AP
5.0 | 1.0 | 47.9 | 52.3 | 88.6 | 64.0 | 97.3 | 90.4 | 54.7 | 28.3
5.0 | 2.0 | 49.6 | 53.5 | 88.3 | 69.1 | 99.3 | 92.1 | 58.8 | 33.4
5.0 | 3.0 | 48.7 | 53.1 | 88.6 | 65.9 | 98.3 | 91.3 | 54.6 | 35.1
Table 5. Performance comparison of 3D object detection on the nuScenes-mini dataset with different ξ values.

ξ | ϵ | mAP | NDS | Car AP | Truck AP | Bus AP | Ped. AP | Motor AP | Bicycle AP
3.0 | 2.0 | 48.8 | 52.2 | 88.4 | 63.4 | 99.2 | 91.5 | 47.7 | 39.2
4.0 | 2.0 | 48.8 | 52.6 | 88.1 | 66.6 | 98.3 | 91.4 | 48.6 | 36.2
5.0 | 2.0 | 49.6 | 53.5 | 88.3 | 69.1 | 99.3 | 92.1 | 58.8 | 33.4
6.0 | 2.0 | 48.1 | 52.9 | 88.3 | 67.0 | 99.3 | 91.0 | 50.4 | 34.8
7.0 | 2.0 | 48.8 | 53.0 | 88.4 | 65.0 | 97.9 | 91.2 | 54.7 | 35.4
8.0 | 2.0 | 48.6 | 53.2 | 88.3 | 67.0 | 97.9 | 91.2 | 57.9 | 30.2
9.0 | 2.0 | 49.1 | 53.7 | 88.6 | 69.0 | 99.3 | 91.0 | 56.5 | 29.1
10.0 | 2.0 | 49.3 | 52.9 | 88.2 | 67.8 | 99.4 | 91.8 | 51.8 | 36.1
Table 6. Comparison of 3D object detection results with different types of perturbation.

Method | mAP | NDS | Car AP | Truck AP | Bus AP | Ped. AP | Motor AP | Bicycle AP
Based on Gaussian noise | 49.0 | 53.2 | 88.8 | 66.9 | 99.7 | 91.6 | 57.5 | 36.3
Based on viewpoint transformation | 48.1 | 52.9 | 89.2 | 64.1 | 98.2 | 91.8 | 56.0 | 27.7
Based on PGD | 50.2 | 53.1 | 88.6 | 68.0 | 99.5 | 91.5 | 58.9 | 37.8
Ours | 49.6 | 53.5 | 88.3 | 69.1 | 99.3 | 92.1 | 58.8 | 33.4
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

