3.1. Experimental Settings and Dataset Description
Experiments were performed on a workstation with an NVIDIA GeForce RTX 2080 Ti GPU (22 GB memory) using PyTorch 2.2.2, CUDA 11.8, Python 3.12.11, and Ultralytics YOLOv8 framework (v8.1.9). For input resolution, VEDAI experiments (both ablation studies and comparative experiments) use to fully exploit the dataset’s native high-resolution aerial imagery, while DroneVehicle experiments adopt resolution to ensure fair comparison with mainstream methods in the literature.
Training Configuration: Training employed 100 epochs with batch size 8 and SGD optimizer. The learning rate schedule uses initial learning rate with final learning rate , momentum 0.937, and weight decay 0.0005. Warmup strategy spans 3 epochs with warmup momentum 0.8 and warmup bias learning rate 0.1. The nominal batch size (nbs) is set to 64 for gradient accumulation scaling. Mixed precision training (AMP) is disabled for reproducibility. Random seed is fixed at 0 with deterministic mode enabled to ensure reproducible results.
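The interaction between the actual batch size (8) and the nominal batch size (64) can be sketched as follows. This is a minimal illustration that assumes the scaling rule used by the Ultralytics trainer as we understand it (gradients accumulated over round(nbs / batch) steps, weight decay rescaled accordingly); the function name `accumulation_plan` is hypothetical and not part of any library.

```python
def accumulation_plan(batch_size: int, nbs: int, weight_decay: float):
    """Return (accumulation steps, effective weight decay).

    Assumed rule (mirrors the Ultralytics trainer as we understand it):
    gradients are accumulated until roughly `nbs` samples have been seen,
    and weight decay is rescaled by the effective batch size over `nbs`.
    """
    accumulate = max(round(nbs / batch_size), 1)
    scaled_wd = weight_decay * batch_size * accumulate / nbs
    return accumulate, scaled_wd


# Settings stated in the text: batch size 8, nbs 64, weight decay 0.0005.
steps, wd = accumulation_plan(batch_size=8, nbs=64, weight_decay=0.0005)
print(steps, wd)
```

Under this assumption, the optimizer would step once every 8 mini-batches, and the weight decay would remain at its nominal value because 8 × 8 equals the nominal batch size.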
Loss Function Configuration: The multi-task loss function combines three components with the following weights: bounding box regression loss weight , classification loss weight , and distribution focal loss (DFL) weight . Label smoothing is set to 0.0 (disabled). No class-specific weighting is applied, treating all vehicle categories equally during training.
Inference Configuration: During inference, IoU threshold for Non-Maximum Suppression (NMS) is set to 0.7, with maximum detections per image limited to 300. Confidence threshold is automatically determined by the framework. For evaluation, mAP is computed at IoU threshold 0.5 (mAP50) and averaged over IoU thresholds from 0.5 to 0.95 with step 0.05 (mAP50-95), following COCO evaluation protocol.
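For readers unfamiliar with the inference settings above, the following is a minimal, class-agnostic sketch of greedy NMS with the stated IoU threshold (0.7) and detection cap (300), together with the ten IoU thresholds used for mAP50-95. It is an illustration under simplifying assumptions, not the framework's implementation (which is typically class-aware and vectorized); all function names are hypothetical.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thres=0.7, max_det=300):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # A box survives only if it does not overlap any kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_thres for j in keep):
            keep.append(i)
            if len(keep) == max_det:
                break
    return keep


def coco_iou_thresholds():
    """IoU thresholds 0.50, 0.55, ..., 0.95 averaged over for mAP50-95."""
    return [round(0.5 + 0.05 * k, 2) for k in range(10)]


# Toy example: the first two boxes overlap with IoU 90/110 (about 0.82).
boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In the toy example the second box is suppressed because its overlap with the first exceeds 0.7, while the third, non-overlapping box is retained.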
Data Augmentation: Data augmentation encompasses HSV color adjustments (hue shift ±0.015, saturation scale 0.7, value scale 0.4), random translation (scale 0.1)/scaling (scale 0.5), Mosaic augmentation (probability 1.0), RandAugment (auto_augment policy), random erasing (probability 0.4), and horizontal flip (probability 0.5). MixUp and CopyPaste augmentations are disabled (probability 0.0). Rotation, shear, and perspective transformations are not applied (set to 0.0). All augmentations are synchronously applied to both RGB and infrared modalities to ensure spatial consistency.
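The requirement that both modalities receive the same geometric transformation can be illustrated with a minimal sketch of a paired horizontal flip. The key point is that the random decision is drawn once per image pair rather than once per modality. The helper names are hypothetical, images are represented as nested lists for simplicity, and box labels (which would need the same flip) are omitted.

```python
import random


def hflip(img):
    """Horizontally flip an image stored as a list of rows."""
    return [row[::-1] for row in img]


def paired_random_hflip(rgb, ir, p=0.5, rng=random):
    """Flip both modalities together with probability p.

    A single random draw is shared by the pair, so the RGB and IR
    images stay pixel-aligned after augmentation.
    """
    flip = rng.random() < p
    if flip:
        return hflip(rgb), hflip(ir), True
    return rgb, ir, False


rgb = [[1, 2, 3], [4, 5, 6]]
ir = [[7, 8, 9], [0, 1, 2]]
rng = random.Random(0)  # fixed seed, in the spirit of the deterministic setup
rgb_aug, ir_aug, flipped = paired_random_hflip(rgb, ir, p=0.5, rng=rng)
```

Drawing separate random numbers for each modality would flip one image without the other in roughly half of the flipped cases, breaking the spatial correspondence that the fusion modules rely on.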
This study employs two UAV aerial vehicle detection benchmark datasets for comprehensive experimental validation: VEDAI [35] serves as the primary evaluation benchmark for ablation studies, component analysis, and comparative experiments, while DroneVehicle [18] provides cross-dataset generalization assessment.
Table 2 summarizes the complete training hyperparameters used in our experiments.
Table 3 details the characteristics of both VEDAI and DroneVehicle datasets used in our experimental validation.
3.3. Ablation Experiments
Comprehensive ablation experiments verify (1) single-modality vs. multimodal fusion necessity and (2) TDA and MAFPN independent contributions and synergies via incremental addition. All ablation experiments use VEDAI dataset with
input resolution to fully leverage high-resolution aerial imagery details and validate component effectiveness under optimal conditions.
Table 4 compares single-modality (IR/RGB) with multimodal ARFM fusion on YOLOv8 baseline.
YOLOv8-Symmetric denotes a symmetric dual-encoder fusion architecture where both RGB and IR branches receive concatenated features ( channels) at each fusion node, treating both modalities equally throughout the network hierarchy. In contrast, our YOLOv8-ARFM employs asymmetric fusion where only the RGB branch accumulates fused features while the IR branch preserves single-modality characteristics.
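The structural difference between the two designs can be made concrete by tracking how many channels each branch receives at a fusion node. The sketch below is only a bookkeeping illustration: it assumes fusion by channel concatenation in both designs, uses an illustrative width of 64 channels that is not taken from the paper, and does not reproduce the internals of ARFM. The function name is hypothetical.

```python
def branch_input_channels(c_rgb: int, c_ir: int, mode: str) -> dict:
    """Channels entering each branch after a fusion node.

    symmetric : both branches receive the concatenated RGB+IR features.
    asymmetric: only the RGB branch receives the concatenation; the IR
                branch continues with its own single-modality features.
    """
    fused = c_rgb + c_ir
    if mode == "symmetric":
        return {"rgb": fused, "ir": fused}
    if mode == "asymmetric":
        return {"rgb": fused, "ir": c_ir}
    raise ValueError(f"unknown fusion mode: {mode}")


# Illustrative channel width only (assumption, not a value from the paper).
sym = branch_input_channels(64, 64, "symmetric")
asym = branch_input_channels(64, 64, "asymmetric")
```

Under these assumptions the IR branch in the asymmetric design processes half as many input channels at each fusion node, which is consistent in direction with the lower parameter count and GFLOPs reported for YOLOv8-ARFM.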
As shown in Table 4, YOLOv8-ARFM with RGB + IR multimodal fusion achieves 90.7% mAP50 and 64.1% mAP50-95, outperforming both single-modality baselines and the symmetric fusion approach. Compared to YOLOv8 RGB (85.7% mAP50, 58.5% mAP50-95), YOLOv8-ARFM improves mAP50 by 5.0 pp and mAP50-95 by 5.6 pp. Compared to YOLOv8 IR (83.2% mAP50, 56.5% mAP50-95), the improvement reaches 7.5 pp mAP50 and 7.6 pp mAP50-95. Notably, compared to YOLOv8-Symmetric (88.8% mAP50, 62.0% mAP50-95), our asymmetric ARFM design gains 1.9 pp mAP50 and 2.1 pp mAP50-95 while using fewer parameters (4.75 M vs. 5.20 M, −8.7%) and lower computational cost (12.1 vs. 13.7 GFLOPs, −11.7%). These results support asymmetric over symmetric fusion: preserving IR modality-specific features while allowing the RGB branch to accumulate cross-modal information yields a better accuracy–efficiency trade-off than treating both modalities equally.
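The comparisons above mix two kinds of differences: absolute gains in percentage points (pp) for accuracy, and relative changes in percent for cost. A short sketch recomputing both from the quoted Table 4 values is given below; the helper names are hypothetical.

```python
def pp_gain(new: float, old: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return round(new - old, 1)


def rel_change(new: float, old: float) -> float:
    """Relative change of `new` with respect to `old`, in percent."""
    return round(100.0 * (new - old) / old, 1)


# Accuracy gains of YOLOv8-ARFM (90.7 / 64.1) over each reference.
gain_vs_rgb = (pp_gain(90.7, 85.7), pp_gain(64.1, 58.5))   # (5.0, 5.6)
gain_vs_ir = (pp_gain(90.7, 83.2), pp_gain(64.1, 56.5))    # (7.5, 7.6)
gain_vs_sym = (pp_gain(90.7, 88.8), pp_gain(64.1, 62.0))   # (1.9, 2.1)

# Cost of the asymmetric design relative to the symmetric one.
param_change = rel_change(4.75, 5.20)   # -8.7 (% parameters)
flops_change = rel_change(12.1, 13.7)   # -11.7 (% GFLOPs)
```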
Figure 3 compares detection results across five challenging aerial scenarios: urban arterial roads (column 1), parking lots (column 2), agricultural farmland (column 3), construction sites with unpaved terrain (column 4), and winding country paths (column 5).
The RGB modality (b) achieves reliable detection in high-contrast scenarios (columns 2, 4–5) where vehicles exhibit distinct color features against pavement or terrain backgrounds. However, RGB completely fails in low-contrast urban roads (column 1) where earth-tone vehicles merge with surrounding soil backgrounds, missing vehicles that are clearly visible in ground truth (a). RGB also shows degraded performance in agricultural scenes (column 3) with lower confidence scores. The infrared modality (c) provides grayscale thermal signatures independent of surface appearance, successfully detecting vehicles in challenging low-contrast scenes where RGB fails (column 1). Nevertheless, IR suffers from reduced spatial precision due to thermal blurring effects and lower resolution, resulting in imprecise bounding box localization and lower confidence scores across multiple scenarios.
Our ATM-Net (d) demonstrates superior robustness by synergistically fusing complementary cues: it maintains RGB’s sharp localization in favorable lighting while compensating with IR’s thermal contrast in adverse conditions. Notably, ATM-Net achieves consistent detection across all terrain types with significantly elevated confidence scores (0.75–0.97 range), validating the effectiveness of asymmetric recurrent fusion for handling diverse UAV surveillance scenarios.
To further analyze the independent contributions and synergies of core modules, Table 5 evaluates TDA and MAFPN through incremental addition.
As shown in Table 5, both TDA and MAFPN independently contribute to performance improvement over the baseline YOLOv8-ARFM (90.7% mAP50, 64.1% mAP50-95). Adding TDA alone (YOLOv8-ARFM+TDA) improves mAP50 to 91.7% (+1.0 pp) and mAP50-95 to 64.5% (+0.4 pp), and raises recall from 0.756 to 0.828 (+9.5% relative), indicating TDA’s effectiveness in capturing more true positives through tri-dimensional attention recalibration. Adding MAFPN alone (YOLOv8-ARFM+MAFPN) achieves similar gains with mAP50 of 91.6% (+0.9 pp) and mAP50-95 of 64.5% (+0.4 pp), while improving recall to 0.811 (+7.3% relative). The complete ATM-Net combining both modules achieves the best performance: 92.4% mAP50 (+1.7 pp over baseline) and 64.7% mAP50-95 (+0.6 pp), with balanced precision (0.913) and recall (0.816). The computational cost remains modest: ATM-Net requires 4.83 M parameters and 13.0 GFLOPs, a small overhead (+1.7% parameters, +7.4% GFLOPs) compared to the baseline.
Inference speed (FPS) is measured on Huawei Atlas AIpro-20T mobile GPU with batch size 1 and input resolution. The baseline achieves 38.9 FPS, while TDA slightly improves inference speed to 39.6 FPS (+1.8%), suggesting that TDA’s attention-based feature recalibration helps optimize the feature flow and computational efficiency. MAFPN reduces speed to 36.3 FPS (−6.7%) due to additional multi-path feature aggregation operations. The complete ATM-Net achieves 35.9 FPS (−7.7%), demonstrating acceptable computational overhead on edge computing platforms. All configurations maintain real-time inference capability (>30 FPS), demonstrating ATM-Net’s suitability for practical UAV deployment scenarios on resource-constrained mobile GPU platforms.
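A throughput figure of this kind is usually obtained by timing repeated single-image forward passes after a warm-up phase. The following is a minimal sketch of such a harness, not the measurement script used in this study; `measure_fps` and `latency_ms` are hypothetical helpers, and on an accelerator the device would additionally need to be synchronized before each clock read.

```python
import time


def measure_fps(infer, n_warmup: int = 10, n_runs: int = 100) -> float:
    """Time `infer()` (one batch-size-1 forward pass) and return frames/s."""
    for _ in range(n_warmup):      # warm-up runs are excluded from timing
        infer()
    start = time.perf_counter()
    for _ in range(n_runs):
        infer()
    elapsed = time.perf_counter() - start
    return n_runs / elapsed if elapsed > 0 else float("inf")


def latency_ms(fps: float) -> float:
    """Per-image latency in milliseconds implied by a throughput in FPS."""
    return 1000.0 / fps


# Stand-in workload so the sketch runs without a model (assumption).
fps = measure_fps(lambda: sum(range(1000)), n_warmup=2, n_runs=20)

# Latencies implied by the reported throughputs.
baseline_ms = round(latency_ms(38.9), 1)   # 25.7 ms
atmnet_ms = round(latency_ms(35.9), 1)     # 27.9 ms
```

Expressed as latency, the reported throughputs correspond to roughly 25.7 ms per image for the baseline and 27.9 ms for the complete ATM-Net, both within the 33.3 ms budget implied by the 30 FPS real-time criterion.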
Figure 4 visualizes incremental module contributions across two representative scenarios: a parking lot scene (row 1) and an urban road scene (row 2). Comparing against ground truth (a), the baseline YOLOv8-ARFM (b) exhibits suboptimal performance with relatively low confidence scores (0.50 and 0.47), indicating insufficient feature discrimination in multimodal fusion. Adding MAFPN (c) substantially improves confidence to 0.70 and 0.72 across both scenarios, demonstrating the effectiveness of multi-scale adaptive feature aggregation in capturing vehicles at different spatial resolutions. Introducing TDA attention alone (d) shows uneven improvements: enhanced performance in the parking lot (0.73) but limited gains in the urban road scene (0.54), though still improving over baseline. This suggests MAFPN provides more consistent cross-scenario benefits while TDA offers stronger gains in spatially dense scenarios. The complete ATM-Net (e) combining both MAFPN and TDA achieves the highest confidence scores (0.81 and 0.85) across both scenarios, validating their complementary synergy—MAFPN provides robust multi-scale representations while TDA refines feature recalibration through tri-dimensional attention, collectively enabling superior multimodal vehicle detection.
To provide comprehensive evaluation metrics and address the need for per-class analysis, confidence calibration, and statistical rigor, we present additional diagnostic visualizations in Figure 5.
3.4. Comparative Experiments
To comprehensively evaluate ATM-Net’s performance and generalization capability, we conduct comparative experiments against state-of-the-art single-modality and multimodal fusion methods on both VEDAI and DroneVehicle datasets. VEDAI experiments use input resolution to fully exploit the dataset’s native high-resolution imagery, while DroneVehicle experiments adopt resolution for fair comparison with mainstream methods. All experiments use consistent training configurations: 100 epochs, batch size 8, SGD optimizer with momentum 0.937, and initial learning rate 0.01.
Table 6 presents quantitative comparisons on the VEDAI dataset, including seven single-modality RGB methods (YOLOv5n/s, YOLO-S, SuperDet, DS-YOLOv8, YOLOv8n, YOLOv10n), two single-modality Thermal methods (SPD-YOLOv8, DBD-YOLOv8), and eight multimodal RGB+Thermal fusion methods (LW-CNN, CMCA, EMCF, CMAFF, SuperYOLO, ICAFusion, GHOST, Multispectral DETR).
ATM-Net achieves 92.4% mAP50 and 64.7% mAP50-95 with only 4.83 M parameters, demonstrating superior parameter efficiency and detection accuracy. Compared to large-scale multimodal methods, ATM-Net significantly outperforms Multispectral DETR (+9.7 pp mAP50, +13.9 pp mAP50-95) while using 15.1× fewer parameters (4.83 M vs. 73.0 M), and surpasses GHOST (+12.1 pp mAP50, +15.7 pp mAP50-95) with 2.0× fewer parameters. Compared to lightweight multimodal methods, ATM-Net outperforms SuperYOLO by +17.3 pp (mAP50) while using 1.45× fewer parameters, surpasses LRFL-YOLO by +17.9 pp with 2.15× fewer parameters, and exceeds FGMF by +26.1 pp with 1.76× fewer parameters. Against the best RGB single-modality method DS-YOLOv8 (76.9% mAP50), ATM-Net gains +15.5 pp, and against the best Thermal single-modality method DBD-YOLOv8 (76.0% mAP50), ATM-Net gains +16.4 pp, demonstrating effective RGB-IR asymmetric fusion. Versus the baseline YOLOv8n, ATM-Net achieves +23.8 pp mAP50 and +15.5 pp mAP50-95 improvements with only +60.8% additional parameters.
Table 7 provides per-class performance on VEDAI’s eight categories.
ATM-Net achieves a mean mAP50 of 92.4% and leads on six categories: car (96.0%), truck (93.8%), pickup (95.0%), tractor (95.2%), camping car (95.9%), and van (99.5%). Among the compared methods in this table, ATM-Net performs particularly well on these vehicle subtypes. Compared to YOLOv8n, ATM-Net shows substantial improvements in challenging categories (+22.5% for van and +12.8% for truck), demonstrating the value of RGB-IR complementarity in detecting vehicles with low thermal contrast or visual ambiguity.
Figure 6 visualizes qualitative comparisons across seven methods. Single-modality methods (YOLOv5n, YOLOv8n) exhibit missed/false detections due to single-sensor limitations. Multimodal methods show progressive improvements: ARSOD-YOLO and ICAFusion provide enhanced detection over single-modality baselines and FGMF achieves better fusion performance, while ATM-Net demonstrates the highest completeness and accuracy, especially in challenging dense parking lots and complex road scenarios, validating the effectiveness of asymmetric recurrent multimodal fusion.
To further validate cross-dataset generalization capability, we extend comparative experiments to the large-scale DroneVehicle dataset, which provides complementary evaluation with over 28,000 RGB-IR image pairs captured across diverse urban scenarios under varying day–night illumination conditions. This dataset presents additional challenges including higher scene complexity, increased vehicle density, and more severe occlusion compared to VEDAI, making it an ideal testbed for evaluating multimodal fusion robustness in real-world UAV deployment scenarios.
Table 8 presents comprehensive per-class performance comparisons across five vehicle categories: Car, Freight Car, Truck, Bus, and Van. The evaluation encompasses three modality configurations: RGB single-modality methods (Faster R-CNN, RoI Trans), IR single-modality methods (Faster R-CNN, RoI Trans), and eight state-of-the-art RGB+IR multimodal fusion approaches spanning different architectural paradigms: early fusion (CIAN), attention-based fusion (AR-CNN, UA-CMDet, TSFADet, CALNet), transformer-based fusion (C2Former), end-to-end fusion (E2E-MFD), and dual-stream fusion (DMM).
ATM-Net achieves the highest overall mAP of 83.7% on DroneVehicle, demonstrating strong cross-dataset generalization. Compared to single-modality baselines, RGB methods (Faster R-CNN: 55.9%, RoI Trans: 61.6%) and IR methods (Faster R-CNN: 64.2%, RoI Trans: 65.5%) exhibit significant performance gaps due to modality-specific limitations: RGB suffers from poor night-time performance while IR lacks spatial detail for small vehicle discrimination.
Among multimodal fusion methods, ATM-Net outperforms the previous best DMM by +4.4 pp (mAP) while using only 5.5% of its parameters (4.83 M vs. 88.0 M), i.e., 18.2× fewer parameters. Compared to other fusion methods, ATM-Net surpasses C2Former (+9.5 pp) with 20.9× fewer parameters, and exceeds TSFADet (+10.6 pp) with 21.7× fewer parameters. This substantial parameter efficiency advantage supports the effectiveness of the asymmetric recurrent fusion architecture for lightweight multimodal vehicle detection.
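The parameter-efficiency figures quoted for the comparison with DMM follow directly from the two parameter counts. A short sketch of the arithmetic is given below, with hypothetical helper names; only parameter counts stated explicitly in the text are used.

```python
def times_fewer(other_m: float, ours_m: float) -> float:
    """How many times fewer parameters `ours` has than `other` (in millions)."""
    return round(other_m / ours_m, 1)


def share_percent(ours_m: float, other_m: float) -> float:
    """Parameter count of `ours` as a percentage of `other`."""
    return round(100.0 * ours_m / other_m, 1)


ATM_NET_M = 4.83   # ATM-Net parameters (millions), as stated in the text
DMM_M = 88.0       # DMM parameters (millions), as stated in the text

dmm_ratio = times_fewer(DMM_M, ATM_NET_M)     # 18.2
dmm_share = share_percent(ATM_NET_M, DMM_M)   # 5.5
```

The two figures are reciprocal views of the same quantity: a model with 5.5% of the parameters is about 18.2 times smaller.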
Per-class analysis reveals ATM-Net’s balanced performance across all vehicle categories: it achieves the highest scores in Freight Car (74.7%), Bus (90.3%), and Van (67.1%), ties with DMM for Car (90.5%), and maintains competitive Truck detection (78.5%, only 0.8 pp below E2E-MFD). Notably, ATM-Net demonstrates particular strength in detecting challenging small vehicle categories (Freight Car, Van) where precise multimodal feature alignment is critical, improving over DMM by 1.5 pp in Freight Car and 2.0 pp in Van. The consistent high performance across diverse vehicle types (ranging from large buses to small vans) supports the robustness of tri-dimensional attention and multi-scale adaptive fusion mechanisms in handling the significant inter-class scale variations inherent to UAV aerial vehicle detection.
Figure 7 provides qualitative visualization comparing ground truth annotations (row a) with ATM-Net’s detection results (row b) across six representative scenarios from the DroneVehicle dataset. The scenarios include densely packed urban intersections (columns 1–2), tree-occluded road intersection (column 3), open highway segment (column 4), large-scale parking facility with hundreds of vehicles (column 5), and nighttime residential area (column 6). Overall, ATM-Net demonstrates strong detection performance with high completeness and accurate localization across most scenarios. However, blue arrows highlight several misdetection cases that reveal current limitations: in columns 1, 2, and 5, false positives occur where non-vehicle objects (such as road markings, shadows, or ground textures) are incorrectly classified as vehicles, indicating challenges in distinguishing vehicles from visually similar background elements in complex urban environments. These failure cases suggest that while ATM-Net achieves competitive overall performance (83.7% mAP), further improvements in background discrimination and handling of visual distractors would enhance robustness for practical deployment in diverse real-world UAV scenarios.