Article

ATM-Net: A Lightweight Multimodal Fusion Network for Real-Time UAV-Based Object Detection

1 The School of Geography and Remote Sensing, Guangzhou University, Guangzhou 510006, China
2 The Institute of Aerospace Remote Sensing Innovations, Guangzhou University, Guangzhou 510006, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Drones 2026, 10(1), 67; https://doi.org/10.3390/drones10010067
Submission received: 30 November 2025 / Revised: 16 January 2026 / Accepted: 18 January 2026 / Published: 20 January 2026

Highlights

What are the main findings?
  • ATM-Net achieves state-of-the-art performance on two UAV aerial vehicle detection benchmarks (92.4% mAP50 and 64.7% mAP50-95 on the VEDAI dataset, and 83.7% mAP on the DroneVehicle dataset) with only 4.83M parameters. This lightweight design uses over 15× fewer parameters than existing heavyweight multimodal fusion methods while maintaining superior detection accuracy for small aerial targets.
  • Three synergistic innovations—Asymmetric Recurrent Fusion Module (ARFM), Tri-Dimensional Attention (TDA) mechanism, and Multi-scale Adaptive Feature Pyramid Network (MAFPN)—collectively enable efficient RGB–infrared fusion that balances cross-modal collaboration and modality independence, achieving 40% parameter reduction compared to symmetric architectures.
What is the implication of the main finding?
  • This work demonstrates that lightweight multimodal fusion networks can achieve optimal accuracy–efficiency balance for real-time UAV edge computing deployment, enabling practical all-weather autonomous drone operations in resource-constrained scenarios including emergency response, traffic monitoring, and precision agriculture where both detection performance and computational efficiency are critical.
  • The proposed asymmetric fusion paradigm and tri-dimensional attention mechanism provide generalizable architectural principles for efficient multimodal learning in aerial perception systems, advancing the development of onboard intelligence for autonomous drones operating in complex environments with extreme scale variations and demanding real-time processing requirements.

Abstract

UAV-based object detection faces critical challenges including extreme scale variations (targets occupy 0.1–2% image area), bird’s-eye view complexities, and all-weather operational demands. Single RGB sensors degrade under poor illumination while infrared sensors lack spatial details. We propose ATM-Net, a lightweight multimodal RGB–infrared fusion network for robust UAV vehicle detection. ATM-Net integrates three innovations: (1) Asymmetric Recurrent Fusion Module (ARFM) performs “extraction→fusion→separation” cycles across pyramid levels, balancing cross-modal collaboration and modality independence. (2) Tri-Dimensional Attention (TDA) recalibrates features through orthogonal Channel-Width, Height-Channel, and Height-Width branches, enabling comprehensive multi-dimensional feature enhancement. (3) Multi-scale Adaptive Feature Pyramid Network (MAFPN) constructs enhanced representations via bidirectional flow and multi-path aggregation. Experiments on VEDAI and DroneVehicle datasets demonstrate superior performance—92.4% mAP50 and 64.7% mAP50-95 on VEDAI, 83.7% mAP on DroneVehicle—with only 4.83M parameters. ATM-Net achieves optimal accuracy–efficiency balance for resource-constrained UAV edge platforms.

1. Introduction

Unmanned aerial vehicles (UAVs) have become essential platforms for intelligent surveillance, enabling applications such as traffic monitoring, emergency response, and infrastructure inspection [1,2]. Vision-based object detection is fundamental to these applications, providing situational awareness for autonomous navigation and decision-making.
However, UAV-based aerial detection faces distinct challenges [3]: (1) extreme scale variations where targets occupy only 0.1–2% of image area [4,5]; (2) bird’s-eye view complexities with drastic appearance variations; (3) complex background clutter from urban environments [4]; (4) all-weather operational requirements across varying illumination and weather conditions [6,7].
Single-modality approaches have inherent limitations. RGB-based detectors, including two-stage methods (Faster R-CNN [8], RoI Trans [9]) and one-stage methods (YOLOv5 [10], YOLOv8 [11], FFCA-YOLO [12], DS-YOLOv8 [13], SuperDet [14]), suffer severe degradation under poor illumination. Thermal infrared methods (SPD-YOLOv8 [15], DBD-YOLOv8 [16]) provide illumination-invariant detection but lack spatial resolution and texture details.
Multimodal RGB-IR fusion methods leverage complementary information from both modalities. Early fusion approaches (FGMF [17], UA-CMDet [18]) concatenate raw data but cause destructive inter-modal coupling. Late fusion methods (CIAN [19], AR-CNN [20]) merge predictions at decision level but miss feature-level complementarity. Middle fusion architectures (SuperYOLO [21], ICAFusion [22], ARSOD-YOLO [23], LRFL-YOLO [24], GHOST [25], CALNet [26], TSFADet [27], C2Former [28], E2E-MFD [29], MultispectralDETR [30], DMM [31], YOLOFusion [32], DAMSDet [33], MMYNet [34]) enable cross-modal feature interaction but face critical limitations: (1) simplistic fusion mechanisms (concatenation or addition) fail to model complex cross-modal dependencies; (2) symmetric fusion ignores asymmetric modality characteristics; (3) excessive computational overhead—MultispectralDETR (73.0 M), ICAFusion (120.2 M), TSFADet (104.7 M), C2Former (100.8 M), DMM (88.0 M) parameters—is incompatible with resource-constrained UAV platforms.
To address these limitations, we propose ATM-Net, a lightweight multimodal fusion network with three core innovations:
  • Asymmetric Recurrent Fusion Module (ARFM): Employs an asymmetric dual-stream architecture with recurrent “extraction→fusion→separation” cycles across pyramid levels ($P_2$–$P_5$). One stream accumulates cross-modal information while the other preserves modality-specific characteristics, resolving the dilemma between collaboration and independence with 40% parameter reduction compared to symmetric architectures.
  • C2f_TDA Module with Tri-Dimensional Attention: Integrates TDA into C2f backbone, recalibrating features through three orthogonal branches (Channel-Width, Height-Channel, Height-Width) for comprehensive multi-dimensional feature enhancement, particularly benefiting small object detection.
  • Multi-scale Adaptive Feature Pyramid Network (MAFPN): Constructs enhanced multi-scale representations through bidirectional information flow and multi-path adaptive fusion, addressing extreme scale variations in aerial imagery.
Experiments demonstrate ATM-Net achieves 92.4% mAP50 and 64.7% mAP50-95 on VEDAI, as well as 83.7% mAP on DroneVehicle, with only 4.83 M parameters—15.1× fewer than MultispectralDETR and 18.2× fewer than DMM—achieving optimal accuracy–efficiency balance for UAV edge platforms.
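The parameter ratios quoted above can be checked with a few lines of arithmetic. The sketch below only restates the parameter counts given in the Introduction; the dictionary and the helper name size_ratio are illustrative and not part of the original work.

```python
# Parameter counts in millions, as quoted in the Introduction.
PARAMS_M = {
    "ATM-Net": 4.83,
    "MultispectralDETR": 73.0,
    "DMM": 88.0,
    "C2Former": 100.8,
    "TSFADet": 104.7,
    "ICAFusion": 120.2,
}


def size_ratio(model: str, reference: str = "ATM-Net") -> float:
    """How many times larger `model` is than `reference`, by parameter count."""
    return PARAMS_M[model] / PARAMS_M[reference]


for name in PARAMS_M:
    if name != "ATM-Net":
        print(f"{name}: {size_ratio(name):.1f}x the parameters of ATM-Net")
```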
The remainder of this paper is organized as follows: Section 2 presents the ATM-Net methodology; Section 3 reports experimental results; Section 4 discusses performance analysis and limitations; Section 5 concludes the work.

2. Materials and Methods

This section presents the ATM-Net architecture through four complementary components: (1) overall network design establishing the dual-modality fusion framework; (2) Asymmetric Recurrent Fusion Module (ARFM) introducing the unique asymmetric information flow pattern; (3) C2f_TDA module integrating tri-dimensional attention mechanism for enhanced feature recalibration; (4) Multi-scale Adaptive Feature Pyramid Network (MAFPN) enabling adaptive multi-scale aggregation for robust small object detection.

2.1. ATM-Net Overall Network Design

Drawing inspiration from YOLOv8’s three-stage design paradigm, ATM-Net adopts a backbone–neck–head architecture tailored for dual-modality fusion, as illustrated in Figure 1. Following YOLOv8’s modular philosophy, the network processes dual-modality input through asymmetric recurrent fusion in the backbone, constructs multi-scale representations via MAFPN in the neck, and outputs predictions at three detection scales. This design leverages YOLOv8’s proven architectural principles while introducing multimodal-specific innovations to address RGB-IR fusion challenges.
Backbone Design: The backbone implements an asymmetric dual-branch architecture for multimodal feature extraction. RGB and infrared image pairs are fed into two independent parallel branches. The backbone comprises five hierarchical stages with progressively reduced resolutions. At each pyramid level ($P_2$–$P_5$), two parallel branches perform stride-2 downsampling followed by C2f_TDA modules with integrated tri-dimensional attention. These modules repeat 3 times at $P_2$/$P_5$ and 6 times at $P_3$/$P_4$. Features from both branches are concatenated at layers 9, 14, 19, and 24, which serve as fusion nodes. Propagation is asymmetric: concatenated features flow into one branch while the other branch receives single-path features. SPPF at $P_5$ expands the receptive field through cascaded pooling.
Neck Design: The neck employs the MAFPN architecture to construct enhanced representations through bidirectional flow. The top-down path begins with SPPF-processed $P_5$ features, which are concatenated with channel-reduced $P_4$ features (layer 19) through Conv operations, then processed by C2f modules. Features are progressively upsampled and fused with $P_3$ (layer 14) and $P_2$ (layer 9) through Concat operations and C2f processing, generating multi-scale representations. The bottom-up path then aggregates features from multiple C2f modules at different levels through Conv operations and Concat, producing the final feature maps that feed into the detection heads.
Detection Head: The detection head receives three enhanced feature maps from MAFPN and predicts at three scales. For 1024 × 1024 input images, $P_3/8$ outputs 128 × 128 feature maps for small object detection, $P_4/16$ outputs 64 × 64 feature maps for medium object detection, and $P_5/32$ outputs 32 × 32 feature maps for large object detection. Each scale predicts bounding box coordinates, objectness confidence, and class probabilities.
Information Flow and Design Rationale: The overall information flow in ATM-Net follows a hierarchical multimodal processing pipeline. In the backbone stage, RGB and IR image pairs enter two parallel branches, where features are progressively extracted through stride-2 downsampling and C2f_TDA modules at each pyramid level ($P_2$–$P_5$). At fusion nodes (layers 9, 14, 19, 24), cross-modal features are concatenated and asymmetrically routed: fused features flow into the RGB branch while the IR branch maintains single-modality propagation, enabling progressive cross-modal integration without losing modality-specific characteristics. In the neck stage, MAFPN constructs bidirectional feature pathways: the top-down path propagates high-level semantic information from $P_5$ to lower levels, while the bottom-up path aggregates fine-grained spatial details back to higher levels. This bidirectional flow ensures comprehensive feature fusion across scales. Finally, three detection heads receive enhanced feature maps at different resolutions (128 × 128, 64 × 64, 32 × 32), enabling robust detection of vehicles across the extreme scale variations typical of UAV aerial imagery. This end-to-end design achieves effective RGB-IR fusion with only 4.83 M parameters, suitable for resource-constrained UAV edge platforms.
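The feature-map sizes quoted for the detection head follow directly from dividing the input side length by each stride. A minimal sketch of that arithmetic is given below; the function name feature_map_sizes is hypothetical, and the 640 × 640 case corresponds to the DroneVehicle input resolution reported in the experimental settings.

```python
def feature_map_sizes(input_size: int, strides=(8, 16, 32)) -> dict:
    """Map each detection stride to the side length of its square feature map."""
    sizes = {}
    for stride in strides:
        if input_size % stride != 0:
            raise ValueError(f"input size {input_size} is not divisible by stride {stride}")
        sizes[stride] = input_size // stride
    return sizes


print(feature_map_sizes(1024))  # {8: 128, 16: 64, 32: 32}
print(feature_map_sizes(640))   # {8: 80, 16: 40, 32: 20}
```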

2.2. Asymmetric Recurrent Fusion Module (ARFM)

Traditional multimodal fusion strategies face a fundamental trade-off: early fusion couples modalities too tightly (losing modality-specific characteristics), while late fusion misses feature-level complementarity. ARFM addresses this dilemma through asymmetric recurrent middle fusion with three core design principles.
Dual-Branch Hierarchical Architecture: As illustrated in Figure 1 (left), ARFM processes RGB and IR inputs through two parallel branches across four pyramid levels ($P_2$–$P_5$). At each level $l \in \{2, 3, 4, 5\}$, both branches perform feature extraction through stride-2 convolution followed by C2f_TDA modules:
$$F_l^{RGB} = \mathrm{C2f\_TDA}\left(\mathrm{Conv}_{s=2}\left(F_{l-1}^{RGB}\right)\right), \quad F_l^{IR} = \mathrm{C2f\_TDA}\left(\mathrm{Conv}_{s=2}\left(F_{l-1}^{IR}\right)\right)$$
where $\mathrm{Conv}_{s=2}$ denotes stride-2 convolution for spatial downsampling, and C2f_TDA applies feature extraction with tri-dimensional attention. The modules repeat strategically: 3 times at $P_2$/$P_5$ for efficiency, 6 times at $P_3$/$P_4$ for stronger mid-scale representations. Channel dimensions $C_l$ progress as $\{128, 256, 512, 1024\}$ at downsampling rates $\{4\times, 8\times, 16\times, 32\times\}$.
Asymmetric Recurrent Fusion: At each pyramid level $l$, features from both branches are concatenated channel-wise:
$$F_l^{fused} = \mathrm{Concat}\left(F_l^{RGB}, F_l^{IR}\right) \in \mathbb{R}^{H_l \times W_l \times 2C_l}$$
This operation doubles the feature channels from $C_l$ to $2C_l$. The key innovation lies in asymmetric propagation to the next level: one branch receives the fused features $F_l^{fused}$ (progressively accumulating cross-modal information), while the other receives single-modality features $F_l^{IR}$ (preserving modality-specific characteristics). This asymmetric assignment creates two heterogeneous information flows that balance cross-modal collaboration and feature independence throughout the hierarchical pyramid.
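To make the effect of asymmetric routing concrete, the sketch below counts the weights of only the stride-2 convolutions that open levels 3 to 5, comparing a layout where one branch takes the fused input and the other a single-modality input against a symmetric layout where both take the fused input. It is a bookkeeping illustration under stated assumptions: 3 × 3 kernels without bias, the nominal channel widths listed above, and hypothetical helper names. It does not reproduce the parameter totals reported in the experiments, which cover the whole network.

```python
# Nominal channel width C_l per pyramid level, as listed in the text.
CHANNELS = {2: 128, 3: 256, 4: 512, 5: 1024}


def conv_weights(c_in: int, c_out: int, k: int = 3) -> int:
    """Weight count of a k x k convolution without bias (assumed layout)."""
    return k * k * c_in * c_out


def downsample_conv_weights(asymmetric: bool) -> int:
    """Total weights of the stride-2 entry convolutions of levels 3-5, both branches."""
    total = 0
    for level in (3, 4, 5):
        c_prev, c_out = CHANNELS[level - 1], CHANNELS[level]
        fused_in = 2 * c_prev                              # Concat(F_rgb, F_ir)
        other_in = c_prev if asymmetric else fused_in      # single-modality vs fused
        total += conv_weights(fused_in, c_out) + conv_weights(other_in, c_out)
    return total


asym = downsample_conv_weights(asymmetric=True)
sym = downsample_conv_weights(asymmetric=False)
print(f"asymmetric / symmetric = {asym / sym:.2f}")  # 0.75
```

Under these assumptions the asymmetric layout needs three quarters of the entry-convolution weights of the symmetric one, because the two branches consume $2C_l + C_l$ input channels instead of $2C_l + 2C_l$.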
Rationale for IR-Preserving Asymmetric Design: As illustrated in Figure 1 (left, backbone section), we deliberately design the asymmetric fusion to preserve IR modality-specific features while allowing the RGB branch to accumulate fused information. This design choice is grounded in the following considerations:
(1) Modality Characteristic Asymmetry: RGB images provide high spatial resolution with rich texture and color details but are sensitive to illumination variations. IR images capture illumination-invariant thermal signatures but suffer from lower resolution and coarse texture. This inherent asymmetry suggests that treating both modalities symmetrically in fusion is suboptimal.
(2) Information Capacity and Robustness: RGB features, with higher information density, serve as suitable “carriers” for accumulating cross-modal information. Preserving IR in a single-path flow prevents “information dilution” and ensures thermal detection capability remains intact under degraded RGB conditions (e.g., darkness, fog).
(3) Computational Efficiency: The asymmetric design routes concatenated features ($2C_l$ channels) to only one branch while the other maintains single-modality channels ($C_l$ channels), reducing computational redundancy compared to symmetric architectures.
(4) Empirical Validation: Experimental results (detailed in Section 3) show that compared to YOLOv8-Symmetric, our asymmetric YOLOv8-ARFM achieves +1.9 pp mAP50 (90.7% vs. 88.8%) and +2.1 pp mAP50-95 (64.1% vs. 62.0%), while reducing parameters by 8.7% (4.75 M vs. 5.20 M) and GFLOPs by 11.7% (12.1 vs. 13.7). This validates that asymmetric IR-preserving design delivers measurable improvements in both accuracy and efficiency.
Multi-Scale Context Aggregation: At the final level $P_5/32$, SPPF applies cascaded 5 × 5 max pooling to capture multi-scale context, concatenating outputs at different receptive field scales before the final convolution. This expands the receptive field at low computational cost.
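The reason cascaded pooling is cheap can be seen in one dimension: applying a stride-1 max pool of width 5 twice gives the same result as a single pool of width 9, and three times the same as width 13. The sketch below demonstrates this on a made-up signal. Using three cascaded stages follows the reference YOLOv8 SPPF design and is an assumption here, since the text only states that the pooling is cascaded; max_pool_1d is a hypothetical helper.

```python
NEG_INF = float("-inf")


def max_pool_1d(values, k):
    """Stride-1 max pooling with 'same' padding; padded cells never win the max."""
    pad = k // 2
    padded = [NEG_INF] * pad + list(values) + [NEG_INF] * pad
    return [max(padded[i:i + k]) for i in range(len(values))]


signal = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]  # illustrative data

y1 = max_pool_1d(signal, 5)  # receptive field 5
y2 = max_pool_1d(y1, 5)      # receptive field 9
y3 = max_pool_1d(y2, 5)      # receptive field 13

# Cascading small pools reproduces the larger pools exactly.
assert y2 == max_pool_1d(signal, 9)
assert y3 == max_pool_1d(signal, 13)

# SPPF then stacks the input and the pooled outputs along the channel axis.
sppf_stack = [signal, y1, y2, y3]
print(len(sppf_stack), "maps of length", len(signal))
```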
The ARFM design achieves three advantages: (1) progressive cross-modal integration across hierarchical levels rather than single-point fusion; (2) modality preservation through asymmetric information paths; (3) computational efficiency by avoiding symmetric dual-encoder parameter redundancy.
Table 1 details the asymmetric routing mechanism at each pyramid level, specifying channel dimensions before and after fusion operations. At each level, the RGB branch receives concatenated features (doubled channels) while the IR branch preserves single-modality features, enabling progressive cross-modal integration while maintaining modality-specific characteristics. The C2f_TDA modules are applied with 3 repeats at the $P_2$/$P_5$ levels and 6 repeats at the $P_3$/$P_4$ levels. Weights are not shared across pyramid levels, allowing level-specific feature learning.

2.3. C2f_TDA Module with Tri-Dimensional Attention

The C2f_TDA module extends the efficient C2f architecture by integrating the Tri-Dimensional Attention (TDA) mechanism into its internal Triplet Bottleneck blocks. Figure 2 shows the internal structure of a single Triplet Bottleneck, the fundamental building block within C2f_TDA. The complete module follows the CSP (Cross-Stage Partial) design pattern, and each Triplet Bottleneck applies two convolutions followed by the TDA mechanism before the residual connection, enabling adaptive cross-modal feature enhancement.
C2f_TDA Overall Architecture: The complete C2f_TDA module follows the CSP design pattern. Input features $F_{in} \in \mathbb{R}^{H \times W \times C}$ first pass through a 1 × 1 convolution to expand channels to $2C$, then split into two equal parts:
$$F_{split} = \mathrm{Split}\left(\mathrm{Conv}_{1\times1}(F_{in})\right) \rightarrow \{F_{identity}, F_{transform}\}$$
The identity path $F_{identity}$ preserves original features for gradient flow, while the transform path $F_{transform}$ undergoes $n$ cascaded Triplet Bottleneck blocks (where $n$ varies by pyramid level: 3 at $P_2$/$P_5$, 6 at $P_3$/$P_4$). Each Triplet Bottleneck block (shown in Figure 2) applies two 3 × 3 convolutions followed by the TDA mechanism, with a residual connection when input and output channels match.
Triplet Bottleneck with Tri-Dimensional Attention: The core innovation lies in the Triplet Bottleneck’s TDA mechanism, which captures cross-dimensional dependencies through three orthogonal branches. As shown in Figure 2, given intermediate features $X \in \mathbb{R}^{C \times H \times W}$ after two convolutions, each branch processes a different dimensional permutation:
$$X_{CW} = \mathrm{Permute}_{(C,H,W) \rightarrow (H,C,W)}(X) \quad \text{(Channel-Width branch)}$$
$$X_{HC} = \mathrm{Permute}_{(C,H,W) \rightarrow (W,H,C)}(X) \quad \text{(Height-Channel branch)}$$
$$X_{HW} = X \quad \text{(Height-Width branch)}$$
Each permuted representation feeds into an AttentionGate module that generates dimension-specific attention maps. The AttentionGate employs Z-Pooling to aggregate spatial information:
$$Z(X) = \mathrm{Concat}\left[\mathrm{MaxPool}(X), \mathrm{AvgPool}(X)\right] \in \mathbb{R}^{2 \times H \times W}$$
where max and average pooling capture peak activation and average response, respectively. A 7 × 7 convolution then generates attention weights:
$$\alpha = \sigma\left(\mathrm{Conv}_{7\times7}(Z(X))\right)$$
where $\sigma$ denotes sigmoid activation. The three branches produce calibrated features $F_{CW}$, $F_{HC}$, and $F_{HW}$ after re-permuting to the original dimensions. The final output averages the three attention-weighted representations:
$$F_{TDA} = \frac{1}{3}\left(F_{CW} + F_{HC} + F_{HW}\right)$$
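The data flow of the three branches can be traced with a dependency-free sketch on nested lists. It follows the steps described above (permute, Z-pool over the leading axis, sigmoid gate, permute back, average), with one loudly labeled simplification: the learned 7 × 7 convolution is replaced by a fixed average of the max and mean maps, so the gate here carries no trained weights. All function names are illustrative, and this is not the authors' implementation.

```python
import math


def permute(x, order):
    """Reorder the axes of a 3-D nested list; new axis n is old axis order[n]."""
    shape = (len(x), len(x[0]), len(x[0][0]))
    d0, d1, d2 = (shape[a] for a in order)
    out = [[[0.0] * d2 for _ in range(d1)] for _ in range(d0)]
    for i in range(shape[0]):
        for j in range(shape[1]):
            for k in range(shape[2]):
                old = (i, j, k)
                a, b, c = (old[o] for o in order)
                out[a][b][c] = x[i][j][k]
    return out


def z_pool(x):
    """Max and mean over the leading axis, giving two (d1 x d2) maps."""
    d0, d1, d2 = len(x), len(x[0]), len(x[0][0])
    max_map = [[max(x[i][j][k] for i in range(d0)) for k in range(d2)] for j in range(d1)]
    avg_map = [[sum(x[i][j][k] for i in range(d0)) / d0 for k in range(d2)] for j in range(d1)]
    return max_map, avg_map


def attention_gate(x):
    """Scale x by a sigmoid weight map that is broadcast over the leading axis.

    ASSUMPTION: the learned 7x7 convolution is replaced by a fixed average of
    the two pooled maps, so this gate needs no trained weights.
    """
    max_map, avg_map = z_pool(x)
    d0, d1, d2 = len(x), len(x[0]), len(x[0][0])
    alpha = [[1.0 / (1.0 + math.exp(-0.5 * (max_map[j][k] + avg_map[j][k])))
              for k in range(d2)] for j in range(d1)]
    return [[[x[i][j][k] * alpha[j][k] for k in range(d2)] for j in range(d1)]
            for i in range(d0)]


def tda(x):
    """Average of the three gated branches, each permuted back to (C, H, W)."""
    cw = permute(attention_gate(permute(x, (1, 0, 2))), (1, 0, 2))  # (H, C, W) view
    hc = permute(attention_gate(permute(x, (2, 1, 0))), (2, 1, 0))  # (W, H, C) view
    hw = attention_gate(x)                                          # (C, H, W) view
    C, H, W = len(x), len(x[0]), len(x[0][0])
    return [[[(cw[c][h][w] + hc[c][h][w] + hw[c][h][w]) / 3.0 for w in range(W)]
             for h in range(H)] for c in range(C)]


x = [[[float(c + h + w) for w in range(4)] for h in range(3)] for c in range(2)]
y = tda(x)
print(len(y), len(y[0]), len(y[0][0]))  # 2 3 4
```

Both permutations used here swap two axes, so each is its own inverse; that is why the same order tuple restores the (C, H, W) layout after gating.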
Feature Aggregation and Output: After $n$ Triplet Bottleneck blocks, the transformed features are concatenated with the identity path and all intermediate outputs, then pass through a final 1 × 1 convolution:
$$F_{out} = \mathrm{Conv}_{1\times1}\left(\mathrm{Concat}\left[F_{identity}, F_{bottle}^{1}, \ldots, F_{bottle}^{n}\right]\right)$$
The C2f_TDA design achieves three critical advantages: (1) orthogonal dimension modeling captures complementary cross-dimensional dependencies simultaneously; (2) lightweight attention gates introduce negligible parameters while significantly improving discriminability; (3) CSP-based feature reuse maintains computational efficiency suitable for real-time UAV applications. By recalibrating features across channel, height, and width dimensions, C2f_TDA effectively enhances salient cross-modal patterns while suppressing redundant information.

2.4. Multi-Scale Adaptive Feature Pyramid Network

The Multi-scale Adaptive Feature Pyramid Network (MAFPN) addresses two fundamental limitations of standard FPN: (1) unidirectional top-down information flow restricts low-to-high feature feedback, limiting fine-grained detail propagation; (2) equal-weight fusion across scales ignores target scale priors and feature importance differences. MAFPN constructs enhanced multi-scale representations through bidirectional information flow and multi-path adaptive aggregation.
Backbone Feature Extraction and Channel Reduction: The backbone extracts multi-scale features at four pyramid levels: $P_2/4$ (layer 9, 256 × 256 × 128 channels), $P_3/8$ (layer 14, 128 × 128 × 256), $P_4/16$ (layer 19, 64 × 64 × 512), and $P_5/32$ (layer 25, 32 × 32 × 1024). To reduce complexity, 3 × 3 convolutions compress $P_2$, $P_3$, and $P_4$ to half their channels at layers 34, 30, and 26, denoted as $F_{P_2}$, $F_{P_3}$, and $F_{P_4}$.
Top-Down Pathway with Multi-Path Fusion: The top-down pathway propagates high-level semantic information from coarse to fine scales through multi-path aggregation. At layer 28, the $P_5$ features fuse with channel-reduced $P_4$ features:
$$F_{P_5,28} = \mathrm{C2f}\left(\mathrm{Concat}\left(F_{P_4}, F_{P_5,25}\right)\right)$$
The enhanced $P_5$ features at layer 28 are then upsampled and aggregated with $P_4$ through three-path fusion at layer 32:
$$F_{P_4,32} = \mathrm{C2f}\left(\mathrm{Concat}\left(F_{P_3}, \mathrm{Upsample}(F_{P_5,28}), F_{P_4,19}\right)\right)$$
This three-path strategy simultaneously utilizes high-level semantics from upsampled $P_5$ (providing global context), mid-level representations from reduced $P_3$ (adjacent scale information), and backbone features from original $P_4$ (preserving spatial details). Similarly, $P_3$ features at layer 36 aggregate information from three sources:
$$F_{P_3,36} = \mathrm{C2f}\left(\mathrm{Concat}\left(F_{P_2}, \mathrm{Upsample}(F_{P_4,32}), F_{P_3,14}\right)\right)$$
An additional enhancement path at layer 38 further refines $P_3$ by concatenating upsampled $P_4$ features with the layer 36 $P_3$ output, processed through a C2f module to generate the final top-down $P_3$ representation $F_{P_3,38}$.
Bottom-Up Pathway with Multi-Source Aggregation: The bottom-up pathway enhances features by propagating fine-grained details from small to large scales. At layer 42, $P_4$ features aggregate information from four distinct sources:
$$F_{P_4,42} = \mathrm{C2f}\left(\mathrm{Concat}\left(\mathrm{Conv}_{s=2}(F_{P_3,38}), \mathrm{Conv}_{s=2}(F_{P_3,36}), F_{P_4,32}, \mathrm{Upsample}(F_{P_5,28})\right)\right)$$
where two downsampled $P_3$ branches (from layers 36 and 38 via stride-2 convolutions) provide detail information, intermediate $P_4$ features from layer 32 preserve mid-level representations, and upsampled $P_5$ features inject high-level semantic context. This four-path aggregation ensures comprehensive information fusion across scales. Similarly, $P_5$ features at layer 46 aggregate multi-path information to generate the final output $F_{P_5,46} \in \mathbb{R}^{32 \times 32 \times 1024}$ for large object detection.
Channel Attention Mechanism: After each fusion operation, Squeeze-and-Excitation (SE) modules adaptively recalibrate channel-wise feature responses:
$$\tilde{F}_i = \sigma\left(W_2 \cdot \mathrm{ReLU}\left(W_1 \cdot \mathrm{GAP}(F_i)\right)\right) \odot F_i$$
where $\mathrm{GAP}(\cdot)$ performs global average pooling for spatial compression, $W_1$ and $W_2$ denote channel reduction and expansion weights in the two-layer fully-connected bottleneck, $\sigma(\cdot)$ applies sigmoid activation to generate adaptive importance weights in the $[0, 1]$ range, and $\odot$ denotes channel-wise multiplication. This scale-aware dynamic weighting mechanism emphasizes discriminative channels while suppressing redundant features, particularly benefiting small object detection.
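The squeeze-and-excitation recalibration above can be written out step by step on a tiny tensor. The sketch below uses nested lists and hand-picked weights so that it runs without any library; the weight values, the reduction from 4 to 2 channels, and the function name squeeze_excite are all hypothetical and chosen only for illustration.

```python
import math


def squeeze_excite(feature, w1, w2):
    """Apply SE recalibration to a C x H x W nested list.

    w1 has shape (C/r) x C (channel reduction); w2 has shape C x (C/r) (expansion).
    Returns the rescaled feature and the per-channel gates.
    """
    # Squeeze: global average pooling, one scalar per channel.
    gap = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]
    # Excitation: FC -> ReLU -> FC -> sigmoid.
    hidden = [max(0.0, sum(w * g for w, g in zip(row, gap))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every value of channel c by its gate.
    scaled = [[[gates[c] * v for v in row] for row in feature[c]]
              for c in range(len(feature))]
    return scaled, gates


feature = [[[1.0, 2.0], [3.0, 4.0]],
           [[0.5, 0.5], [0.5, 0.5]],
           [[2.0, 0.0], [0.0, 2.0]],
           [[1.0, 1.0], [1.0, 1.0]]]
w1 = [[0.1, 0.2, -0.1, 0.0], [0.0, -0.2, 0.3, 0.1]]          # hypothetical weights
w2 = [[0.5, -0.5], [1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]      # hypothetical weights

scaled, gates = squeeze_excite(feature, w1, w2)
print([round(g, 3) for g in gates])
```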
MAFPN provides three critical advantages for multimodal detection: (1) bidirectional pathways propagate both semantic context and spatial details across scales, enabling effective fusion of complementary RGB-IR features at multiple hierarchical levels; (2) multi-path aggregation adaptively combines cross-modal features from different pyramid levels, ensuring each detection scale leverages optimal multimodal information; (3) SE attention recalibrates fused multimodal channels, emphasizing discriminative RGB-IR patterns while suppressing redundant cross-modal responses. Integrated with ARFM’s asymmetric fusion and C2f_TDA’s tri-dimensional recalibration, MAFPN constructs enriched multi-scale representations that significantly improve small vehicle detection in complex UAV scenarios.

3. Experimental Results

3.1. Experimental Settings and Dataset Description

Experiments were performed on a workstation with an NVIDIA GeForce RTX 2080 Ti GPU (22 GB memory) using PyTorch 2.2.2, CUDA 11.8, Python 3.12.11, and Ultralytics YOLOv8 framework (v8.1.9). For input resolution, VEDAI experiments (both ablation studies and comparative experiments) use 1024 × 1024 to fully exploit the dataset’s native high-resolution aerial imagery, while DroneVehicle experiments adopt 640 × 640 resolution to ensure fair comparison with mainstream methods in the literature.
Training Configuration: Training ran for 100 epochs with batch size 8 and the SGD optimizer. The learning rate schedule uses an initial learning rate lr0 = 0.01 and a final learning rate factor lrf = 0.01 (in the Ultralytics framework the final learning rate is lr0 × lrf), with momentum 0.937 and weight decay 0.0005. The warmup strategy spans 3 epochs with warmup momentum 0.8 and warmup bias learning rate 0.1. The nominal batch size (nbs) is set to 64 for gradient accumulation scaling. Mixed precision training (AMP) is disabled for reproducibility. The random seed is fixed at 0 with deterministic mode enabled to ensure reproducible results.
Loss Function Configuration: The multi-task loss function combines three components with the following weights: bounding box regression loss weight $\lambda_{box} = 7.5$, classification loss weight $\lambda_{cls} = 0.5$, and distribution focal loss (DFL) weight $\lambda_{dfl} = 1.5$. Label smoothing is set to 0.0 (disabled). No class-specific weighting is applied, treating all vehicle categories equally during training.
Inference Configuration: During inference, IoU threshold for Non-Maximum Suppression (NMS) is set to 0.7, with maximum detections per image limited to 300. Confidence threshold is automatically determined by the framework. For evaluation, mAP is computed at IoU threshold 0.5 (mAP50) and averaged over IoU thresholds from 0.5 to 0.95 with step 0.05 (mAP50-95), following COCO evaluation protocol.
Data Augmentation: Data augmentation encompasses HSV color adjustments (hue shift ±0.015, saturation scale 0.7, value scale 0.4), random translation (scale 0.1)/scaling (scale 0.5), Mosaic augmentation (probability 1.0), RandAugment (auto_augment policy), random erasing (probability 0.4), and horizontal flip (probability 0.5). MixUp and CopyPaste augmentations are disabled (probability 0.0). Rotation, shear, and perspective transformations are not applied (set to 0.0). All augmentations are synchronously applied to both RGB and infrared modalities to ensure spatial consistency.
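For convenience, the settings stated in the training, loss, inference, and augmentation paragraphs above can be gathered into one dictionary. The key names follow Ultralytics argument conventions and should be checked against the installed framework version; they are an assumption, not a configuration file released with the paper. The two derived quantities also rest on Ultralytics conventions taken as assumptions: lrf is read as a fraction of lr0, and the accumulation step count is the nominal batch size divided by the actual batch size.

```python
# Settings as stated in the text; key names assumed to match Ultralytics arguments.
train_cfg = dict(
    epochs=100, batch=8, optimizer="SGD",
    imgsz=1024,                      # VEDAI; DroneVehicle experiments use 640
    lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005,
    warmup_epochs=3, warmup_momentum=0.8, warmup_bias_lr=0.1,
    nbs=64, amp=False, seed=0, deterministic=True,
    box=7.5, cls=0.5, dfl=1.5, label_smoothing=0.0,
    iou=0.7, max_det=300,
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, translate=0.1, scale=0.5,
    mosaic=1.0, auto_augment="randaugment", erasing=0.4, fliplr=0.5,
    mixup=0.0, copy_paste=0.0, degrees=0.0, shear=0.0, perspective=0.0,
)

# ASSUMPTION: final learning rate = lr0 * lrf.
final_lr = train_cfg["lr0"] * train_cfg["lrf"]

# ASSUMPTION: gradients are accumulated until the nominal batch size is reached.
accumulate = max(round(train_cfg["nbs"] / train_cfg["batch"]), 1)

print(f"final learning rate: {final_lr:.0e}")
print(f"accumulation steps:  {accumulate}")
```

Under these assumptions the schedule decays from 0.01 to 0.0001, and each optimizer step aggregates 8 batches of 8 images.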
This study employs two UAV aerial vehicle detection benchmark datasets for comprehensive experimental validation: VEDAI [35] serves as the primary evaluation benchmark for ablation studies, component analysis, and comparative experiments, while DroneVehicle [18] provides cross-dataset generalization assessment.
Table 2 summarizes the complete training hyperparameters used in our experiments.
Table 3 details the characteristics of both VEDAI and DroneVehicle datasets used in our experimental validation.

3.2. Evaluation Metrics

Primary metrics:
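Precision:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
Mean Average Precision (mAP):
$$\mathrm{mAP} = \frac{1}{K} \sum_{k=1}^{K} AP_k$$
where $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, $K$ is the number of object classes, and $AP_k$ is the average precision of class $k$. mAP@[0.5:0.95] averages AP over IoU thresholds $\{0.50, 0.55, \ldots, 0.95\}$. Additional metrics include Params (learnable parameters in millions) and GFLOPs (floating-point operations for computational cost assessment).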
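The metric definitions above translate directly into code. The following sketch implements them as plain functions; the counts and AP values in the usage lines are made-up numbers for illustration only and are not results from the paper.

```python
def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP); defined as 0.0 when there are no predictions."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """TP / (TP + FN); defined as 0.0 when there are no ground-truth objects."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def mean_ap(ap_per_class) -> float:
    """Average of per-class AP values at a single IoU threshold."""
    return sum(ap_per_class) / len(ap_per_class)


# IoU thresholds 0.50, 0.55, ..., 0.95 used by mAP50-95.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def map_50_95(ap_table) -> float:
    """ap_table[t][k] is the AP of class k at IOU_THRESHOLDS[t]."""
    return sum(mean_ap(row) for row in ap_table) / len(ap_table)


# Hypothetical counts and AP values, for illustration only.
print(precision(tp=90, fp=10))   # 0.9
print(recall(tp=90, fn=30))      # 0.75
print(mean_ap([0.5, 1.0]))       # 0.75
```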

3.3. Ablation Experiments

Comprehensive ablation experiments verify (1) the necessity of multimodal fusion over single-modality detection and (2) the independent contributions and synergy of TDA and MAFPN via incremental addition. All ablation experiments use the VEDAI dataset with 1024 × 1024 input resolution to fully leverage high-resolution aerial imagery details and validate component effectiveness under optimal conditions. Table 4 compares single-modality (IR/RGB) detection with multimodal ARFM fusion on the YOLOv8 baseline.
YOLOv8-Symmetric denotes a symmetric dual-encoder fusion architecture where both RGB and IR branches receive concatenated features ($2C_l$ channels) at each fusion node, treating both modalities equally throughout the network hierarchy. In contrast, our YOLOv8-ARFM employs asymmetric fusion where only the RGB branch accumulates fused features while the IR branch preserves single-modality characteristics.
As shown in Table 4, YOLOv8-ARFM with RGB + IR multimodal fusion achieves 90.7% mAP50 and 64.1% mAP50-95, significantly outperforming both single-modality baselines and the symmetric fusion approach. Compared to YOLOv8 RGB (85.7% mAP50, 58.5% mAP50-95), YOLOv8-ARFM improves +5.0 pp mAP50 and +5.6 pp mAP50-95. Compared to YOLOv8 IR (83.2% mAP50, 56.5% mAP50-95), the improvement reaches +7.5 pp mAP50 and +7.6 pp mAP50-95. Notably, compared to YOLOv8-Symmetric (88.8% mAP50, 62.0% mAP50-95), our asymmetric ARFM design achieves +1.9 pp mAP50 and +2.1 pp mAP50-95 while using fewer parameters (4.75 M vs. 5.20 M, −8.7%) and lower computational cost (12.1 vs. 13.7 GFLOPs, −11.7%). This validates the effectiveness of asymmetric fusion over symmetric approaches: preserving IR modality-specific features while allowing RGB to accumulate cross-modal information achieves better accuracy–efficiency trade-off than treating both modalities equally.
Figure 3 compares detection results across five challenging aerial scenarios: urban arterial roads (column 1), parking lots (column 2), agricultural farmland (column 3), construction sites with unpaved terrain (column 4), and winding country paths (column 5).
The RGB modality (b) achieves reliable detection in high-contrast scenarios (columns 2, 4–5) where vehicles exhibit distinct color features against pavement or terrain backgrounds. However, RGB completely fails in low-contrast urban roads (column 1) where earth-tone vehicles merge with surrounding soil backgrounds, missing vehicles that are clearly visible in ground truth (a). RGB also shows degraded performance in agricultural scenes (column 3) with lower confidence scores. The infrared modality (c) provides grayscale thermal signatures independent of surface appearance, successfully detecting vehicles in challenging low-contrast scenes where RGB fails (column 1). Nevertheless, IR suffers from reduced spatial precision due to thermal blurring effects and lower resolution, resulting in imprecise bounding box localization and lower confidence scores across multiple scenarios.
Our ATM-Net (d) demonstrates superior robustness by synergistically fusing complementary cues: it maintains RGB’s sharp localization in favorable lighting while compensating with IR’s thermal contrast in adverse conditions. Notably, ATM-Net achieves consistent detection across all terrain types with significantly elevated confidence scores (0.75–0.97 range), validating the effectiveness of asymmetric recurrent fusion for handling diverse UAV surveillance scenarios.
To further analyze the independent contributions and synergies of core modules, Table 5 evaluates TDA and MAFPN through incremental addition.
As shown in Table 5, both TDA and MAFPN independently contribute to performance improvement over the baseline YOLOv8-ARFM (90.7% mAP50, 64.1% mAP50-95). Adding TDA alone (YOLOv8-ARFM+TDA) improves mAP50 to 91.7% (+1.0 pp) and mAP50-95 to 64.5% (+0.4 pp), with notably improved recall from 0.756 to 0.828 (+9.5%), indicating TDA’s effectiveness in capturing more true positives through tri-dimensional attention recalibration. Adding MAFPN alone (YOLOv8-ARFM+MAFPN) achieves similar gains with mAP50 of 91.6% (+0.9 pp) and mAP50-95 of 64.5% (+0.4 pp), while improving recall to 0.811 (+7.3%). The complete ATM-Net combining both modules achieves the best performance: 92.4% mAP50 (+1.7 pp over baseline) and 64.7% mAP50-95 (+0.6 pp), with balanced precision (0.913) and recall (0.816). The computational cost remains efficient: ATM-Net requires only 4.83 M parameters and 13.0 GFLOPs, representing minimal overhead (+1.7% parameters, +7.4% GFLOPs) compared to the baseline.
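The paragraph above mixes two kinds of change: mAP gains are stated in percentage points (absolute differences), whereas the recall, parameter, and GFLOPs changes are relative percentages. The sketch below recomputes both from the values quoted in the text; the helper names are illustrative.

```python
def pp_gain(new_pct: float, old_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return round(new_pct - old_pct, 1)


def relative_change(new: float, old: float) -> float:
    """Relative change of `new` with respect to `old`, in percent."""
    return round(100.0 * (new - old) / old, 1)


# mAP gains over the YOLOv8-ARFM baseline (percentage points).
print(pp_gain(92.4, 90.7))             # 1.7  (ATM-Net, mAP50)
print(pp_gain(64.7, 64.1))             # 0.6  (ATM-Net, mAP50-95)

# Recall changes are relative to the baseline recall of 0.756.
print(relative_change(0.828, 0.756))   # 9.5  (+TDA)
print(relative_change(0.811, 0.756))   # 7.3  (+MAFPN)

# Overhead of the full model relative to the baseline.
print(relative_change(4.83, 4.75))     # 1.7  (parameters, M)
print(relative_change(13.0, 12.1))     # 7.4  (GFLOPs)
```

In absolute terms, the recall gain from adding TDA is 0.072, or 7.2 percentage points, which is smaller than the relative figure of 9.5%.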
Inference speed (FPS) is measured on Huawei Atlas AIpro-20T mobile GPU with batch size 1 and 1024 × 1024 input resolution. The baseline achieves 38.9 FPS, while TDA slightly improves inference speed to 39.6 FPS (+1.8%), suggesting that TDA’s attention-based feature recalibration helps optimize the feature flow and computational efficiency. MAFPN reduces speed to 36.3 FPS (−6.7%) due to additional multi-path feature aggregation operations. The complete ATM-Net achieves 35.9 FPS (−7.7%), demonstrating acceptable computational overhead on edge computing platforms. All configurations maintain real-time inference capability (>30 FPS), demonstrating ATM-Net’s suitability for practical UAV deployment scenarios on resource-constrained mobile GPU platforms.
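As a rough illustration of how a batch-size-1 throughput figure of this kind can be obtained, the sketch below times repeated calls to an inference function and converts the result to frames per second. The warm-up count, iteration count, and the `dummy_infer` stand-in are assumptions made for the example; the text does not describe the authors' actual measurement script.

```python
import time


def measure_fps(infer, sample, warmup=10, iters=100):
    """Average frames per second of `infer` on a single input (batch size 1)."""
    for _ in range(warmup):            # warm-up calls are excluded from timing
        infer(sample)
    start = time.perf_counter()
    for _ in range(iters):
        infer(sample)
    elapsed = time.perf_counter() - start
    return iters / elapsed


def rel_change(new, base):
    """Relative change with respect to the baseline, in percent."""
    return round(100.0 * (new - base) / base, 1)


def dummy_infer(x):
    """Hypothetical stand-in workload; replace with a real model call."""
    return sum(v * v for v in x)


fps = measure_fps(dummy_infer, list(range(1000)))
print("stand-in FPS:", round(fps, 1))

# Relative speed changes recomputed from the FPS values quoted in the text.
print(rel_change(39.6, 38.9), rel_change(36.3, 38.9), rel_change(35.9, 38.9))
```

On an accelerator that executes asynchronously, the device would typically need to be synchronized before each clock reading; that step is omitted here because it depends on the runtime in use.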
Figure 4 visualizes incremental module contributions across two representative scenarios: a parking lot scene (row 1) and an urban road scene (row 2). Comparing against ground truth (a), the baseline YOLOv8-ARFM (b) exhibits suboptimal performance with relatively low confidence scores (0.50 and 0.47), indicating insufficient feature discrimination in multimodal fusion. Adding MAFPN (c) substantially improves confidence to 0.70 and 0.72 across both scenarios, demonstrating the effectiveness of multi-scale adaptive feature aggregation in capturing vehicles at different spatial resolutions. Introducing TDA attention alone (d) shows uneven improvements: enhanced performance in the parking lot (0.73) but limited gains in the urban road scene (0.54), though still improving over baseline. This suggests MAFPN provides more consistent cross-scenario benefits while TDA offers stronger gains in spatially dense scenarios. The complete ATM-Net (e) combining both MAFPN and TDA achieves the highest confidence scores (0.81 and 0.85) across both scenarios, validating their complementary synergy—MAFPN provides robust multi-scale representations while TDA refines feature recalibration through tri-dimensional attention, collectively enabling superior multimodal vehicle detection.
To provide comprehensive evaluation metrics and address the need for per-class analysis, confidence calibration, and statistical rigor, we present additional diagnostic visualizations in Figure 5.

3.4. Comparative Experiments

To comprehensively evaluate ATM-Net’s performance and generalization capability, we conduct comparative experiments against state-of-the-art single-modality and multimodal fusion methods on both VEDAI and DroneVehicle datasets. VEDAI experiments use 1024 × 1024 input resolution to fully exploit the dataset’s native high-resolution imagery, while DroneVehicle experiments adopt 640 × 640 resolution for fair comparison with mainstream methods. All experiments use consistent training configurations: 100 epochs, batch size 8, SGD optimizer with momentum 0.9, and initial learning rate 0.01.
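The shared training settings and the per-dataset input resolutions described above can be summarized as a small configuration sketch. The values are taken from the text; the key names and the `make_config` helper are illustrative assumptions and do not correspond to any particular training framework.

```python
# Settings quoted in the text; key names are illustrative assumptions.
COMMON = {
    "epochs": 100,
    "batch_size": 8,
    "optimizer": "SGD",
    "momentum": 0.9,
    "initial_lr": 0.01,
}

# Square input resolution per dataset, as stated in the text.
INPUT_SIZE = {"VEDAI": 1024, "DroneVehicle": 640}


def make_config(dataset):
    """Merge the shared settings with the dataset-specific input size."""
    cfg = dict(COMMON)
    cfg["dataset"] = dataset
    cfg["input_size"] = (INPUT_SIZE[dataset], INPUT_SIZE[dataset])
    return cfg


for name in INPUT_SIZE:
    print(make_config(name))
```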
Table 6 presents quantitative comparisons on the VEDAI dataset, including seven single-modality RGB methods (YOLOv5n/s, YOLO-S, SuperDet, DS-YOLOv8, YOLOv8n, YOLOv10n), two single-modality Thermal methods (SPD-YOLOv8, DBD-YOLOv8), and eight multimodal RGB+Thermal fusion methods (LW-CNN, CMCA, EMCF, CMAFF, SuperYOLO, ICAFusion, GHOST, Multispectral DETR).
ATM-Net achieves 92.4% mAP50 and 64.7% mAP50-95 with only 4.83 M parameters, demonstrating superior parameter efficiency and detection accuracy. Compared to large-scale multimodal methods, ATM-Net significantly outperforms Multispectral DETR (+9.7 pp mAP50, +13.9 pp mAP50-95) while using 15.1× fewer parameters (4.83 M vs. 73.0 M), and surpasses GHOST (+12.1 pp mAP50, +15.7 pp mAP50-95) with 2.0× fewer parameters. Compared to lightweight multimodal methods, ATM-Net outperforms SuperYOLO by +17.3 pp (mAP50) while using 1.45× fewer parameters, surpasses LRFL-YOLO by +17.9 pp with 2.15× fewer parameters, and exceeds FGMF by +26.1 pp with 1.76× fewer parameters. Against the best RGB single-modality method DS-YOLOv8 (76.9% mAP50), ATM-Net gains +15.5 pp, and against the best Thermal single-modality method DBD-YOLOv8 (76.0% mAP50), ATM-Net gains +16.4 pp, demonstrating effective RGB-IR asymmetric fusion. Versus the baseline YOLOv8n, ATM-Net achieves +23.8 pp mAP50 and +15.5 pp mAP50-95 improvements with only +60.8% additional parameters.
Table 7 provides per-class performance on VEDAI’s eight categories.
ATM-Net achieves 92.4% mean performance and leads on six categories: car (96.0%), truck (93.8%), pickup (95.0%), tractor (95.2%), camping car (95.9%), and van (99.5%). Among the compared methods in this table, ATM-Net demonstrates strong performance particularly in vehicle subtypes. Compared to YOLOv8n, ATM-Net shows substantial improvements in challenging categories—+22.5% for van and +12.8% for truck—demonstrating the value of RGB-IR complementarity in detecting vehicles with low thermal contrast or visual ambiguity.
Figure 6 visualizes qualitative comparisons across seven methods. Single-modality methods (YOLOv5n, YOLOv8n) exhibit missed/false detections due to single-sensor limitations. Multimodal methods show progressive improvements: ARSOD-YOLO and ICAFusion provide enhanced detection over single-modality baselines and FGMF achieves better fusion performance, while ATM-Net demonstrates the highest completeness and accuracy, especially in challenging dense parking lots and complex road scenarios, validating the effectiveness of asymmetric recurrent multimodal fusion.
To further validate cross-dataset generalization capability, we extend comparative experiments to the large-scale DroneVehicle dataset, which provides complementary evaluation with over 28,000 RGB-IR image pairs captured across diverse urban scenarios under varying day–night illumination conditions. This dataset presents additional challenges including higher scene complexity, increased vehicle density, and more severe occlusion compared to VEDAI, making it an ideal testbed for evaluating multimodal fusion robustness in real-world UAV deployment scenarios.
Table 8 presents comprehensive per-class performance comparisons across five vehicle categories: Car, Freight Car, Truck, Bus, and Van. The evaluation encompasses three modality configurations: RGB single-modality methods (Faster R-CNN, RoI Trans), IR single-modality methods (Faster R-CNN, RoI Trans), and eight state-of-the-art RGB+IR multimodal fusion approaches spanning different architectural paradigms, namely early fusion (CIAN), attention-based fusion (AR-CNN, UA-CMDet, TSFADet, CALNet), transformer-based fusion (C2Former), end-to-end fusion (E2E-MFD), and dual-stream fusion (DMM).
ATM-Net achieves the highest overall mAP of 83.7% on DroneVehicle, demonstrating superior cross-dataset generalization. The single-modality baselines lag well behind: RGB methods (Faster R-CNN: 55.9%, RoI Trans: 61.6%) and IR methods (Faster R-CNN: 64.2%, RoI Trans: 65.5%) are held back by modality-specific limitations, since RGB suffers from poor night-time performance while IR lacks spatial detail for small vehicle discrimination.
Among multimodal fusion methods, ATM-Net outperforms the previous best DMM by +4.4 pp (mAP) while using only 5.5% of its parameters (4.83 M vs. 88.0 M), achieving 18.2× parameter efficiency. Compared to recent transformer-based methods, ATM-Net surpasses C2Former (+9.5 pp) with 20.9× fewer parameters, and exceeds TSFADet (+10.6 pp) with 21.7× fewer parameters. This substantial parameter efficiency advantage validates the effectiveness of asymmetric recurrent fusion architecture for lightweight multimodal vehicle detection.
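The parameter-efficiency figures can be checked directly from the parameter counts quoted in the text (4.83 M for ATM-Net, 88.0 M for DMM, 73.0 M for Multispectral DETR). The helper names below are illustrative assumptions.

```python
def param_ratio(ours_m, other_m):
    """How many times more parameters the other model has (both in millions)."""
    return round(other_m / ours_m, 1)


def param_share(ours_m, other_m):
    """Our parameter count as a percentage of the other model's."""
    return round(100.0 * ours_m / other_m, 1)


ATM_NET_M = 4.83  # parameter count quoted in the text, in millions

print("vs. DMM:", param_ratio(ATM_NET_M, 88.0), "x,", param_share(ATM_NET_M, 88.0), "%")
print("vs. Multispectral DETR:", param_ratio(ATM_NET_M, 73.0), "x")
```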
Per-class analysis reveals ATM-Net’s balanced performance across all vehicle categories: it achieves the highest scores in Freight Car (74.7%), Bus (90.3%), and Van (67.1%), ties with DMM for Car (90.5%), and maintains competitive Truck detection (78.5%, only 0.8 pp below E2E-MFD). Notably, ATM-Net demonstrates particular strength in detecting the challenging Freight Car and Van categories, where precise multimodal feature alignment is critical, improving on DMM by +1.5 pp in Freight Car and +2.0 pp in Van. The consistently high performance across diverse vehicle types (ranging from large buses to small vans) validates the robustness of the tri-dimensional attention and multi-scale adaptive fusion mechanisms in handling the large scale variations across vehicle types inherent to UAV aerial vehicle detection.
Figure 7 provides qualitative visualization comparing ground truth annotations (row a) with ATM-Net’s detection results (row b) across six representative scenarios from the DroneVehicle dataset. The scenarios include densely packed urban intersections (columns 1–2), tree-occluded road intersection (column 3), open highway segment (column 4), large-scale parking facility with hundreds of vehicles (column 5), and nighttime residential area (column 6). Overall, ATM-Net demonstrates strong detection performance with high completeness and accurate localization across most scenarios. However, blue arrows highlight several misdetection cases that reveal current limitations: in columns 1, 2, and 5, false positives occur where non-vehicle objects (such as road markings, shadows, or ground textures) are incorrectly classified as vehicles, indicating challenges in distinguishing vehicles from visually similar background elements in complex urban environments. These failure cases suggest that while ATM-Net achieves competitive overall performance (83.7% mAP), further improvements in background discrimination and handling of visual distractors would enhance robustness for practical deployment in diverse real-world UAV scenarios.

4. Discussion

4.1. Experimental Validation and Performance Analysis

Comprehensive experiments on VEDAI and DroneVehicle datasets validate ATM-Net’s effectiveness in addressing fundamental UAV-based vehicle detection challenges. The proposed architecture achieves state-of-the-art performance through three synergistic components: Asymmetric Recurrent Fusion Module (ARFM), Tri-Dimensional Attention (TDA), and Multi-scale Adaptive Feature Pyramid Network (MAFPN).
On the VEDAI dataset, ATM-Net achieves 92.4% mAP50 and 64.7% mAP50-95 at 1024 × 1024 resolution. Among multimodal approaches, it outperforms Multispectral DETR (+9.7 pp mAP50: 92.4% vs. 82.7%; +13.9 pp mAP50-95: 64.7% vs. 50.8%) with far fewer parameters (4.83 M vs. 73.0 M, a 15.1× reduction). Compared to the single-modality YOLOv8n baselines, ATM-Net improves mAP50 by +6.7 pp over RGB (92.4% vs. 85.7%) and by +9.2 pp over IR (92.4% vs. 83.2%), supporting the need for effective RGB-IR fusion in robust all-weather vehicle detection. Ablation studies further confirm the individual and complementary contributions of each module: ARFM provides the foundation for asymmetric dual-stream fusion (90.7% mAP50, 64.1% mAP50-95), MAFPN adds multi-scale adaptive aggregation (+0.9 pp mAP50, +0.4 pp mAP50-95), TDA contributes orthogonal feature recalibration (+1.0 pp mAP50, +0.4 pp mAP50-95), and their combination yields the best overall result (92.4% mAP50, 64.7% mAP50-95).
On the DroneVehicle dataset, ATM-Net demonstrates superior cross-dataset generalization with 83.7% mAP, improving over DMM by +4.4 pp (from 79.3% to 83.7%) while using only 5.5% of its parameters (4.83 M vs. 88.0 M, achieving 18.2× parameter efficiency). This exceptional parameter efficiency is particularly valuable for resource-constrained UAV deployment, where payload capacity and power consumption are critical constraints. Per-class analysis reveals ATM-Net’s balanced performance across diverse vehicle categories: it achieves high scores in Freight Car (74.7%), Bus (90.3%), and Van (67.1%), demonstrating particular strength in detecting challenging small vehicle categories where precise multimodal feature alignment is critical. The consistent high performance across both datasets (VEDAI: small-scale, agricultural/suburban scenarios; DroneVehicle: large-scale, complex urban scenarios) validates the robustness and generalization capability of the asymmetric recurrent fusion architecture.
The architectural design choices prove effective: (1) ARFM’s asymmetric dual-stream structure balances modality independence and cross-modal collaboration, with one stream accumulating complementary information while the other preserves modality-specific features; (2) TDA recalibrates features across three orthogonal dimensions (Channel-Width, Height-Channel, Height-Width) within the C2f_TDA block, capturing comprehensive spatial-channel dependencies; (3) MAFPN enhances small-object detection through three-path aggregation (top-down, bottom-up, and direct connections) with bidirectional information flow, addressing the scale variation challenge inherent to UAV aerial imagery.
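To make the three-branch recalibration idea in point (2) concrete, the sketch below gates a small C × H × W feature tensor along each of the three plane orientations and averages the results. It is a deliberately simplified, dependency-free illustration and not the authors' TDA implementation: the learned convolution that normally sits inside each attention gate is replaced by a fixed average of the max-pooled and mean-pooled values, and the function name `tri_dimensional_gate` is hypothetical.

```python
import math


def _gate(values):
    """Sigmoid of the average of max- and mean-pooled values (stand-in for a learned conv)."""
    pooled = 0.5 * (max(values) + sum(values) / len(values))
    return 1.0 / (1.0 + math.exp(-pooled))


def tri_dimensional_gate(x):
    """Recalibrate x[c][h][w] with three gates, one per plane orientation."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    # H-W branch: pool over channels, one weight per spatial location.
    g_hw = [[_gate([x[c][h][w] for c in range(C)]) for w in range(W)] for h in range(H)]
    # C-W branch: pool over height, one weight per (channel, column).
    g_cw = [[_gate([x[c][h][w] for h in range(H)]) for w in range(W)] for c in range(C)]
    # H-C branch: pool over width, one weight per (row, channel).
    g_hc = [[_gate([x[c][h][w] for w in range(W)]) for c in range(C)] for h in range(H)]
    # Average the three recalibrated tensors.
    return [
        [
            [x[c][h][w] * (g_hw[h][w] + g_cw[c][w] + g_hc[h][c]) / 3.0 for w in range(W)]
            for h in range(H)
        ]
        for c in range(C)
    ]


# Toy tensor with 2 channels, 2 rows, 3 columns (made-up values).
features = [
    [[0.5, 1.0, -0.5], [2.0, 0.0, 1.5]],
    [[-1.0, 0.5, 0.0], [1.0, 3.0, -2.0]],
]
recalibrated = tri_dimensional_gate(features)
print(len(recalibrated), len(recalibrated[0]), len(recalibrated[0][0]))
```

The output keeps the input shape, so a block of this kind can be placed after the bottleneck convolutions without changing downstream layer dimensions.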

4.2. Limitations and Failure Case Analysis

Despite achieving state-of-the-art performance, ATM-Net exhibits several limitations that warrant discussion and suggest directions for future improvement. Figure 7 visualizes representative failure cases that reveal current challenges in complex real-world scenarios.
False Positives in Complex Urban Environments: As highlighted by blue arrows in columns 1, 2, and 5 of Figure 7, ATM-Net generates false positives where non-vehicle objects such as road markings, shadows, or ground textures are incorrectly classified as vehicles. These misdetections typically occur in scenarios with high visual complexity, where background elements exhibit vehicle-like shapes or textures. For instance, in densely packed urban intersections (columns 1–2), zebra crossings and lane markings with rectangular patterns can trigger false detections, while in large parking facilities (column 5), ground textures and shadows cast by infrastructure create visual ambiguities.
We conduct a systematic diagnostic analysis to identify the root causes of these false positives:
(1) Feature Confusion between Vehicles and Background Elements: The convolutional feature extractors learn to recognize rectangular shapes, edge patterns, and texture gradients as vehicle-indicative features. However, urban infrastructure elements (zebra crossings, lane markings, parking lot boundaries) share similar geometric primitives with vehicle rooftops when viewed from aerial perspectives. This feature-level ambiguity causes the classifier to misinterpret high-confidence background regions as vehicle candidates, particularly when these elements exhibit vehicle-like aspect ratios (1:2 to 1:4) and uniform texture patterns.
(2) Modality Weight Imbalance in Fusion: While ARFM enables effective RGB-IR fusion, the current architecture applies uniform fusion weights across all spatial locations. In scenarios where RGB and IR modalities provide conflicting evidence (e.g., road markings visible in RGB but absent in IR thermal signatures), the fusion mechanism may amplify false positive signals rather than suppressing them. Adaptive modality weighting based on local feature reliability could mitigate this issue.
(3) Insufficient Spatial Context Modeling: The current detection pipeline primarily relies on local feature patterns within individual anchor regions. However, distinguishing vehicles from background distractors often requires broader spatial context—vehicles typically appear on roads with surrounding traffic patterns, while false positives often occur in isolation or in semantically inconsistent locations (e.g., on building rooftops or in pedestrian areas). The lack of explicit spatial context reasoning limits the model’s ability to leverage scene-level semantic priors for false positive suppression.
(4) Training Sample Diversity Limitations: The DroneVehicle dataset, while comprehensive, may not fully capture the diversity of urban background elements that can trigger false positives. Rare but confusing background patterns (unusual road markings, construction materials, temporary structures) are underrepresented in training data, causing the model to generalize poorly to these edge cases.
(5) Hard Negative Mining Deficiency: Standard training procedures sample negative examples uniformly from background regions, which may not adequately expose the model to challenging false positive candidates. Implementing hard negative mining strategies that specifically target vehicle-like background elements during training could improve the model’s discriminative capability for these ambiguous cases.
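The hard negative mining idea in item (5) can be sketched as follows: among detections that match no ground-truth box, keep the highest-scoring ones as extra negative examples. This is a generic illustration under assumed conventions (boxes as `(x1, y1, x2, y2)`, an IoU threshold of 0.5, made-up sample data), not a procedure described by the authors.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mine_hard_negatives(detections, ground_truth, iou_thr=0.5, top_k=2):
    """Return the top_k highest-scoring detections that match no ground-truth box."""
    negatives = [
        d for d in detections
        if all(iou(d["box"], g) < iou_thr for g in ground_truth)
    ]
    return sorted(negatives, key=lambda d: d["score"], reverse=True)[:top_k]


# Made-up example: one real vehicle and three background detections.
ground_truth = [(10, 10, 30, 50)]
detections = [
    {"box": (11, 11, 31, 49), "score": 0.9},      # matches the vehicle
    {"box": (100, 100, 120, 140), "score": 0.8},  # e.g. a road marking
    {"box": (200, 10, 220, 50), "score": 0.3},
    {"box": (300, 300, 320, 340), "score": 0.6},
]
hard = mine_hard_negatives(detections, ground_truth)
print([d["score"] for d in hard])
```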
Future work should address these limitations through (i) attention-guided modality weighting that adaptively balances RGB-IR contributions based on local feature reliability; (ii) incorporating spatial context modules (e.g., non-local attention or graph neural networks) to model scene-level semantic relationships; (iii) augmenting training data with synthetic hard negatives generated from vehicle-like background patterns; and (iv) implementing curriculum learning with progressive hard negative mining to enhance discriminative feature learning.
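Direction (i) can be illustrated with a small sketch in which each spatial location receives its own RGB/IR mixing weight through a softmax over two reliability scores. How such reliability scores would be produced is left open in the text, so the scores, the function name `fuse_adaptive`, and the flat list layout are all assumptions made for illustration.

```python
import math


def fuse_adaptive(rgb, ir, rel_rgb, rel_ir):
    """Per-location convex combination of RGB and IR features.

    Weights come from a two-way softmax over hypothetical reliability scores.
    """
    fused = []
    for f_rgb, f_ir, s_rgb, s_ir in zip(rgb, ir, rel_rgb, rel_ir):
        m = max(s_rgb, s_ir)                       # subtract max for numerical stability
        e_rgb, e_ir = math.exp(s_rgb - m), math.exp(s_ir - m)
        w_rgb = e_rgb / (e_rgb + e_ir)
        fused.append(w_rgb * f_rgb + (1.0 - w_rgb) * f_ir)
    return fused


# Made-up example: a road marking that responds strongly in RGB but not in IR.
rgb_response = [0.9, 0.8]
ir_response = [0.1, 0.7]
# Location 0: IR judged far more reliable; location 1: both judged equally reliable.
print(fuse_adaptive(rgb_response, ir_response, [-3.0, 0.0], [3.0, 0.0]))
```

With equal reliability the result reduces to a plain average, which corresponds to the uniform weighting criticized in item (2); unequal scores let the more trusted modality dominate at that location.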
Long-Tail Category Performance: Quantitative results reveal performance variations across vehicle categories. On VEDAI, while ATM-Net achieves excellent overall performance (92.4% mAP50), some categories show relatively lower scores: the “Boat” category achieves 77.7% mAP50, lower than well-represented categories like Car (96.0%) and Van (99.5%). This performance gap stems from two factors: (1) fewer training samples for certain vehicle types, causing the model to have less exposure to these categories; (2) high intra-class appearance diversity within some categories, which encompasses various vehicle types with heterogeneous visual characteristics. Addressing this limitation requires either collecting more diverse training data for underperforming categories or investigating few-shot learning and class-balanced training strategies.
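One simple form of the class-balanced training mentioned above is to weight each class's loss inversely to its sample count. The sketch below shows that weighting under stated assumptions: the counts are made up for illustration and are not VEDAI statistics, and inverse-frequency weighting is only one of several possible balancing schemes.

```python
def inverse_frequency_weights(counts):
    """Per-class loss weights proportional to 1/count, normalized to a mean of 1."""
    inverse = {name: 1.0 / n for name, n in counts.items()}
    mean_inverse = sum(inverse.values()) / len(inverse)
    return {name: value / mean_inverse for name, value in inverse.items()}


# Made-up sample counts, NOT taken from the VEDAI dataset.
counts = {"car": 1000, "van": 100, "boat": 50}
weights = inverse_frequency_weights(counts)
print({name: round(w, 3) for name, w in weights.items()})
```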
Horizontal Bounding Box Constraints: The current implementation uses axis-aligned horizontal bounding boxes, which become suboptimal for densely-packed or arbitrarily oriented vehicles common in parking lots and tight urban spaces. While this design choice simplifies the detection pipeline and maintains computational efficiency, it can lead to overlapping bounding boxes and localization ambiguities in scenarios with closely spaced vehicles at various orientations. Future extensions should investigate rotated bounding box detection or oriented object detection frameworks to better handle arbitrary vehicle orientations.
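The overlap problem can be shown with a short geometric sketch: two vehicles that do not touch still produce overlapping horizontal boxes once they are parked at an angle. The vehicle dimensions (4.5 m × 2.0 m), the 2.5 m spacing, and the function names are assumptions chosen for the example.

```python
import math


def enclosing_hbb(cx, cy, length, width, angle_deg):
    """Axis-aligned box (x1, y1, x2, y2) enclosing a rotated rectangle."""
    t = math.radians(angle_deg)
    half_x = 0.5 * (length * abs(math.cos(t)) + width * abs(math.sin(t)))
    half_y = 0.5 * (length * abs(math.sin(t)) + width * abs(math.cos(t)))
    return (cx - half_x, cy - half_y, cx + half_x, cy + half_y)


def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def neighbour_hbb_iou(angle_deg, length=4.5, width=2.0, spacing=2.5):
    """IoU of the horizontal boxes of two parallel vehicles parked side by side.

    `spacing` is the centre-to-centre distance measured perpendicular to the
    long axis; since it exceeds the width, the vehicles never touch.
    """
    t = math.radians(angle_deg)
    box_a = enclosing_hbb(0.0, 0.0, length, width, angle_deg)
    box_b = enclosing_hbb(-spacing * math.sin(t), spacing * math.cos(t), length, width, angle_deg)
    return iou(box_a, box_b)


for angle in (0, 15, 30, 45):
    print(angle, round(neighbour_hbb_iou(angle), 3))
```

At 0 degrees the horizontal boxes are disjoint, while at 45 degrees they overlap noticeably even though the vehicles themselves are separated, which is the ambiguity that oriented boxes would remove.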
Temporal Information Underutilization: ATM-Net processes individual frames independently without exploiting temporal continuity available in video sequences. This frame-by-frame approach misses opportunities to leverage motion cues, temporal consistency, and trajectory information that could improve detection robustness and reduce false positives through temporal filtering. Incorporating temporal modeling, such as through recurrent architectures or temporal attention mechanisms, represents a promising direction for enhancing continuous UAV monitoring applications.
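A very simple form of the temporal filtering mentioned above is to keep a detection only if a sufficiently overlapping box also appeared in several recent frames. The sketch below assumes boxes are already expressed in a common coordinate frame (for example a hovering UAV or motion-compensated frames); the thresholds and sample data are made up for illustration.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def temporally_confirmed(current, history, iou_thr=0.3, min_hits=2):
    """Keep current-frame boxes that overlap a box in at least min_hits earlier frames."""
    kept = []
    for box in current:
        hits = sum(
            1 for frame in history
            if any(iou(box, previous) >= iou_thr for previous in frame)
        )
        if hits >= min_hits:
            kept.append(box)
    return kept


# Made-up example: a slowly moving vehicle plus a one-frame false positive.
history = [[(0, 0, 10, 10)], [(1, 0, 11, 10)], [(2, 0, 12, 10)]]
current = [(3, 0, 13, 10), (50, 50, 60, 60)]
print(temporally_confirmed(current, history))
```

A detection that appears in a single frame only, such as a transient shadow, is dropped, at the cost of a short confirmation delay for vehicles entering the scene.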
These limitations, while not undermining ATM-Net’s strong overall performance (83.7% mAP on DroneVehicle, improving over DMM by +4.4 pp), highlight specific areas where further research could enhance robustness for diverse real-world deployment scenarios.

5. Conclusions

This paper proposes ATM-Net, a lightweight multimodal RGB-IR fusion network for UAV-based vehicle detection. The network integrates three key components: Asymmetric Recurrent Fusion Module (ARFM) for balanced cross-modal collaboration and modality independence, Tri-Dimensional Attention (TDA) for comprehensive feature recalibration across orthogonal dimensions, and Multi-scale Adaptive Feature Pyramid Network (MAFPN) for robust multi-scale representation.
Experiments on VEDAI and DroneVehicle datasets validate ATM-Net’s effectiveness, achieving 92.4% mAP50 on VEDAI and 83.7% mAP on DroneVehicle with only 4.83 M parameters. Compared to state-of-the-art methods, ATM-Net demonstrates significant improvements while maintaining 15.1× to 18.2× parameter efficiency, making it well-suited for resource-constrained UAV edge platforms. Future work will explore additional modalities, temporal modeling, and rotated detection for enhanced all-weather UAV perception.

Author Contributions

Conceptualization, J.C.; methodology, J.C.; software, J.C.; validation, J.H.; formal analysis, J.C.; investigation, J.C. and Z.Z.; resources, J.Y., Z.W. and R.L.; data curation, J.H.; writing—original draft preparation, J.C.; writing—review and editing, J.C., J.H., Z.W. and R.L.; visualization, J.H. and Z.Z.; supervision, J.Y., Z.W. and R.L.; project administration, R.L.; funding acquisition, Z.W. and R.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by grants from the National Natural Science Foundation of China (Grant Nos. 42571372 and 42271345).

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are publicly available. The VEDAI (Vehicle Detection in Aerial Imagery) dataset can be accessed at https://github.com/liaoxuanzhi/VEDAI (accessed on 25 November 2025). The DroneVehicle dataset is available at https://github.com/VisDrone/DroneVehicle (accessed on 25 November 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sezgin, A.; Boyacı, A. Advancements in Object Detection for Unmanned Aerial Vehicles: Applications, Challenges, and Future Perspectives. In Proceedings of the 2024 12th International Symposium on Digital Forensics and Security (ISDFS), San Antonio, TX, USA, 29–30 April 2024; pp. 1–6.
  2. Tang, G.; Ni, J.; Zhao, Y.; Gu, Y.; Cao, W. A Survey of Object Detection for UAVs Based on Deep Learning. Remote Sens. 2024, 16, 149.
  3. Leng, J.; Ye, Y.; Mo, M.; Gao, C.; Gan, J.; Xiao, B.; Gao, X. Recent Advances for Aerial Object Detection: A Survey. ACM Comput. Surv. 2024, 56, 1–36.
  4. Nikouei, M.; Baroutian, B.; Nabavi, S.; Taraghi, F.; Aghaei, A.; Sajedi, A.; Moghaddam, M.E. Small object detection: A comprehensive survey on challenges, techniques and real-world applications. Intell. Syst. Appl. 2025, 27, 200561.
  5. Elhagry, A.; Saeed, M. Investigating the Challenges of Class Imbalance and Scale Variation in Object Detection in Aerial Images. arXiv 2022, arXiv:2202.02489.
  6. Gu, Y.; Chen, W.; Peng, D. UAV-based multimodal object detection via feature enhancement and dynamic gated fusion. Pattern Recognit. 2026, 172, 112722.
  7. Chen, C.; Bin, K.; Hu, T.; Qi, J.; Liu, X.; Liu, T.; Liu, Z.; Liu, Y.; Zhong, P. Fusion Meets Diverse Conditions: A High-diversity Benchmark and Baseline for UAV-based Multimodal Object Detection with Condition Cues. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Honolulu, HI, USA, 19–23 October 2025; pp. 27958–27967.
  8. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015.
  9. Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI Transformer for Oriented Object Detection in Aerial Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
  10. Ultralytics. YOLOv5, GitHub Repository, 2024, Version 7.0. Available online: https://github.com/ultralytics/yolov5 (accessed on 25 November 2025).
  11. Ultralytics. YOLOv8, GitHub Repository, 2025, Version 8.3.230. Available online: https://github.com/ultralytics/ultralytics (accessed on 25 November 2025).
  12. Zhang, Y.; Ye, M.; Zhu, G.; Liu, Y.; Guo, P.; Yan, J. FFCA-YOLO for Small Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5611215.
  13. Shen, L.; Lang, B.; Song, Z. DS-YOLOv8-Based Object Detection Method for Remote Sensing Images. IEEE Access 2023, 11, 125122–125137.
  14. Ju, M.; Niu, B.; Jin, S.; Liu, Z. SuperDet: An Efficient Single-Shot Network for Vehicle Detection in Remote Sensing Images. Electronics 2023, 12, 1312.
  15. Ren, Z. Enhanced YOLOv8 Infrared Image Object Detection Method with SPD Module. J. Theory Pract. Eng. Technol. 2024, 1, 1–7.
  16. Shen, L.; Lang, B.; Song, Z. Infrared Object Detection Method Based on DBD-YOLOv8. IEEE Access 2023, 11, 145853–145868.
  17. Lan, X.; Zhang, S.; Bai, Y.; Qin, X. Fine-Grained Multispectral Fusion for Oriented Object Detection in Remote Sensing. Remote Sens. 2025, 17, 3769.
  18. Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Drone-Based RGB-Infrared Cross-Modality Vehicle Detection Via Uncertainty-Aware Learning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6700–6713.
  19. Zhang, L.; Liu, Z.; Zhang, S.; Yang, X.; Qiao, H.; Huang, K.; Hussain, A. Cross-modality interactive attention network for multispectral pedestrian detection. Inf. Fusion 2019, 50, 20–29.
  20. Zhang, L.; Liu, Z.; Zhu, X.; Song, Z.; Yang, X.; Lei, Z.; Qiao, H. Weakly Aligned Feature Fusion for Multimodal Object Detection. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 4145–4159.
  21. Zhang, J.; Lei, J.; Xie, W.; Fang, Z.; Li, Y.; Du, Q. SuperYOLO: Super Resolution Assisted Object Detection in Multimodal Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5605415.
  22. Shen, J.; Chen, Y.; Liu, Y.; Zuo, X.; Fan, H.; Yang, W. ICAFusion: Iterative cross-attention guided feature fusion for multispectral object detection. Pattern Recognit. 2024, 145, 109913.
  23. Qiu, Y.; Zheng, X.; Hao, X.; Zhang, G.; Lei, T.; Jiang, P. ARSOD-YOLO: Enhancing small target detection for remote sensing images. Sensors 2024, 24, 7472.
  24. Yan, Y.; Wang, C.; Zhou, X.; Gao, X.; Qin, Z. LRFL-YOLO: A Large Receptive Field and Lightweight Model for Small Object Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 20790–20804.
  25. Zhang, J.; Lei, J.; Xie, W.; Li, Y.; Yang, G.; Jia, X. Guided Hybrid Quantization for Object Detection in Remote Sensing Imagery via One-to-One Self-Teaching. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5614815.
  26. He, X.; Tang, C.; Zou, X.; Zhang, W. Multispectral Object Detection via Cross-Modal Conflict-Aware Learning. In Proceedings of the 31st ACM International Conference on Multimedia, New York, NY, USA, 29 October–3 November 2023; pp. 1465–1474.
  27. Yuan, M.; Wang, Y.; Wei, X. Translation, scale and rotation: Cross-modal alignment meets RGB-infrared vehicle detection. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 509–525.
  28. Yuan, M.; Wei, X. C2Former: Calibrated and Complementary Transformer for RGB-Infrared Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5403712.
  29. Zhang, J.; Cao, M.; Xie, W.; Lei, J.; Li, D.; Huang, W.; Li, Y.; Yang, X. E2E-MFD: Towards End-to-End Synchronous Multimodal Fusion Detection. In Proceedings of the Advances in Neural Information Processing Systems; Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C., Eds.; Curran Associates, Inc.: New York, NY, USA, 2024; Volume 37, pp. 52296–52322.
  30. Zhu, J.; Chen, X.; Zhang, H.; Tan, Z.; Wang, S.; Ma, H. Transformer Based Remote Sensing Object Detection with Enhanced Multispectral Feature Extraction. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5001405.
  31. Zhou, M.; Li, T.; Qiao, C.; Xie, D.; Wang, G.; Ruan, N.; Mei, L.; Yang, Y.; Tao Shen, H. DMM: Disparity-Guided Multispectral Mamba for Oriented Object Detection in Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5404913.
  32. Qingyun, F.; Zhaokui, W. Cross-modality attentive feature fusion for object detection in multispectral remote sensing imagery. Pattern Recognit. 2022, 130, 108786.
  33. Guo, J.; Gao, C.; Liu, F.; Meng, D.; Gao, X. DAMSDet: Dynamic Adaptive Multispectral Detection Transformer with Competitive Query Selection and Adaptive Feature Fusion. arXiv 2024, arXiv:2403.00326.
  34. Guo, H.; Sun, C.; Zhang, J.; Zhang, W.; Zhang, N. MMYFnet: Multi-Modality YOLO Fusion Network for Object Detection in Remote Sensing Images. Remote Sens. 2024, 16, 4451.
  35. Razakarivony, S.; Jurie, F. Vehicle detection in aerial imagery: A small target detection benchmark. J. Vis. Commun. Image Represent. 2016, 34, 187–203.
Figure 1. ATM-Net architecture. Left: Dual-branch backbone with asymmetric recurrent fusion establishing fusion nodes at P2 (L9), P3 (L14), P4 (L19), and P5 (L25) levels, with C2f_TDA modules and SPPF at P5. Right (upper): Top-down path of MAFPN neck propagating semantic features from P5 to P3 with multi-path aggregation. Right (lower): Bottom-up path of MAFPN enhancing features from P3 to P5 with detection heads at three scales.
Figure 2. Triplet Bottleneck internal architecture within C2f_TDA module. The figure illustrates the internal structure of a single Triplet Bottleneck block, which serves as the core building block inside the C2f_TDA module. Left: Bottleneck structure with two convolutions followed by Tri-Dimensional Attention (TDA). Right: TDA mechanism with three orthogonal branches (C-W, H-C, H-W) processing permuted feature dimensions through attention gates, generating adaptive recalibration weights.
Figure 3. Single-modality vs. multimodal detection on VEDAI. (a) GT, (b) YOLOv8-RGB, (c) YOLOv8-IR, (d) ATM-Net. Five scenarios: urban roads, parking lots, farmland, construction sites, country paths under varied lighting. Box numbers: confidence scores.
Figure 4. Ablation visual comparison with heatmap visualization. (a) Ground truth, (b–e) detection heatmaps overlaid on RGB images: (b) YOLOv8-ARFM, (c) YOLOv8-ARFM+MAFPN, (d) YOLOv8-ARFM+TDA, (e) ATM-Net. Rows: parking lot (top), urban roads (bottom). Box numbers indicate confidence scores. Heatmaps visualize model attention using Grad-CAM++.
Figure 5. Comprehensive diagnostic metrics for ATM-Net on VEDAI dataset. (a) Training curves showing loss convergence and evaluation metrics across 100 epochs; (b) precision–recall curves illustrating per-class detection performance with mAP@0.5 = 92.4%; (c) F1-confidence curves for optimal threshold selection; (d) normalized confusion matrix visualizing per-class classification accuracy and inter-class confusion patterns.
Figure 6. Method comparison on VEDAI. Rows: (a) GT, (b) YOLOv5n, (c) YOLOv8n, (d) ARSOD-YOLO, (e) ICAFusion, (f) FGMF, (g) ATM-Net. Five scenarios showing detection results across different environmental conditions. Box colors indicate vehicle categories.
Figure 7. Qualitative detection comparison on DroneVehicle dataset. Row (a): Ground truth annotations. Row (b): ATM-Net detection results. Six columns represent diverse scenarios: urban intersections (1–2), tree-occluded road intersection (3), highway (4), large parking facility (5), nighttime residential area (6). Blue arrows indicate misdetection cases (false positives) where non-vehicle objects are incorrectly classified. Box colors represent different vehicle categories.
Table 1. Asymmetric routing details at each pyramid level.
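| Level | Resolution | RGB Branch | IR Branch | After Concat | C2f_TDA Repeats |
|---|---|---|---|---|---|
| P2 | 256 × 256 | 128 ch | 128 ch | 256 ch → RGB | 3 |
| P3 | 128 × 128 | 256 ch | 256 ch | 512 ch → RGB | 6 |
| P4 | 64 × 64 | 512 ch | 512 ch | 1024 ch → RGB | 6 |
| P5 | 32 × 32 | 1024 ch | 1024 ch | 2048 ch → RGB | 3 |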
Note: “ch” denotes channels. At each fusion node, RGB and IR features are concatenated (doubling channels), and the fused features flow into the RGB branch for the next level, while the IR branch continues with its original single-modality features. This asymmetric design ensures that the RGB branch progressively accumulates cross-modal information while the IR branch preserves thermal-specific characteristics throughout the network hierarchy.
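The channel bookkeeping in Table 1 can be reproduced with a few lines of Python. This is a minimal sketch rather than the authors' code. It assumes a 1024 × 1024 input (the VEDAI image size), strides of 4, 8, 16 and 32 for P2 to P5, and channel width doubling at every level; the helper name `routing_table` is hypothetical.

```python
def routing_table(input_size=1024, base_channels=128):
    """Return (level, resolution, branch_channels, fused_channels) per pyramid level.

    Assumptions (not stated verbatim in the source): strides 4/8/16/32 for
    P2..P5 and branch width doubling from one level to the next.
    """
    rows = []
    for i, name in enumerate(["P2", "P3", "P4", "P5"]):
        stride = 4 * 2 ** i
        branch_ch = base_channels * 2 ** i   # width of the RGB and of the IR branch
        fused_ch = 2 * branch_ch             # concatenation doubles the channel count
        rows.append((name, input_size // stride, branch_ch, fused_ch))
    return rows

for name, res, ch, fused in routing_table():
    print(f"{name}: {res} x {res}, RGB {ch} ch + IR {ch} ch -> {fused} ch (fed to the RGB branch)")
```

Under these assumptions the printed values match the resolution, branch and post-concatenation columns of Table 1.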
Table 2. Training hyperparameters and configuration details.
| Parameter | Value | Description |
|---|---|---|
| Optimization | | |
| Optimizer | SGD | Stochastic Gradient Descent |
| Initial learning rate (lr0) | 0.01 | Starting learning rate |
| Final learning rate (lrf) | 0.01 | End learning rate |
| Momentum | 0.937 | SGD momentum |
| Weight decay | 0.0005 | L2 regularization |
| Batch size | 8 | Samples per iteration |
| Nominal batch size (nbs) | 64 | For gradient scaling |
| Epochs | 100 | Total training epochs |
| Warmup | | |
| Warmup epochs | 3.0 | Linear warmup duration |
| Warmup momentum | 0.8 | Initial momentum |
| Warmup bias lr | 0.1 | Bias learning rate |
| Loss Weights | | |
| Box loss (λ_box) | 7.5 | Bounding box regression |
| Classification loss (λ_cls) | 0.5 | Class prediction |
| DFL loss (λ_dfl) | 1.5 | Distribution focal loss |
| Label smoothing | 0.0 | Disabled |
| Inference | | |
| NMS IoU threshold | 0.7 | Non-maximum suppression |
| Max detections | 300 | Per image limit |
| Reproducibility | | |
| Random seed | 0 | Fixed seed |
| Deterministic | True | Reproducible mode |
| AMP | False | Mixed precision disabled |
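For readers reproducing the setup, the values in Table 2 can be collected in a plain dictionary. The sketch below is illustrative only. The key names follow common Ultralytics-style hyperparameter names, which is an assumption (the source names only lr0, lrf and nbs). The accumulation rule `round(nbs / batch)` and the weighted-sum form of the loss are likewise assumed conventions, not details confirmed by the source.

```python
# Values copied from Table 2; key names are assumed (Ultralytics-style).
train_cfg = {
    "optimizer": "SGD", "lr0": 0.01, "lrf": 0.01, "momentum": 0.937,
    "weight_decay": 0.0005, "batch": 8, "nbs": 64, "epochs": 100,
    "warmup_epochs": 3.0, "warmup_momentum": 0.8, "warmup_bias_lr": 0.1,
    "box": 7.5, "cls": 0.5, "dfl": 1.5, "label_smoothing": 0.0,
    "iou": 0.7, "max_det": 300, "seed": 0, "deterministic": True, "amp": False,
}

def accumulation_steps(cfg):
    """Gradient-accumulation steps needed to emulate the nominal batch size (assumed rule)."""
    return max(round(cfg["nbs"] / cfg["batch"]), 1)

def total_loss(box, cls, dfl, cfg=train_cfg):
    """Weighted sum of the three loss terms using the Table 2 weights (assumed form)."""
    return cfg["box"] * box + cfg["cls"] * cls + cfg["dfl"] * dfl

print(accumulation_steps(train_cfg))  # 8
print(total_loss(1.0, 1.0, 1.0))      # 9.5
```

With a batch size of 8 and a nominal batch size of 64, this rule would accumulate gradients over 8 iterations per optimizer step.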
Table 3. Comparison of VEDAI and DroneVehicle datasets.
| Characteristic | VEDAI | DroneVehicle |
|---|---|---|
| Image number | 1246 | 28,439 |
| Vehicle instances | 3640 | >50,000 |
| Image size (pixels) | 1024 × 1024 (fixed) | Variable |
| Number of vehicle classes | 8 | 5 |
| Modality registration | Spatial (error < 2 pixels) | Spatial |
| Annotation type | Horizontal bounding box | Horizontal bounding box |
| Train/Test split | 1121/125 (90%/10%) | 17,990/1469 (92%/8%) |
| Scenarios | Urban roads, parking lots, agriculture | Complex urban, day-night |
| Usage in this work | Primary benchmark; ablation and hyperparameter | Comparative evaluation; cross-dataset generalization |
Table 4. Performance comparison between single-modality and multimodal fusion.
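| Model | Modality | P | R | mAP50 | mAP50-95 | Params (M) | GFLOPs |
|---|---|---|---|---|---|---|---|
| YOLOv8 | IR | 0.835 | 0.791 | 0.832 | 0.565 | 3.01 | 8.1 |
| YOLOv8 | RGB | 0.805 | 0.795 | 0.857 | 0.585 | 3.01 | 8.1 |
| YOLOv8-Symmetric | RGB + IR | 0.808 | 0.752 | 0.888 | 0.620 | 5.20 | 13.7 |
| YOLOv8-ARFM | RGB + IR | 0.909 | 0.756 | 0.907 | 0.641 | 4.75 | 12.1 |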
Table 5. Ablation experiment results of core modules.
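| Configuration | TDA | MAFPN | P | R | mAP50 | mAP50-95 | Params (M) | GFLOPs | FPS |
|---|---|---|---|---|---|---|---|---|---|
| Baseline (YOLOv8-ARFM) | - | - | 0.909 | 0.756 | 0.907 | 0.641 | 4.75 | 12.1 | 38.9 |
| YOLOv8-ARFM+TDA | ✓ | - | 0.854 | 0.828 | 0.917 | 0.645 | 4.75 | 12.1 | 39.6 |
| YOLOv8-ARFM+MAFPN | - | ✓ | 0.868 | 0.811 | 0.916 | 0.645 | 4.83 | 12.9 | 36.3 |
| ATM-Net | ✓ | ✓ | 0.913 | 0.816 | 0.924 | 0.647 | 4.83 | 13.0 | 35.9 |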
Table 6. Performance comparison of different methods on VEDAI dataset (1024 × 1024 resolution).
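| Method | Input Modality | mAP50 | mAP50-95 | Params (M) | GFLOPs |
|---|---|---|---|---|---|
| YOLOv8n [11] | RGB | 0.857 | 0.585 | 3.01 | 8.1 |
| YOLOv5n [10] | RGB | 0.703 | 0.421 | 2.5 | 7.1 |
| SuperDet [14] | RGB | 0.776 | - | - | - |
| FFCA-YOLO [12] | RGB | 0.717 | 0.448 | 7.12 | 51.2 |
| DS-YOLOv8 [13] | RGB | 0.769 | 0.511 | - | - |
| YOLOv8n [11] | Thermal | 0.832 | 0.565 | 3.01 | 8.1 |
| SPD-YOLOv8 [15] | Thermal | 0.637 | 0.521 | - | - |
| DBD-YOLOv8 [16] | Thermal | 0.794 | 0.541 | - | - |
| ARSOD-YOLO [23] | RGB + Thermal | 0.743 | 0.469 | 12.3 | - |
| LRFL-YOLO [24] | RGB + Thermal | 0.745 | 0.456 | 10.4 | - |
| FGMF [17] | RGB + Thermal | 0.663 | - | 8.5 | - |
| SuperYOLO [21] | RGB + Thermal | 0.765 | 0.476 | 4.85 | 56.0 |
| ICAFusion [22] | RGB + Thermal | 0.766 | 0.449 | 120.2 | 191.6 |
| Multispectral DETR [30] | RGB + Thermal | 0.827 | 0.508 | 73.0 | - |
| GHOST [25] | RGB + Thermal | 0.803 | 0.490 | 9.7 | - |
| YOLOFusion [32] | RGB + Thermal | 0.786 | 0.491 | 24.5 | 34.9 |
| DAMSDet [33] | RGB + Thermal | 0.915 | 0.553 | - | - |
| ATM-Net (Ours) | RGB + Thermal | 0.924 | 0.647 | 4.83 | 13.0 |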
Table 7. Detection performance comparison of different methods on each category of VEDAI dataset (mAP50).
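| Method | Car | Truck | Pickup | Tractor | Camping Car | Boat | Van | Other | Mean |
|---|---|---|---|---|---|---|---|---|---|
| YOLOv5n [10] | 0.925 | 0.666 | 0.878 | 0.759 | 0.818 | 0.754 | 0.806 | 0.654 | 0.783 |
| YOLOv8n [11] | 0.944 | 0.810 | 0.919 | 0.894 | 0.869 | 0.834 | 0.812 | 0.777 | 0.857 |
| SuperYOLO [21] | 0.900 | 0.808 | 0.868 | 0.670 | 0.822 | 0.721 | 0.725 | 0.607 | 0.765 |
| MMYNet [34] | 0.908 | 0.849 | 0.883 | 0.749 | 0.745 | 0.600 | 0.746 | 0.726 | 0.800 |
| ICAFusion [22] | 0.844 | 0.762 | 0.825 | 0.599 | 0.734 | 0.530 | 0.819 | 0.695 | 0.726 |
| YOLOFusion [32] | 0.917 | 0.781 | 0.859 | 0.719 | 0.789 | 0.711 | 0.752 | 0.547 | 0.786 |
| FFCA-YOLO [12] | 0.893 | 0.747 | 0.867 | 0.785 | 0.769 | 0.497 | 0.746 | 0.510 | 0.717 |
| FGMF [17] | 0.809 | 0.603 | 0.768 | 0.695 | 0.724 | 0.603 | 0.606 | 0.498 | 0.663 |
| ATM-Net (Ours) | 0.960 | 0.938 | 0.950 | 0.952 | 0.959 | 0.777 | 0.995 | 0.858 | 0.924 |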
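As a quick consistency check on Table 7, the Mean column for ATM-Net can be recomputed from the eight per-class mAP50 values. The snippet assumes the mean is an unweighted arithmetic average over classes; the source does not state this explicitly, but the reported numbers agree with it.

```python
# Per-class mAP50 for ATM-Net, copied from Table 7.
atm_net_ap50 = {
    "Car": 0.960, "Truck": 0.938, "Pickup": 0.950, "Tractor": 0.952,
    "Camping Car": 0.959, "Boat": 0.777, "Van": 0.995, "Other": 0.858,
}

# Unweighted mean over the eight classes (assumed definition of "Mean").
mean_ap50 = sum(atm_net_ap50.values()) / len(atm_net_ap50)
print(round(mean_ap50, 3))  # 0.924
```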
Table 8. Performance comparison of different methods on DroneVehicle dataset (640 × 640 resolution).
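| Method | Modality | Car | Freight Car | Truck | Bus | Van | mAP | Params (M) |
|---|---|---|---|---|---|---|---|---|
| Faster R-CNN [8] | RGB | 79.0 | 37.2 | 49.0 | 77.0 | 37.0 | 55.9 | 41.1 |
| RoI Trans [9] | RGB | 61.6 | 42.3 | 55.1 | 85.5 | 44.8 | 61.6 | 55.1 |
| Faster R-CNN [8] | IR | 89.4 | 48.3 | 53.5 | 87.0 | 42.6 | 64.2 | 41.1 |
| RoI Trans [9] | IR | 89.6 | 53.4 | 51.0 | 88.9 | 44.5 | 65.5 | 55.1 |
| CIAN [19] | RGB + IR | 89.98 | 60.22 | 62.47 | 88.90 | 49.59 | 70.23 | - |
| AR-CNN [20] | RGB + IR | 90.1 | 62.1 | 64.8 | 89.4 | 51.5 | 71.6 | - |
| UA-CMDet [18] | RGB + IR | 87.5 | 46.8 | 60.7 | 87.1 | 38.0 | 64.0 | - |
| TSFADet [27] | RGB + IR | 89.9 | 63.7 | 67.9 | 89.8 | 54.0 | 73.1 | 104.7 |
| CALNet [26] | RGB + IR | 90.3 | 63.0 | 76.2 | 89.1 | 58.5 | 75.4 | - |
| C2Former [28] | RGB + IR | 90.2 | 64.4 | 68.3 | 89.8 | 58.5 | 74.2 | 100.8 |
| E2E-MFD [29] | RGB + IR | 90.3 | 64.6 | 79.3 | 89.8 | 63.1 | 77.4 | - |
| DMM [31] | RGB + IR | 90.5 | 73.2 | 77.7 | 90.0 | 65.1 | 79.3 | 88.0 |
| ATM-Net (Ours) | RGB + IR | 90.5 | 74.7 | 78.5 | 90.3 | 67.1 | 83.7 | 4.83 |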
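The parameter counts in Table 8 make the size gap to the larger fusion models easy to quantify. The short sketch below divides each reported multimodal parameter count by ATM-Net's 4.83 M; only methods with a parameter count listed in the table are included, and the variable names are illustrative.

```python
# Parameter counts in millions, copied from Table 8 (RGB + IR methods only).
params_m = {"TSFADet": 104.7, "C2Former": 100.8, "DMM": 88.0}
atm_net_params_m = 4.83

# How many times larger each model is than ATM-Net.
ratios = {name: p / atm_net_params_m for name, p in params_m.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the parameters of ATM-Net")
```

All three ratios exceed 15, in line with the parameter-efficiency claim in the Highlights.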
MDPI and ACS Style

Chen, J.; Huang, J.; Zhang, Z.; Yang, J.; Wu, Z.; Luo, R. ATM-Net: A Lightweight Multimodal Fusion Network for Real-Time UAV-Based Object Detection. Drones 2026, 10, 67. https://doi.org/10.3390/drones10010067
