Article

Asymmetric Spatial–Frequency Fusion Network for Infrared and Visible Object Detection

1 Xi’an Key Laboratory of Human-Machine Integration and Control Technology for Intelligent Rehabilitation, School of Computer Science, Xijing University, Xi’an 710123, China
2 Xi’an Research Institute of High Technology, Xi’an 710025, China
* Authors to whom correspondence should be addressed.
Symmetry 2025, 17(12), 2174; https://doi.org/10.3390/sym17122174
Submission received: 17 November 2025 / Revised: 9 December 2025 / Accepted: 13 December 2025 / Published: 17 December 2025
(This article belongs to the Section Computer)

Abstract

Infrared and visible image fusion-based object detection is critical for robust environmental perception under adverse conditions, yet existing methods still suffer from insufficient modeling of modality discrepancies and limited adaptivity in their fusion mechanisms. This work proposes an asymmetric spatial–frequency fusion network, AsyFusionNet. The network adopts an asymmetric dual-branch backbone that extends the RGB branch to P5 while truncating the infrared branch at P4, thereby better aligning with the physical characteristics of the two modalities, enhancing feature complementarity, and enabling fine-grained modeling of modality differences. On top of this backbone, a local–global attention fusion (LGAF) module is introduced to model local and global attention in parallel and reorganize them through lightweight convolutions, achieving joint spatial–channel selective enhancement. Modality-specific feature enhancement is further realized via a hierarchical attention module (HAM) in the RGB branch, which employs dynamic kernel selection to emphasize multi-level texture details, and a Fourier spatial spectral modulation (FS2M) module in the infrared branch, which more effectively captures global thermal radiation patterns. Extensive experiments on the M3FD and VEDAI datasets demonstrate that AsyFusionNet attains 86.3% and 54.1% mAP50, respectively, surpassing the baseline by 8.8 and 6.4 points (approximately 11.4% and 13.4% relative gains) while maintaining real-time inference speed.

1. Introduction

Object detection, as a core task in computer vision, plays a pivotal role in applications such as autonomous driving, intelligent surveillance, and robotic navigation. With the rapid development of deep learning, CNN-based and ViT-based detectors have achieved remarkable advances in both accuracy and efficiency [1,2]. However, most existing detection methods rely on visible images and therefore suffer from severe performance degradation under low illumination, nighttime, or adverse weather conditions [3,4]. When ambient lighting changes drastically or is heavily influenced by haze, smoke, or dust, the texture and color cues in visible images are significantly degraded, leading to blurred object boundaries and reduced detection confidence, which in turn undermines the robustness and practicality of the system [5]. Infrared images capture thermal radiation emitted by objects and are inherently insensitive to illumination changes, while also exhibiting strong penetration capabilities. As a result, they can maintain stable image quality at night and in harsh environments [6]. Nevertheless, infrared images usually lack rich texture details and semantic information, making it difficult to distinguish targets with similar temperatures or objects with similar shapes. Therefore, a single-modality vision system is inadequate for achieving stable and reliable object perception in complex, dynamically changing real-world scenarios. By merging infrared and visible information, one can achieve complementary advantages at both the energy and semantic levels, thereby enabling more comprehensive and robust environmental perception. This has become an important research direction in multimodal object detection [7,8]. More broadly, multi-sensor data integration with machine learning has been successfully explored in various domains beyond object detection, further underscoring the versatility and application potential of multimodal perception frameworks [9,10].
Existing infrared and visible fusion-based detection methods can be broadly categorized into three types: pixel-level fusion, feature-level fusion, and decision-level fusion [11,12]. Feature-level fusion has become the main solution [13,14,15]. In recent years, researchers have further improved fusion-based detection by introducing attention mechanisms, cross-modal Transformers, and dynamically learned fusion weights [16,17]. Despite these advances, multimodal fusion detection still faces several key challenges:
(1) Insufficient modeling of modality-specific characteristics. Most existing approaches adopt symmetric network architectures, using identical or similar feature extractors for the infrared and visible branches, without fully considering the fundamental differences between the two modalities in imaging mechanisms, information density, and semantic expressiveness. Such designs limit the effective exploitation of information specific to each modality.
(2) Lack of adaptivity in the fusion mechanism. Many fusion strategies rely on fixed weights or simple feature concatenation, making it difficult to adaptively adjust the fusion behavior according to scene content, object properties, and modality quality. This often leads to feature redundancy, noise interference, and modality mismatch in complex scenarios, compromising the stability of detection performance.
(3) Underutilization of frequency-domain information. Existing studies mainly focus on spatial-domain feature fusion, while overlooking the complementary properties of infrared and visible images in the frequency domain. The low-frequency components of the infrared images highlight the global distribution of the thermal targets, whereas the high-frequency components of the visible images contain rich edge and texture details. How to effectively combine frequency- and spatial-domain features to realize cross-modal spectral complementarity remains a crucial open problem.
To tackle the above challenges, we propose an asymmetric spatial–frequency fusion network, AsyFusionNet, which systematically addresses key technical issues in multimodal detection through innovative network architecture and fusion mechanisms. As shown in Figure 1a, traditional methods typically adopt a fully symmetric dual-branch structure, where the corresponding P3–P5 feature maps of the visible (RGB) and infrared (IR) branches are fused layer by layer using a generic fusion operator F, and then jointly fed into the subsequent Neck and Head modules. Although such a design is structurally concise, it ignores the intrinsic differences in information density and semantic representation between the two modalities, leading to wasted computational resources and limited feature representation capacity.
Unlike these approaches, our model does not simply plug additional attention or frequency blocks into an existing symmetric architecture, but instead redesigns both the backbone and the fusion pipeline to explicitly encode asymmetry between RGB and IR modalities and to unify spatial- and frequency-domain interactions within a single framework, as shown in Figure 1b. The RGB branch maintains a complete P1–P5 feature pyramid so that rich texture and semantic cues can be fully extracted, while the IR branch is only extended to the P4 level, which cuts down redundant deep-layer computation but still preserves key thermal radiation information. The RGB branch focuses on learning deeper texture and semantic representations, whereas the IR branch concentrates on shallow structures and thermal target cues; their features are dynamically fused across multiple levels to combine detailed information with high-level semantics.
On top of the above asymmetric design, AsyFusionNet employs three key components, the local–global attention fusion (LGAF) module, the hierarchical attention module (HAM), and the Fourier spatial spectral modulation (FS2M) module, to accomplish multimodal feature extraction and efficient information interaction. The main contributions of this paper can be summarized as follows:
  • Propose an asymmetric spatial–frequency fusion network, AsyFusionNet, in which the RGB branch is extended to P5 and the IR branch to P4, and hierarchically coupled fusion is performed at P2–P4. This design reduces the overall computational cost compared with symmetric backbones and simple concatenation, without sacrificing accuracy, while better reflecting the imaging characteristics of the two modalities and enhancing feature complementarity.
  • Design the LGAF module, which concatenates the local attention branch, the global attention branch, and the baseline feature branch, and then uses lightweight convolutions for compact spatial–channel coupling modeling. Compared with fusion strategies that rely on fixed weights or straightforward concatenation, LGAF, with almost no extra computation, significantly improves texture fidelity and semantic consistency, while effectively suppressing noise and redundant information.
  • Complementary feature enhancement for RGB and IR via HAM and FS2M. For visible features, we propose the HAM module, which performs dynamic kernel selection and cross-layer information sharing to enhance feature extraction. For IR features, we propose FS2M, which jointly models information in the frequency, spatial, and spectral domains to strengthen low-frequency energy and thermal target saliency. Working in tandem, HAM and FS2M make it easier for the model to distinguish small objects from large-scale scenes under low-resolution or cluttered backgrounds.
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 details the network architecture of AsyFusionNet and its main components. Section 4 reports experimental results and analysis. Section 5 concludes the paper and outlines future research directions.

2. Related Work

2.1. Visible and Infrared Object Detection

In recent years, with the rapid progress of deep learning, visible and infrared fusion-based object detection has made notable advances. Compared with single modal detectors, multimodal fusion can better cope with challenging scenarios such as illumination changes and adverse weather, and thus shows great potential in applications including autonomous driving, intelligent surveillance, and military reconnaissance [18,19]. According to the fusion strategy, existing multimodal object detection methods can be roughly divided into three categories: pixel-level fusion, feature-level fusion, and decision-level fusion.
Early work on pixel-level visible and infrared fusion for object detection mainly focused on multiresolution fusion techniques, where wavelet transforms [20] and principal component analysis (PCA) [21] are used to integrate visible and infrared images at the pixel level. Qiu et al. [22] introduced a multiresolution data fusion scheme based on PCA, which combines pixel-level weights derived from wavelet analysis and jointly exploits thermal and visible information to improve automatic target recognition (ATR). Similarly, Wu et al. [23] proposed a fusion method that employs Daubechies wavelet bases (DWB) together with pixel-wise weighting, aiming to enhance ATR performance through effective multisensor data integration. Zheng et al. [24] presented a multi-stage fusion network (MSFAM) with an attention mechanism, which effectively leverages pixel-level information to exploit the rich textures of visible images and the salient responses of infrared images. Xie et al. [25] developed YOLO-MS, a multispectral object detection framework that adopts feature interaction and self-attention-guided fusion to alleviate the limitations of pixel-level fusion in terms of multimodal interaction and global dependency modeling. Wang et al. [26] proposed FMPFNet, a real-time aerial multispectral detection framework that integrates a dynamic modality-balancing pixel-level fusion strategy (DMDTM) to improve detection performance in difficult conditions such as low-light scenes. Although pixel-level fusion can noticeably improve image quality, it is prone to information loss, especially when the modalities provide conflicting cues; it cannot be easily tailored to specific detection tasks and struggles to handle semantic inconsistencies between modalities. As a result, research attention has gradually shifted toward higher-level fusion strategies.
Decision-level fusion operates on the outputs of detectors and produces the final result by combining detection predictions from different modalities. Guan et al. [27] proposed a confidence-based decision fusion method that determines fusion weights by analyzing the confidence distribution of detection results from each modality. Liu et al. [16] designed an adaptive non-maximum suppression (NMS)-based fusion algorithm, which dynamically adjusts the fusion strategy according to the overlap and confidence of detection boxes. Wang et al. [28] introduced a decision-level fusion framework based on reinforcement learning, where an agent learns an optimal fusion policy. Zhang et al. [29] developed a graph neural network-based decision fusion method, which models detection results from different modalities as a graph and performs reasoning on this structure. Although such methods are flexible, the lack of interaction at the feature level often makes it difficult to fully exploit the complementary information across modalities.
Feature-level visible and infrared object detection aims to harness the complementary information provided by multispectral modalities and has become the mainstream paradigm for multimodal detection. Li et al. [30] proposed YOLO-FIRI, which treats image fusion as a preprocessing step and uses a CNN to merge visible and infrared images, thereby improving feature extraction and detection accuracy. Sun et al. [31] introduced DetFusion, where object-related cues learned by the detection network are exploited to guide the fusion process. In this approach, the fusion network is cascaded with the detection module, and the detection loss is used as a supervision signal, ensuring that the fused features are directly optimized for detection performance. YOLO-MS [32] adopts a multispectral detection framework with feature interaction and self-attention-guided fusion; this design enhances multimodal interaction and captures global dependencies to overcome limitations of prior approaches. CDC-YOLOFusion [33] further incorporates a novel CDCF module to adaptively extract and fuse bimodal features related to the data distribution, enabling the learned kernels to attend simultaneously to common salient patterns and modality-specific characteristics, and thus produce more informative combined representations.

2.2. Feature Extraction and Fusion

In feature-level fusion, a central question is how to design effective feature extraction and fusion schemes that respect the characteristics of each modality. Existing studies can be broadly grouped along several lines.
Attention mechanisms play a crucial role in multimodal fusion, as they enable adaptive weighting of features from different modalities. YOLOFIV [34] adopts a dual-stream backbone with attention-based fusion and introduces ECA attention to strengthen the focus on complex detection scenarios, achieving 64.71% mAP@0.5 on a UAV vehicle dataset. PIAFusion [35] is one of the first to incorporate illumination awareness into image fusion, where an illumination-aware loss function guides the fusion process and yields high-quality results under all-day conditions. LASFNet [36] employs an attention-guided self-modulation fusion (ASFF) module to adaptively adjust fusion responses at both global and local levels, reporting 1–3% mAP gains on three benchmark datasets. IAIFNet [37] improves fusion performance in low-light environments through a salient target-aware module (STAM) and an adaptive differential fusion module (ADFM). These approaches can adapt to varying illumination and scene complexity, but their reliance on sophisticated attention modules often leads to relatively high computational cost.
Frequency-domain analysis offers a complementary perspective for multimodal feature fusion by decomposing and recombining different frequency components to achieve finer-grained fusion. WaveMamba [38] uses the discrete wavelet transform (DWT) to decompose complementary frequency components of visible and infrared images and combines them within a Mamba-based framework for comprehensive fusion of low- and high-frequency sub-bands, leading to an average mAP improvement of 4.5% on four benchmark datasets. MGFF [39] reconstructs and enhances the frequency components of visible and infrared images via wavelet transforms and employs a mask-guided feature reconstruction module to ensure effective fusion under weak alignment. MCFusion [40] designs a frequency-domain feature enhancement (FCE) module that applies different frequency processing strategies to the visible and thermal modalities, respectively. FD2-Net [41] introduces a high-frequency unit (HFU) and a low-frequency unit (LFU) to separately handle distinct frequency components from visible and infrared images. FAFusion [42] proposes a maximum frequency loss to ensure that the fused image preserves key frequency components from the source images. While these methods capture cross-modal complementarity in a more refined manner, the frequency transforms and related operations inevitably introduce non-negligible computational overhead.
To meet the demands of practical applications where efficiency is a key requirement, a number of lightweight fusion frameworks have been proposed. RSDet [43] adopts a coarse-to-fine strategy for feature purification and fusion, combining a redundant spectrum removal module with a dynamic feature selection module to achieve efficient fusion. MMI-Det [44] employs four cooperative modules (contour enhancement, fusion focusing, contrast bridging, and information guidance) to realize efficient multimodal information perception. IM-CMDet [45], targeting small-object detection in UAV aerial scenarios, strikes a balance between model compactness and accuracy through the joint optimization of three modules: DSJE, FRN, and DFWG.
Across pixel-, feature-, and decision-level fusion, existing multimodal approaches each have their merits, but feature-level fusion has become the dominant paradigm. Nevertheless, modality-specific characteristics remain underexploited due to symmetric architectures, fusion mechanisms are insufficiently adaptive to scene and modality changes, and joint spatial–frequency modeling is still limited. To address these issues, we propose AsyFusionNet, which integrates an asymmetric backbone, adaptive fusion, and spatial–frequency joint modeling into an efficient and practically deployable multimodal detection framework.

3. Methodologies

3.1. Overall Structure

The overall architecture of AsyFusionNet is illustrated in Figure 2. The model adopts a lightweight dual-branch asymmetric fusion framework, designed to fully exploit the complementary properties of visible (RGB) and infrared (IR) images for object detection. Given paired RGB and IR images as inputs, two independent backbone networks are used for feature extraction. In the RGB branch, HAM is employed to extract salient features from levels P1–P5, adaptively enhancing task-relevant information while suppressing background clutter, thereby highlighting structural details and semantic cues in well-illuminated regions. In parallel, the IR branch uses FS2M to extract features from P1–P4, modeling the spatial- and frequency-domain distributions of thermal radiation patterns. This enables cross-modal consistency adjustment and information compensation, and strengthens the representation of regions with weak textures or low contrast.
Subsequently, features from the two branches are fed into the LGAF module for deep fusion. By jointly integrating local details and global contextual information, LGAF effectively captures multi-scale semantic correlations, thereby improving the complementary representation and spatial alignment of cross-modal features. The fused features are then passed to a PAFPN for multi-scale feature reconstruction and semantic enhancement. Finally, the detection head operates on the enhanced feature pyramid to produce the fused detection results, accomplishing multimodal object detection under both RGB and IR imaging conditions.
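To make the above data flow concrete, the PyTorch-style sketch below outlines how an asymmetric dual-branch detector of this kind can be wired together. It is only an illustrative skeleton under our own naming: the backbone, fusion, neck, and head modules are injected placeholders, and it should not be read as the authors' released implementation.

```python
import torch.nn as nn


class AsyFusionSketch(nn.Module):
    """Sketch of the asymmetric dual-branch data flow (not the official code).

    The RGB backbone returns a full P2-P5 pyramid, the IR backbone stops at P4,
    and fusion is performed at the P2-P4 levels before the neck and head.
    """

    def __init__(self, rgb_backbone, ir_backbone, fusion_p2, fusion_p3, fusion_p4, neck, head):
        super().__init__()
        self.rgb_backbone = rgb_backbone  # returns {'P2': ..., 'P3': ..., 'P4': ..., 'P5': ...}
        self.ir_backbone = ir_backbone    # returns {'P2': ..., 'P3': ..., 'P4': ...}
        self.fusion = nn.ModuleDict({"P2": fusion_p2, "P3": fusion_p3, "P4": fusion_p4})
        self.neck = neck                  # e.g., a PAFPN operating on the fused pyramid
        self.head = head                  # detection head

    def forward(self, rgb, ir):
        rgb_feats = self.rgb_backbone(rgb)   # HAM-enhanced RGB features in the paper
        ir_feats = self.ir_backbone(ir)      # FS2M-enhanced IR features in the paper
        fused = {lvl: self.fusion[lvl](rgb_feats[lvl], ir_feats[lvl]) for lvl in ("P2", "P3", "P4")}
        fused["P5"] = rgb_feats["P5"]        # deepest level comes from the RGB branch only
        return self.head(self.neck(fused))
```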

3.2. Local–Global Attention Fusion (LGAF)

In RGB–IR image fusion, the features from the two modalities differ significantly, and their representations in feature space are often strongly misaligned due to discrepancies in imaging mechanisms and semantic distributions. As a result, it is difficult for fusion models to strike a balance between preserving fine details and enhancing target saliency. Moreover, in CNN-based fusion methods, the receptive field of convolution operations is inherently limited, making it hard to capture long-range contextual dependencies. Purely global attention, on the other hand, tends to overlook local structural details, and thus cannot simultaneously preserve local cues and global semantics. In addition, during multimodal feature fusion, features at different scales are often insufficiently coupled along the spatial and channel dimensions, which leads to low fusion efficiency. Most existing methods also rely on fixed weights or simple concatenation, and therefore cannot dynamically adjust the importance of each modality according to local content, making them prone to noise and redundant information. Furthermore, many existing local–global or dual-path attention mechanisms either apply local and global attention sequentially or share a single attention map across modalities, which limits their ability to decouple local and global modeling from cross-modal fusion and to provide a stable modality-agnostic reference path.
To address the above limitations in a unified manner, we propose LGAF, as shown in Figure 3, which aims to jointly model local details and global semantic information, thereby alleviating discrepancies between RGB and IR features in terms of spatial distribution and semantic representation. In contrast to conventional local–global attention blocks, LGAF first builds modality-specific local and global responses and then performs cross-modal fusion around an explicit baseline branch, so that scale modeling and modality interaction are cleanly separated. Both the local attention (LocalAtt) and global attention (GlobalAtt) branches are built upon the parallelized patch-aware attention (PPA) module adopted in HCF-Net [46]. Let the RGB and IR input features be denoted respectively as:
$F_{\mathrm{ir}},\; F_{\mathrm{vi}} \in \mathbb{R}^{C \times H \times W}$
First, a 1 × 1 convolution is applied to align and compress the feature channels:
$F_{\mathrm{ir}} = \mathrm{Conv}_{1\times 1}\!\left(F_{\mathrm{ir}}\right), \quad F_{\mathrm{vi}} = \mathrm{Conv}_{1\times 1}\!\left(F_{\mathrm{vi}}\right)$
After obtaining the aligned RGB and IR features, we feed them into two parallel branches: LocalAtt and GlobalAtt. The local branch $A_L$ adopts a smaller patch size $p=2$ to model neighborhood dependencies and preserve fine details and edges, whereas the global branch $A_G$ uses a larger patch size $p=4$ to capture long-range dependencies, global semantics, and saliency:
$L_{\mathrm{vi}} = A_L\!\left(F_{\mathrm{vi}}\right), \quad G_{\mathrm{vi}} = A_G\!\left(F_{\mathrm{vi}}\right), \quad L_{\mathrm{ir}} = A_L\!\left(F_{\mathrm{ir}}\right), \quad G_{\mathrm{ir}} = A_G\!\left(F_{\mathrm{ir}}\right).$
The outputs of the two branches are then concatenated and fused by a convolution layer, yielding a complementary representation that is both fine-grained and globally coherent, and mitigating the bias caused by using only local or only global modeling. In this way, the local and global responses are first aggregated within each modality, so that the single-modal representation already exhibits multi-scale consistency, thereby avoiding information conflicts caused by scale mismatch during direct cross-modal interaction:
$S_{\mathrm{vi}} = \mathcal{C}\!\left[L_{\mathrm{vi}};\, G_{\mathrm{vi}}\right], \quad S_{\mathrm{ir}} = \mathcal{C}\!\left[L_{\mathrm{ir}};\, G_{\mathrm{ir}}\right]$, where $\mathcal{C}[\,\cdot\,;\,\cdot\,]$ denotes channel-wise concatenation.
Next, an element-wise summation is applied to obtain a baseline cross-modal common feature, which is further refined by a 3 × 3 convolution for local mixing and denoising. This branch serves as a stable alignment anchor, reducing the dependence and sensitivity of the subsequent fusion process to the attention branches.
$F_{\mathrm{mid}} = \mathrm{Conv}_{3\times 3}\!\left(F_{\mathrm{vi}} + F_{\mathrm{ir}}\right)$
After that, the three types of features are concatenated to form a representation pool that is complementary while keeping redundancy under control, where the shared-information channel is explicitly preserved to facilitate selective emphasis on either fine details or saliency in the subsequent reorganization step:
$F_{m} = \mathcal{C}\!\left[S_{\mathrm{vi}};\, F_{\mathrm{mid}};\, S_{\mathrm{ir}}\right]$
Finally, feature reassembly and mapping are performed to produce the fused output:
$F_{\mathrm{out}} = \mathrm{Conv}_{1\times 1}\!\left(\mathrm{RepConv}_{3\times 3}\!\left(\mathrm{Conv}_{3\times 3}\!\left(F_{m}\right)\right)\right)$
This reconstruction pathway relies only on a small number of lightweight convolutions, and thus introduces merely a minor increase in FLOPs while enhancing information selection and feature recombination capabilities and maintaining good computational efficiency and deployability. As a result, the fused output preserves the structural details of the visible modality while strengthening the saliency of infrared targets, yielding a structurally consistent, semantically complete, and visually natural fusion effect.
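The sketch below summarizes the LGAF computation described by the equations above. It is a simplified, assumption-laden rendition: the PPA-based LocalAtt/GlobalAtt branches are passed in as placeholder modules, RepConv is approximated by a plain 3×3 convolution, and channel widths are free parameters.

```python
import torch
import torch.nn as nn


class LGAFSketch(nn.Module):
    """Simplified LGAF sketch following Section 3.2 (not the official implementation)."""

    def __init__(self, in_ch, mid_ch, local_att, global_att):
        super().__init__()
        self.align_vi = nn.Conv2d(in_ch, mid_ch, 1)   # 1x1 channel alignment/compression
        self.align_ir = nn.Conv2d(in_ch, mid_ch, 1)
        self.local_att = local_att                    # A_L, PPA-style branch with patch size 2
        self.global_att = global_att                  # A_G, PPA-style branch with patch size 4
        self.merge_vi = nn.Conv2d(2 * mid_ch, mid_ch, 3, padding=1)  # fuse [L_vi; G_vi]
        self.merge_ir = nn.Conv2d(2 * mid_ch, mid_ch, 3, padding=1)  # fuse [L_ir; G_ir]
        self.baseline = nn.Conv2d(mid_ch, mid_ch, 3, padding=1)      # F_mid = Conv3x3(F_vi + F_ir)
        self.reassemble = nn.Sequential(               # Conv3x3 -> (RepConv3x3 stand-in) -> Conv1x1
            nn.Conv2d(3 * mid_ch, mid_ch, 3, padding=1),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1),
            nn.Conv2d(mid_ch, mid_ch, 1),
        )

    def forward(self, f_vi, f_ir):
        f_vi, f_ir = self.align_vi(f_vi), self.align_ir(f_ir)
        s_vi = self.merge_vi(torch.cat([self.local_att(f_vi), self.global_att(f_vi)], dim=1))
        s_ir = self.merge_ir(torch.cat([self.local_att(f_ir), self.global_att(f_ir)], dim=1))
        f_mid = self.baseline(f_vi + f_ir)             # stable cross-modal alignment anchor
        return self.reassemble(torch.cat([s_vi, f_mid, s_ir], dim=1))
```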

3.3. Hierarchical Adaptive Mixer (HAM)

In multimodal image fusion, visible images play a key role in providing structured details and rich textures. However, the features of visible images are highly complex and dynamic, as they are strongly affected by illumination changes, occlusions, noise, and multi-scale structural variations. CNN-based models typically rely on fixed, static parameters, which makes it difficult to fully model multi-scale patterns when dealing with inputs exhibiting large-scale differences and complex spatial variations. The use of fixed-size convolution kernels constrains the balance between fine detail preservation and global context modeling, rendering the network sensitive to scale changes and lacking flexibility. In addition, feature maps from different layers contain complementary information: shallow layers emphasize texture and edge details, while deeper layers encode more semantic cues. Due to insufficient mechanisms for cross-layer interaction, traditional networks often fail to achieve effective collaboration among multi-level features, which in turn hampers semantic propagation between lower and higher layers. On the other hand, rigid convolutional structures introduce parameter and computation redundancy, reducing efficiency in large-scale scenarios and limiting their applicability to real-time tasks.
To overcome these limitations, we propose the HAM module, which aims to realize dynamic kernel selection, cross-scale feature fusion, and inter-layer information sharing, thereby forming a flexible and efficient feature extraction framework. As shown in Figure 4a, HAM consists of a normalization layer, a context-gated linear unit (CGLU), and a core hierarchical adaptive convolution (HAConv) unit. Residual connections and hierarchical feature fusion are employed inside the module so that high-level semantic context can be injected while preserving low-level structural information, leading to robust cross-layer feature representations.
HAConv is the core component of HAM, and its structure is illustrated in Figure 4b. Given an input feature $X \in \mathbb{R}^{C \times H \times W}$, we first split it into several parallel branches, and each branch applies a convolution kernel $K_i$ with a different receptive field to extract features:
$Y_i = \mathrm{Conv}_{K_i}(X), \quad i = 1, 2, \ldots, n.$
To adaptively select the most suitable convolution kernels according to the input features, HAM introduces a dynamic kernel weighting mechanism. Specifically, global statistics of the input are first obtained via global average pooling (GAP), and then passed through two 1 × 1 convolution layers followed by a Softmax function to generate dynamic weights for each branch:
$\alpha_i = \mathrm{Softmax}\!\left(W_2\, \delta\!\left(W_1\, \mathrm{GAP}(X)\right)\right),$
where $W_1$ and $W_2$ are learnable parameters, $\delta(\cdot)$ denotes a nonlinear activation function, and $\alpha_i$ is the adaptive coefficient for the $i$-th convolution branch, satisfying $\sum_i \alpha_i = 1$. The outputs of all branches are then fused in a weighted manner according to these coefficients to obtain cross-scale features:
$Y = \sum_{i=1}^{n} \alpha_i \cdot Y_i .$
The fused feature is then passed through a 1 × 1 convolution for channel integration and added to the input feature via a residual connection to produce the final output of the module:
$\mathrm{HAM}(X) = X + \mathrm{Conv}\!\left( \sum_{i=1}^{n} \alpha_i \cdot \mathrm{Conv}_{K_i}(X) \right).$
This hierarchical and adaptive design allows HAM to adjust its feature extraction strategy across different scales and semantic levels according to the input content. The dynamic kernel selection mechanism enables the network to focus on local textures in detailed regions while strengthening global responses in structurally or semantically salient areas, thereby achieving efficient fusion of multi-scale features. Residual information flow between layers ensures semantic consistency and stable gradients, while the use of depthwise convolutions effectively reduces the number of parameters and computational cost.
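A minimal sketch of the dynamic kernel selection in HAConv is given below, assuming depthwise branches with 3×3, 5×5, and 7×7 kernels and a squeeze-style gating network; the kernel sizes and reduction ratio are illustrative choices, not values reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HAConvSketch(nn.Module):
    """Sketch of HAConv's dynamic kernel weighting (hyperparameters are illustrative)."""

    def __init__(self, channels, kernel_sizes=(3, 5, 7), reduction=4):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)  # depthwise Conv_{K_i}
            for k in kernel_sizes
        ])
        hidden = max(channels // reduction, 8)
        self.gate = nn.Sequential(                   # alpha_i = Softmax(W_2 * delta(W_1 * GAP(X)))
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, hidden, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, len(kernel_sizes), 1),
        )
        self.proj = nn.Conv2d(channels, channels, 1)  # channel integration before the residual add

    def forward(self, x):
        alphas = F.softmax(self.gate(x), dim=1)       # (B, n, 1, 1), sums to 1 over the branches
        y = sum(alphas[:, i:i + 1] * branch(x) for i, branch in enumerate(self.branches))
        return x + self.proj(y)                       # residual output, HAM(X) = X + Conv(sum_i ...)
```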

3.4. Fourier Spatial Spectral Modulation (FS2M)

Infrared images primarily reflect the thermal radiation distribution of a scene and are dominated by low-frequency energy. Conventional convolutions operate locally in the spatial domain and thus struggle to fully capture global thermal information, leading to incomplete feature representations. Moreover, standard convolutions lack inherent cross-scale adaptivity, which easily causes attenuation of small target energy and loss of fine details. Due to the discrepancy in spectral response mechanisms between RGB and IR modalities, existing feature extractors also find it difficult to achieve spectral-level alignment and energy balance, which in turn introduces semantic shifts and information distortion during fusion.
We propose the FS2M module, which adaptively enhances infrared features through joint modeling in the frequency, spatial, and spectral domains. FS2M is built as an improvement on the baseline C3K2 block; as shown in Figure 5, the module mainly consists of cascaded Fourier spectral block (FSB) units and CBS blocks. The overall pipeline is illustrated in Figure 5a. The input feature first passes through a standard CBS block for initial feature extraction and normalization. Then, a channel-wise split operation divides the feature into several parallel subchannels, each of which is fed into multiple stacked FSB units for deep feature transformation. Inside each FSB, a core FDConv [47] performs feature modulation and enhancement simultaneously in the frequency, spatial, and spectral domains. The outputs of multiple FSBs are concatenated along the channel dimension and passed through a final CBS block for nonlinear combination and reconstruction, yielding an infrared representation with higher semantic consistency and better detail preservation.
FDConv is the key component of FS2M. It breaks the limitation of traditional convolutions with fixed spatial-domain weights by parameterizing the convolution kernels in the frequency domain and enabling adaptive responses to different frequency components. By jointly leveraging frequency-domain modeling, spatial adaptive modulation, and spectral reconstruction, FDConv substantially improves the expressiveness and discriminability of infrared features, and provides FS2M with an efficient and physically interpretable feature extraction mechanism.
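Since FDConv itself is defined in [47], the sketch below only illustrates the general idea of frequency-domain feature modulation with a simplified stand-in: the feature map is re-weighted in the Fourier domain by a learnable 1×1 mixing of its real and imaginary parts and then combined with a spatial convolution. It should not be read as the actual FS2M/FDConv implementation.

```python
import torch
import torch.nn as nn


class FrequencyModulationSketch(nn.Module):
    """Simplified stand-in for the FSB/FDConv idea (not the FDConv of [47])."""

    def __init__(self, channels):
        super().__init__()
        self.freq_mix = nn.Conv2d(2 * channels, 2 * channels, 1)  # learnable spectral re-weighting
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x):
        b, c, h, w = x.shape
        spec = torch.fft.rfft2(x, norm="ortho")              # (B, C, H, W//2+1), complex spectrum
        spec = torch.cat([spec.real, spec.imag], dim=1)      # stack real/imag parts as channels
        spec = self.freq_mix(spec)                           # modulate frequency components
        spec = torch.complex(spec[:, :c], spec[:, c:])
        freq_branch = torch.fft.irfft2(spec, s=(h, w), norm="ortho")
        return self.act(self.spatial(freq_branch) + x)       # fuse with the spatial path, residual
```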

4. Experiments and Analysis

4.1. Datasets

We evaluate AsyFusionNet on two benchmarks, M3FD [48] and VEDAI [49]. The M3FD dataset is designed for multimodal and multispectral object detection and was released together with the TarDAL fusion framework [48]; it is mainly used to assess detection algorithms in complex environments. M3FD contains images captured under diverse environmental and weather conditions, including daytime, nighttime, rain, and haze, covering a wide range of challenging scenarios. The dataset comprises about 3000 images with a resolution of 1280 × 720 and includes different types of vehicles such as cars, trucks, and buses, together with their bounding boxes and category labels. In addition, M3FD provides paired multimodal data, i.e., aligned RGB and IR images, which facilitates cross-modal detection studies.
The VEDAI dataset is specifically designed for vehicle detection in aerial imagery and was released by the University of Caen, France, in 2015. It contains various categories of vehicles observed from an aerial viewpoint and reflects different illumination conditions, occlusion patterns, and viewing directions. VEDAI provides images in two spectral bands, RGB and IR, in order to enhance the robustness of vehicle detection algorithms in complex environments. The dataset consists of 1246 high-resolution images with resolutions of either 512 × 512 or 1024 × 1024 , each annotated with accurate vehicle locations and class labels. Vehicle categories include cars, pickups, trucks, and several other types, covering a broad range of environments and scene configurations.
In our experiments, both M 3 FD and VEDAI are split into training, validation, and test sets with a 7:2:1 ratio. Since these are public multimodal benchmarks with well-aligned RGB–IR pairs, we do not perform additional registration beyond the common resize and pad operations applied to both modalities. The data augmentation strategy is kept consistent with the baseline detectors.
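For reference, a 7:2:1 split of aligned RGB–IR pairs can be produced as in the sketch below; the directory layout and the .png extension are assumptions made purely for illustration.

```python
import random
from pathlib import Path


def split_pairs(rgb_dir, ir_dir, seed=0, ratios=(0.7, 0.2, 0.1)):
    """Split aligned RGB-IR image pairs into train/val/test with a 7:2:1 ratio (illustrative)."""
    stems = sorted(p.stem for p in Path(rgb_dir).glob("*.png") if (Path(ir_dir) / p.name).exists())
    random.Random(seed).shuffle(stems)                 # deterministic shuffle for reproducibility
    n_train, n_val = int(ratios[0] * len(stems)), int(ratios[1] * len(stems))
    return {
        "train": stems[:n_train],
        "val": stems[n_train:n_train + n_val],
        "test": stems[n_train + n_val:],
    }
```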
As illustrated in Figure 6, the normalized scale distribution of object instances in M 3 FD and VEDAI exhibits markedly different characteristics. For M 3 FD, the scatter points span a wide range, covering targets from extremely small to medium and large scales, with a clear long-tail pattern and multiple clustered groups. This indicates stronger scale diversity and larger intra-class scale variance. In contrast, the scatter distribution of VEDAI is highly concentrated in the small-scale region, and the color distribution is dominated by a few categories, revealing significant class imbalance. These factors pose substantial challenges for robust object detection.

4.2. Evaluation Metrics and Experimental Settings

4.2.1. Evaluation Metrics

In our experiments, the detection performance is evaluated in terms of precision (P), recall (R), average precision (AP) for each class, and mean average precision (mAP) over all classes. All predicted bounding boxes are treated as positive samples. According to their relationship with the ground truth, TP (true positive) denotes the number of correctly detected positive samples, FP (false positive) denotes the number of incorrectly detected positive samples, and FN (false negative) denotes the number of missed positive samples, i.e., ground truth objects that are not detected [50].
Precision measures the proportion of correct predictions among all positive predictions and reflects the reliability of the detector, whereas recall measures the proportion of correctly detected objects among all ground truth objects and reflects the detector’s ability to find as many targets as possible. These two metrics are closely related to the probabilities of false alarms and missed detections. The area under the precision–recall (P–R) curve for each class is taken as its average precision (AP), and the mean of AP over all classes yields the mean average precision (mAP). The formulas are given as follows:
$\mathrm{Precision} = \dfrac{TP}{TP + FP}, \quad \mathrm{Recall} = \dfrac{TP}{TP + FN}, \quad \mathrm{AP} = \int_{0}^{1} \mathrm{Precision}(t)\, \mathrm{d}t, \quad \mathrm{mAP} = \dfrac{1}{N} \sum_{n=1}^{N} \mathrm{AP}_n .$
In this way, mAP can jointly account for the detection performance across multiple categories and avoids the limitations of single-class evaluation metrics. It provides a comprehensive and reliable measure of an object detector’s performance, is naturally suited to multi-class scenarios, reflects the stability of the model, and is straightforward to interpret and compare. Consequently, mAP has become one of the most widely used evaluation metrics in object detection.
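For completeness, the snippet below shows one common way of computing per-class AP as the area under the P–R curve and averaging it into mAP. It is a generic reference implementation and may differ in minor details (e.g., the interpolation scheme) from the evaluation toolkit used in our experiments.

```python
import numpy as np


def average_precision(scores, is_tp, num_gt):
    """Area under the P-R curve for one class, given per-prediction confidences and TP flags."""
    order = np.argsort(-np.asarray(scores))            # rank predictions by confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-12)
    # monotone precision envelope, as used by common detection toolkits
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    return float(np.trapz(precision, recall))


def mean_average_precision(per_class_aps):
    """mAP is simply the mean of the per-class AP values."""
    return float(np.mean(per_class_aps))
```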
We also report several additional indicators. Parameters denote the number of trainable weights in the model. GFLOPs measure the number of floating-point operations required during inference and serve as an important indicator of computational complexity.

4.2.2. Experimental Environment

To ensure fair training and comparison, all ablation studies and training procedures were conducted on the Supercomputing Center of Xijing University. The GPU is an NVIDIA A800 with 80 GB of VRAM and the CPU is an Intel Xeon 6338N [51], sourced from NVIDIA Corporation and Intel Corporation (Santa Clara, CA, USA), respectively. The operating system is Red Hat 4.8.5-28. All models are trained under an environment configured with CUDA 12.1, Python 3.11, and PyTorch 2.1. The configuration used for training and testing is summarized in Table 1.
We train for 300 epochs with a batch size of 16 and use 8 data loading workers. The input image resolution is fixed at 640 × 640. Stochastic gradient descent (SGD) is adopted as the optimizer, with the learning rate gradually decayed from a maximum of $1 \times 10^{-3}$ to a minimum of $1 \times 10^{-5}$. A weight decay of $5 \times 10^{-4}$ is applied to mitigate overfitting, and the momentum is set to 0.937. In addition, an early stopping strategy is employed: when the validation loss tends to plateau and the model reaches a quasi-converged state, training is automatically terminated to prevent overfitting. Unless otherwise specified, all compared algorithms are trained and evaluated using their official default hyperparameters. The network structure details of AsyFusionNet are shown in Table 2.
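The reported optimizer settings can be reproduced with a few lines of PyTorch; the cosine decay schedule below is an assumption, since only the maximum and minimum learning rates are specified in the text.

```python
import torch


def build_optimizer_and_scheduler(model, epochs=300):
    """SGD with lr 1e-3 -> 1e-5, momentum 0.937, weight decay 5e-4 (cosine decay is assumed)."""
    optimizer = torch.optim.SGD(
        model.parameters(), lr=1e-3, momentum=0.937, weight_decay=5e-4
    )
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs, eta_min=1e-5
    )
    return optimizer, scheduler
```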

4.3. Ablation Studies

To assess the effectiveness of the proposed components, we perform ablation studies on both datasets. As a baseline, we construct a dual-branch backbone by extending CSPDarknet53 in YOLOv11 [52]. In this model, RGB and IR features at corresponding pyramid levels (P3–P5) are fused in a layer-wise manner via a generic concatenation operator, and the resulting features are subsequently fed into the neck and detection head, as illustrated in Figure 1a.
Table 3 summarizes the ablation results for different RGB/IR backbone depth configurations. With the symmetric P5/P5 setting as the baseline, the model achieves 78.9% mAP50 on M3FD and 46.6% mAP50 on VEDAI, with 60.0 M and 140.5 M parameters, respectively. Switching to the asymmetric P5/P4 configuration reduces the parameter count to 52.4 M on M3FD and 137.7 M on VEDAI, while improving mAP50 to 79.3% and 50.3%. This indicates that increasing the IR depth to P5 mainly introduces redundant parameters with limited accuracy gain.
Further truncating the IR branch to P3 leads to a substantial performance drop (68.5% on M3FD and 42.6% on VEDAI), suggesting that overly shallow IR features cannot provide sufficient thermal information for effective fusion. Conversely, making the IR branch deeper than the RGB branch (P4/P5) also degrades mAP50 (to 72.1% and 40.9%), implying that blindly increasing IR depth disrupts semantic alignment between modalities. Overall, the P5/P4 design offers the best trade-off between model size and detection accuracy, supporting the effectiveness of the proposed asymmetric architecture.
Based on the baseline model described above, we conduct ablation studies by incrementally adding each module to evaluate its impact on detection accuracy and computational efficiency. The experimental results are shown in Table 4. On the M3FD dataset, the baseline model achieves 78.9% mAP50 with 60 M parameters. After introducing HAM, the performance increases to 84.8%, while the number of parameters drops to 32.4 M, indicating that HAM simultaneously improves accuracy and reduces model complexity. Adding FS2M further boosts mAP50 to 85.4%. Although the parameter count increases to 57.7 M, FS2M refines feature selection and enhancement, yielding an additional gain. Finally, incorporating LGAF raises mAP50 on M3FD to 86.1% with 59.1 M parameters, providing the best trade-off between detection accuracy and efficiency among all configurations.
On the VEDAI dataset, the overall performance is lower than on M3FD, reflecting the higher difficulty of this aerial vehicle detection benchmark. The baseline model attains only 46.6% mAP50. After integrating the full set of HAM, FS2M, and LGAF, mAP50 increases to 54.3%, corresponding to a gain of 7.7 percentage points. The ablation results show that each component contributes positively, with HAM providing the largest improvement by substantially increasing accuracy while reducing the parameter count. FS2M and LGAF offer progressive refinements, further enhancing feature extraction and multi-scale representation learning, and together yield a more powerful and robust multimodal detector.
To further justify the lightweight property of AsyFusionNet, we report both theoretical complexity and practical runtime in Table 3 and Table 4. The asymmetric backbone reduces the number of parameters compared with the symmetric baseline (from 60.0 M to 52.4 M on M3FD and from 140.5 M to 137.7 M on VEDAI) without increasing the backbone GFLOPs. Moreover, even after introducing the FFT-based FS2M and LGAF modules, the full model still runs at 137.3 FPS on M3FD and 92.1 FPS on VEDAI, corresponding to about 7.3 ms and 12.2 ms per image, respectively. Although FS2M slightly decreases the FPS compared to the variant without FS2M, it consistently improves mAP50:95, and the overall throughput remains far above the common real-time threshold of 30 FPS. In this work, we therefore refer to AsyFusionNet as lightweight in the sense that it maintains a compact parameter count and high frame rate while achieving superior detection accuracy.
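The FPS and per-image latency figures discussed above are typically obtained with a simple timed loop such as the sketch below; batch size 1, random 640 × 640 inputs, and the warm-up/iteration counts are assumptions, as the exact measurement protocol is not part of the text.

```python
import time
import torch


@torch.no_grad()
def measure_fps(model, device="cuda", img_size=640, warmup=20, iters=200):
    """Rough throughput estimate for a two-input (RGB, IR) detector at batch size 1."""
    model.eval().to(device)
    sync = torch.cuda.synchronize if device.startswith("cuda") else (lambda: None)
    rgb = torch.randn(1, 3, img_size, img_size, device=device)
    ir = torch.randn(1, 3, img_size, img_size, device=device)
    for _ in range(warmup):                 # warm up kernels / allocator before timing
        model(rgb, ir)
    sync()
    start = time.perf_counter()
    for _ in range(iters):
        model(rgb, ir)
    sync()
    return iters / (time.perf_counter() - start)  # frames per second
```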

4.4. Comparative Experiments

4.4.1. Results on the M3FD Dataset

We conduct a comprehensive quantitative comparison between AsyFusionNet and several representative SOTA detectors, as reported in Table 5. The compared methods include the classical YOLOv8, YOLOv10, YOLOv11, RT-DETR, and Swin Transformer, as well as multimodal fusion detectors such as CenterNet2, Sparse R-CNN, CDDFusion, and MM-DETR. All methods are evaluated under the same experimental setup, using mAP50 and mAP50:95 as the primary metrics, and are tested under three configurations: RGB, IR, and multimodal fusion (Multi).
The results show that AsyFusionNet achieves the best performance in the multimodal configuration, with mAP50 reaching 86.3% and mAP50:95 reaching 58.2%, significantly outperforming all competing approaches. For example, the multimodal version of YOLOv8 attains 78.0% mAP50 and 51.7% mAP50:95. Compared with this baseline, AsyFusionNet brings gains of 8.3% in mAP50 and 6.5% in mAP50:95, fully demonstrating the superiority of the proposed fusion strategy.
From the single-modality results, AsyFusionNet also delivers strong performance: it achieves 82.0% mAP50 and 53.4% mAP50:95 in the RGB modality, and 78.5% mAP50 and 51.3% mAP50:95 in the IR modality, all clearly exceeding the single-modality results of other methods. It is worth noting that the traditional YOLO series exhibits relatively limited gains after multimodal fusion, whereas our method realizes substantial improvements through an effective cross-modal feature fusion mechanism. In particular, compared with the Swin Transformer, AsyFusionNet improves multimodal mAP50 by 13.8% and mAP50:95 by 17.1%, highlighting the unique advantage of the proposed asymmetric fusion architecture in handling multimodal information.
Moreover, by comparing performance across different modalities, we observe that most methods perform better in the RGB setting than in the IR setting, which is consistent with the lower intrinsic information density of infrared images. In contrast, AsyFusionNet fully exploits the complementary signals of the two modalities and, after multimodal fusion, achieves a substantial performance boost beyond any single modality. This validates the effectiveness and robustness of the proposed cross-modal fusion strategy.
Figure 7 shows detection results of AsyFusionNet on the M3FD dataset, where the left column in each group is the RGB image and the right column is the IR image. As can be seen, the proposed method stably localizes different categories of targets in both modalities and maintains high confidence scores, mostly in the range 0.7–0.9, even under low illumination and strong backlight, effectively suppressing background clutter such as reflections, foliage textures, and building structures. The detector also clearly delineates distant pedestrians and other small, long-range targets, demonstrating strong robustness in such cases.
We further compare the attention distributions of the baseline and the proposed method in multimodal image fusion via visualization-based analysis. Figure 8 presents two representative pairs of RGB and IR images from the M3FD dataset together with their corresponding heatmaps. These scenes cover typical applications such as complex natural environments, human detection, and challenging water surface reflections.
From the heatmaps of the baseline method, it can be observed that conventional fusion strategies suffer from attention dispersion. In particular, the activation regions are overly spread out and lack precise focus on key target areas. In the first natural scene, the baseline’s attention is almost uniformly distributed over the entire image, failing to effectively distinguish foreground targets from the background. In the corresponding infrared scene, although the baseline can roughly highlight the target region, the attention boundaries are blurry and contain a large amount of redundant activation, which severely degrades detection accuracy. In the second lake scene, the baseline is easily disturbed by background factors such as water surface reflections and vegetation textures, causing the attention mechanism to break down and preventing accurate focus on the true targets.
In contrast, AsyFusionNet exhibits clearly superior behavior, with its attention distribution being much more accurate and targeted. Benefiting from the proposed asymmetric fusion architecture, our method can adaptively integrate the complementary cues from RGB and IR modalities to form more robust feature representations. Under the same test scenarios, the heatmaps produced by AsyFusionNet show well-defined target contours and precise boundary localization, while spurious activations in background regions are effectively suppressed. Notably, in the first human detection example, the proposed method accurately outlines the entire human body and remains sensitive to fine details. In the challenging water surface scene, AsyFusionNet, through its cross-modal attention design, successfully overcomes the interference of reflections and complex background textures, achieving precise localization of the true target regions.

4.4.2. Results on the VEDAI Dataset

The experimental results on the VEDAI dataset are reported in Table 6, where we compare the proposed method with the RGB, IR, and multimodal variants of three generations of general-purpose detectors: YOLOv8, YOLOv10, and YOLOv11. Overall, multimodal fusion consistently outperforms single-modality configurations. Taking YOLOv8/10/11 as examples, their Multi versions improve mAP50 over the RGB counterparts by +2.1/+6.2/+1.8%, and over the IR counterparts by +6.3/+6.9/+3.5%, respectively. This confirms the strong complementarity between RGB and IR in terms of appearance textures and thermal radiation cues.
Among all methods, the multimodal version of AsyFusionNet achieves the best overall performance, with 54.1% mAP50 and 34.7% mAP50:95. These results surpass the strongest YOLOv8–Multi (52.2%/32.2%) by 1.9% and 2.5%, respectively, and yield gains of +7.2/+5.1 and +6.4/+7.2 over YOLOv10–Multi and YOLOv11–Multi. AsyFusionNet shows particularly notable advantages on difficult categories such as boats and tractors, which are easily affected by background clutter or large shape variations. On boats, the improvement over the YOLO multimodal baselines is the most pronounced: +7.4 compared with YOLOv10–Multi and more than +20.0 compared with YOLOv8/11–Multi. This indicates that the proposed cross-modal selection and fusion mechanism can effectively exploit RGB texture details together with the stability of IR thermal imaging, thereby enhancing discrimination for weak-texture or low-contrast targets. For major categories such as car, pickup, and van, AsyFusionNet achieves performance comparable to or better than the best baselines, which in turn leads to consistently higher overall mAP. Although for a few categories (e.g., trucks) some YOLO variants attain slightly higher scores, these improvements tend to be unstable, whereas AsyFusionNet delivers more balanced performance across categories and contributes to a higher overall mAP50:95.
The single-modality comparison further reveals category-dependent modality advantages. For vehicle-like classes such as car and pickup, IR often matches or even surpasses RGB, reflecting the robustness of thermal cues in nighttime or low-contrast conditions. However, for categories such as boats and others, where the temperature difference from the background is small, IR is clearly inferior to RGB, and visible-domain texture and edge information become more crucial. Multimodal fusion can naturally combine the strengths of both modalities and compensate for their respective blind spots, leading to stable improvements in overall performance. In summary, multimodal detection is consistently superior to single-modality detection, and the proposed AsyFusionNet, through adaptive feature selection and fusion, achieves clear gains on both difficult categories and aggregate metrics, demonstrating better cross-modal generalization and robustness.
Detection examples on the VEDAI dataset cover typical aerial scenarios such as residential roads, parking lots, and open highways, which involve a variety of challenges including low contrast, shadow occlusion, repetitive textures, and distant small targets. As shown in Figure 9, YOLOv11 produces multiple false alarms caused by high-intensity ground textures, road markings, and roof reflections, marked by red triangles. It also suffers from missed detections and inaccurate localization under partial occlusion and low-contrast backgrounds, for example with bounding boxes that extend beyond the true object extent and category confusions where cars are misclassified as pickup or camping vehicles. In contrast, the proposed method suppresses background clutter more reliably and recovers small targets more faithfully. In both parking lot scenes and long-range road scenes, it can still produce accurate bounding boxes with high confidence scores for vehicles that are small in size or partially covered by shadows, and the category predictions are more consistent. In addition, the bounding boxes generated by our method adhere more tightly to the minimum enclosing rectangles of vehicles, reflecting higher localization accuracy and robustness.

5. Conclusions

In this paper, we proposed AsyFusionNet, a multimodal object detection framework whose core innovation lies in an asymmetric dual-branch architecture that explicitly accounts for the different imaging mechanisms of RGB and IR sensors. By extending the RGB branch to the P5 layer while truncating the IR branch at P4, the model reduces computational cost without sacrificing detection accuracy. The LGAF performs joint selective enhancement in the spatial and channel dimensions through parallel local–global attention modeling and a three-branch feature concatenation scheme, thereby improving texture fidelity and semantic consistency. The HAM, tailored to the RGB branch, employs dynamic kernel selection and cross-layer information sharing to better capture both local textures and global structural patterns. The FS2M, designed specifically for infrared features, leverages frequency-domain dynamic convolution to jointly model frequency, spatial, and spectral cues, effectively capturing global thermal radiation patterns and low-frequency energy distributions in infrared images. Extensive experiments on multiple public datasets validate the effectiveness and superiority of AsyFusionNet. Compared with existing mainstream multimodal detection methods, the proposed framework achieves a more favorable accuracy–speed trade-off and exhibits stronger robustness, particularly in complex backgrounds and small-object detection scenarios. Ablation studies further confirm the contribution of each component, showing that the proposed modules work synergistically to boost overall detection performance.
Nonetheless, several limitations remain. First, although the asymmetric architecture improves computational efficiency, truncating the IR branch may lead to the loss of high-level semantic information in certain scenarios, potentially affecting the recognition of more complex targets. Second, the frequency-domain transforms in FS2M still incur non-negligible computational overhead for large input resolutions, which constrains its applicability to ultra-high-resolution imagery. In addition, the current framework is mainly optimized for RGB–IR fusion, and its adaptability to other modality combinations, such as RGB–depth and RGB–LiDAR, requires further investigation. Compared with RGB–IR fusion, RGB–depth and RGB–LiDAR fusion introduce additional challenges such as heterogeneous spatial resolution, sparse and irregular sampling (for LiDAR), missing depth measurements, and stricter requirements on geometric alignment and calibration, which may make simple extensions of the current design suboptimal. Improving robustness under extreme weather and severe occlusion remains an important direction for future work.
Future work will focus on exploring adaptive asymmetric architectures that dynamically adjust the depth and complexity of each branch according to application scenarios, so as to better balance accuracy and efficiency; developing more efficient frequency-domain processing schemes to reduce the computational burden of FS2M and enhance its scalability to high-resolution inputs; extending the framework to support a broader range of modality combinations and cross-domain applications; incorporating adversarial training or domain adaptation techniques to improve robustness under adverse conditions; and integrating knowledge distillation and model compression to further optimize deployment efficiency on edge devices with real-time requirements. In addition, we plan to conduct a more comprehensive analysis, including a comparison between symmetric and asymmetric designs in challenging scenarios such as dense fog, severe occlusion, and small-object detection. These directions will help push multimodal object detection toward more practical and intelligent applications.

Author Contributions

Conceptualization, J.L. and J.G.; Methodology, J.L., X.L., and P.L.; Software, J.L., X.L., and P.S.; Validation, J.T., C.G. and J.M.; Visualization, J.M., P.S. and P.L.; Writing—original draft, J.L., X.L. and P.S.; Writing—review and editing, J.G., J.T., J.M. and C.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The M3FD dataset is available at https://github.com/JinyuanLiu-CV/TarDAL (accessed on 10 May 2025); the VEDAI dataset is available at https://downloads.greyc.fr/vedai/ (accessed on 15 July 2025).

Acknowledgments

The authors acknowledge the referees and the editor for carefully reading this paper and giving many helpful comments. The authors also express their gratitude to the reviewers for their insightful comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Thawani, P.V.; Ajmre, P.E.; Chaurasia, S.; Ponnusamy, S. A Comprehensive Analysis of Deep Learning and Machine Learning for Semantic Segmentation, and Object Detection in Machine and Robotic Vision. Appl. Comput. Vis. Through Artif. Intell. 2025, 377–391. [Google Scholar]
  2. Sun, Y.; Sun, Z.; Chen, W. The evolution of object detection methods. Eng. Appl. Artif. Intell. 2024, 133, 108458. [Google Scholar] [CrossRef]
  3. Yao, H.; Zhang, Y.; Jian, H.; Zhang, L.; Cheng, R. Nighttime pedestrian detection based on Fore-Background contrast learning. Knowl.-Based Syst. 2023, 275, 110719. [Google Scholar] [CrossRef]
  4. Li, H.; Hu, Q.; Zhou, B.; Yao, Y.; Lin, J.; Yang, K.; Chen, P. CFMW: Cross-modality Fusion Mamba for Robust Object Detection under Adverse Weather. IEEE Trans. Circuits Syst. Video Technol. 2025, 1, 12066–12081. [Google Scholar] [CrossRef]
  5. Pandian V, B.; Prasath, T.A.; Rajasekaran, M.P. Dehazing, enhancing the boundaries and corners in hazed images using optimal adaptive technique. Int. J. Image Data Fusion 2024, 15, 414–429. [Google Scholar] [CrossRef]
  6. Torabi, A.; Massé, G.; Bilodeau, G.A. An iterative integrated framework for thermal–visible image registration, sensor fusion, and people tracking for video surveillance applications. Comput. Vis. Image Underst. 2012, 116, 210–221. [Google Scholar] [CrossRef]
  7. Hou, Z.; Yang, C.; Sun, Y.; Ma, S.; Yang, X.; Fan, J. An object detection algorithm based on infrared-visible dual modal feature fusion. Infrared Phys. Technol. 2024, 137, 105107. [Google Scholar] [CrossRef]
  8. Zhirui, Y.; Zhanpeng, Y.; Junyu, W.; Liang, Z.; Yuanxin, Y. Multimodal object detection method using adaptive fusion of infrared and visible features. Natl. Remote Sens. Bull. 2025, 29, 3006–3019. [Google Scholar] [CrossRef]
  9. Koca, R.; Koca, Y.B. Anatomy-Based Assessment of Spinal Posture Using IMU Sensors and Machine Learning. Sensors 2025, 25, 5963. [Google Scholar] [CrossRef]
  10. Koca, Y.B.; Gökçe, B.; Aslan, Y. Optimization of traction power conservation and energy efficiency in agricultural mobile robots using the TECS algorithm. Sci. Rep. 2025, 15, 32838. [Google Scholar] [CrossRef]
  11. Ma, J.; Ma, Y.; Li, C. Infrared and visible image fusion methods and applications: A survey. Inf. Fusion 2019, 45, 153–178. [Google Scholar] [CrossRef]
  12. Zhang, X.; Ye, P.; Xiao, G. VIFB: A visible and infrared image fusion benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 13–19 June 2020; pp. 104–105. [Google Scholar]
  13. Zhou, K.; Chen, L.; Cao, X. Improving multispectral pedestrian detection by addressing modality imbalance problems. In Proceedings of the European Conference on Computer Vision; Springer: London, UK, 2020; pp. 787–803. [Google Scholar]
  14. Konig, D.; Adam, M.; Jarvers, C.; Layher, G.; Neumann, H.; Teutsch, M. Fully convolutional region proposal networks for multispectral person detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 49–56. [Google Scholar]
  15. Wagner, J.; Fischer, V.; Herman, M.; Behnke, S. Multispectral pedestrian detection using deep fusion convolutional neural networks. In Proceedings of the ESANN, Bruges, Belgium, 27–29 April 2016; Volume 587, pp. 509–514. [Google Scholar]
  16. Zhang, H.; Fromont, E.; Lefevre, S.; Avignon, B. Multispectral fusion for object detection with cyclic fuse-and-refine blocks. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 276–280. [Google Scholar]
  17. Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Drone-based RGB-infrared cross-modality vehicle detection via uncertainty-aware learning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6700–6713. [Google Scholar]
  18. deMas Giménez, G.; Subirana, A.; García-Gómez, P.; Bernal, E.; Casas, J.R.; Royo, S. Multimodal sensing prototype for robust autonomous driving under adverse weather conditions. In Proceedings of the Optical Measurement Systems for Industrial Inspection XIV, SPIE, Munich, Germany, 23–27 June 2025; Volume 13567, pp. 520–528. [Google Scholar]
  19. Wu, W.; Deng, X.; Jiang, P.; Wan, S.; Guo, Y. CrossFuser: Multi-modal feature fusion for end-to-end autonomous driving under unseen weather conditions. IEEE Trans. Intell. Transp. Syst. 2023, 24, 14378–14392. [Google Scholar]
  20. Li, H.; Manjunath, B.; Mitra, S.K. Multisensor image fusion using the wavelet transform. Graph. Model. Image Process. 1995, 57, 235–245. [Google Scholar] [CrossRef]
  21. Naidu, V. Image fusion technique using multi-resolution singular value decomposition. Def. Sci. J. 2011, 61, 479–484. [Google Scholar] [CrossRef]
  22. Qiu, Y.; Wu, J.; Huang, H.; Wu, H.; Liu, J.; Tian, J. Multi-sensor image data fusion based on pixel-level weights of wavelet and the PCA transform. In Proceedings of the IEEE International Conference Mechatronics and Automation, Online, 29 July–1 August 2005; Volume 2, pp. 653–658. [Google Scholar]
  23. Wu, J.; Qiu, Y.; Liu, J.; Tian, J. Multisensor image fusion using multiresolution analysis and pixel-level weights. In Proceedings of the Electronic Imaging and Multimedia Technology IV, SPIE, Beijing, China, 8–11 November 2005; Volume 5637, pp. 65–76. [Google Scholar]
  24. Zheng, X.; Yang, Q.; Si, P.; Wu, Q. A multi-stage visible and infrared image fusion network based on attention mechanism. Sensors 2022, 22, 3651. [Google Scholar] [CrossRef]
  25. Xie, Y.; Zhang, L.; Yu, X.; Xie, W. YOLO-MS: Multispectral object detection via feature interaction and self-attention guided fusion. IEEE Trans. Cogn. Dev. Syst. 2023, 15, 2132–2143. [Google Scholar]
  26. Wang, Z.; Zhang, Q. Real-Time Aerial Multispectral Object Detection with Dynamic Modality-Balanced Pixel-Level Fusion. Sensors 2025, 25, 3039. [Google Scholar]
  27. Li, C.; Song, D.; Tong, R.; Tang, M. Illumination-aware faster R-CNN for robust multispectral pedestrian detection. Pattern Recognit. 2019, 85, 161–171. [Google Scholar] [CrossRef]
  28. Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3156–3164. [Google Scholar]
  29. Chen, Y.; Li, J.; Xiao, H.; Jin, X.; Yan, S.; Feng, J. Dual path networks. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  30. Li, S.; Li, Y.; Li, Y.; Li, M.; Xu, X. Yolo-firi: Improved yolov5 for infrared image object detection. IEEE Access 2021, 9, 141861–141875. [Google Scholar] [CrossRef]
  31. Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Detfusion: A detection-driven infrared and visible image fusion network. In Proceedings of the 30th ACM International Conference on Multimedia, Lisbon, Portugal, 10–14 October 2022; pp. 4003–4011. [Google Scholar]
  32. Chen, Y.; Yuan, X.; Wang, J.; Wu, R.; Li, X.; Hou, Q.; Cheng, M.M. YOLO-MS: Rethinking multi-scale representation learning for real-time object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 4240–4252. [Google Scholar] [CrossRef]
  33. Wang, Z.; Liao, X.; Yuan, J.; Yao, Y.; Li, Z. Cdc-yolofusion: Leveraging cross-scale dynamic convolution fusion for visible-infrared object detection. IEEE Trans. Intell. Veh. 2024, 10, 2080–2093. [Google Scholar] [CrossRef]
  34. Wang, H.; Wang, C.; Fu, Q.; Si, B.; Zhang, D.; Kou, R.; Yu, Y.; Feng, C. YOLOFIV: Object detection algorithm for around-the-clock aerial remote sensing images by fusing infrared and visible features. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 15269–15287. [Google Scholar] [CrossRef]
  35. Tang, L.; Yuan, J.; Zhang, H.; Jiang, X.; Ma, J. PIAFusion: A progressive infrared and visible image fusion network based on illumination aware. Inf. Fusion 2022, 83, 79–92. [Google Scholar] [CrossRef]
  36. Hao, L.; Xu, L.; Liu, C.; Dong, Y. LASFNet: A Lightweight Attention-Guided Self-Modulation Feature Fusion Network for Multimodal Object Detection. arXiv 2025, arXiv:2506.21018. [Google Scholar]
  37. Yang, Q.; Zhang, Y.; Zhao, Z.; Zhang, J.; Zhang, S. IAIFNet: An illumination-aware infrared and visible image fusion network. IEEE Signal Process. Lett. 2024, 31, 1374–1378. [Google Scholar] [CrossRef]
  38. Zou, W.; Gao, H.; Yang, W.; Liu, T. Wave-mamba: Wavelet state space model for ultra-high-definition low-light image enhancement. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, 28 October–1 November 2024; pp. 1534–1543. [Google Scholar]
  39. Chen, X.; Jin, S.; Zhao, L.; Yang, C.; Zhang, D.; Wang, X.; He, X.; Wang, H.; Chen, Z.; Zheng, Z. Mask Guided Frequency Feature Fusion for Visible-Infrared Remote Sensing Object Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5407415. [Google Scholar] [CrossRef]
  40. Jiang, M.; Wang, Z.; Kong, J.; Zhuang, D. MCFusion: Infrared and visible image fusion based multiscale receptive field and cross-modal enhanced attention mechanism. J. Electron. Imaging 2024, 33, 013039. [Google Scholar] [CrossRef]
  41. Li, K.; Wang, D.; Hu, Z.; Li, S.; Ni, W.; Zhao, L.; Wang, Q. Fd2-net: Frequency-driven feature decomposition network for infrared-visible object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 4797–4805. [Google Scholar]
  42. Xiao, G.; Tang, Z.; Guo, H.; Yu, J.; Shen, H.T. FAFusion: Learning for infrared and visible image fusion via frequency awareness. IEEE Trans. Instrum. Meas. 2024, 73, 1–11. [Google Scholar] [CrossRef]
  43. Zhao, T.; Yuan, M.; Jiang, F.; Wang, N.; Wei, X. Removal then selection: A coarse-to-fine fusion perspective for RGB-infrared object detection. arXiv 2024, arXiv:2401.10731. [Google Scholar] [CrossRef]
  44. Zeng, Y.; Liang, T.; Jin, Y.; Li, Y. MMI-Det: Exploring multi-modal integration for visible and infrared object detection. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 11198–11213. [Google Scholar] [CrossRef]
  45. Luo, M.; Zhao, R.; Zhang, S.; Chen, L.; Shao, F.; Meng, X. IM-CMDet: An Intra-Modal Enhancement and Cross-Modal Fusion Network for Small Object Detection in UAV Aerial RGBT Imagery. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5008316. [Google Scholar] [CrossRef]
  46. Xu, S.; Zheng, S.; Xu, W.; Xu, R.; Wang, C.; Zhang, J.; Teng, X.; Li, A.; Guo, L. Hcf-net: Hierarchical context fusion network for infrared small object detection. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6. [Google Scholar]
  47. Chen, L.; Gu, L.; Li, L.; Yan, C.; Fu, Y. Frequency Dynamic Convolution for Dense Image Prediction. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 10–17 June 2025; pp. 30178–30188. [Google Scholar]
  48. Liu, J.; Fan, X.; Huang, Z.; Wu, G.; Liu, R.; Zhong, W.; Luo, Z. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5802–5811. [Google Scholar]
  49. Razakarivony, S.; Jurie, F. Vehicle detection in aerial imagery: A small target detection benchmark. J. Vis. Commun. Image Represent. 2016, 34, 187–203. [Google Scholar] [CrossRef]
  50. Liu, J.; Jing, D.; Zhang, H.; Dong, C. SRFAD-Net: Scale-Robust Feature Aggregation and Diffusion Network for Object Detection in Remote Sensing Images. Electronics 2024, 13, 2358. [Google Scholar] [CrossRef]
  51. Liu, J.; Jing, D.; Cao, Y.; Wang, Y.; Guo, C.; Shi, P.; Zhang, H. Lightweight Progressive Fusion Calibration Network for Rotated Object Detection in Remote Sensing Images. Electronics 2024, 13, 3172. [Google Scholar] [CrossRef]
  52. Jocher, G.; Qiu, J. Ultralytics YOLO11. Version 11.0.0, License AGPL-3.0. 2024. Available online: https://github.com/ultralytics/ultralytics (accessed on 5 December 2024).
  53. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
  54. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  55. Zhou, X.; Krähenbühl, P. Joint COCO and LVIS workshop at ECCV 2020: LVIS challenge track technical report: Centernet2. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  56. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C.; et al. Sparse r-cnn: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 14454–14463. [Google Scholar]
  57. Zhao, Z.; Bai, H.; Zhang, J.; Zhang, Y.; Xu, S.; Lin, Z.; Timofte, R.; Van Gool, L. Cddfuse: Correlation-driven dual-branch feature decomposition for multi-modality image fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5906–5916. [Google Scholar]
  58. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
  59. Han, J.; Wang, Y.; Zhang, Y.; Chen, L. MM-DETR: An Efficient Multimodal Detection Transformer with Mamba-Driven Dual-Granularity Fusion and Frequency-Aware Modality Adapters. arXiv 2025, arXiv:2512.00363. [Google Scholar]
Figure 1. Comparison of feature-level fusion strategies for one-stage detectors: (a) traditional feature-level fusion; (b) the asymmetric feature-level fusion proposed in this paper.
Figure 2. Overview of the proposed AsyFusionNet framework. Multimodal inputs (RGB and IR images) are processed through multi-stage feature blocks (P1–P5), adaptively fused via the LGAF and PAFPN modules, and finally fed into the detection head to produce object predictions.
Figure 3. Illustration of the LGAF fusion and LocalAtt/GlobalAtt mechanism. Parallel convolutional branches produce local (p = 2) and global (p = 4) attention, which are fused via residual addition and concatenation to form multi-scale features.
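To make the caption above more concrete, the hedged sketch below gates the summed RGB and IR features with attention maps computed on p = 2 and p = 4 average-pooled views and concatenates the two gated branches with the residual sum, echoing the three-branch concatenation described in the conclusion. The class names (PooledAttention, LGAFSketch) and the exact layer choices are assumptions, not the paper's LGAF implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PooledAttention(nn.Module):
    """Spatial-channel gate computed on a p x p average-pooled view of the feature map."""
    def __init__(self, channels, p):
        super().__init__()
        self.p = p
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        a = F.avg_pool2d(x, self.p)                        # coarse view: p = 2 (local) or p = 4 (global)
        a = torch.sigmoid(self.conv(a))                    # attention weights in [0, 1]
        a = F.interpolate(a, size=(h, w), mode="nearest")  # back to the original resolution
        return x * a                                       # gated features

class LGAFSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.local_att = PooledAttention(channels, p=2)
        self.global_att = PooledAttention(channels, p=4)
        self.out = nn.Conv2d(3 * channels, channels, 1)    # lightweight reorganisation of the three branches

    def forward(self, rgb, ir):
        x = rgb + ir                                       # residual-style combination of the two modalities
        y = torch.cat([x, self.local_att(x), self.global_att(x)], dim=1)
        return self.out(y)

# Usage on P2-sized features (640x640 input)
fused = LGAFSketch(256)(torch.randn(1, 256, 160, 160), torch.randn(1, 256, 160, 160))
```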
Figure 4. Schematic of the Hierarchical Adaptive Mixer (HAM). (a) The HAM block consists of a normalization layer, CGLU, and a HAConv for cross-layer information fusion. (b) HAConv extracts multi-scale features via multi-branch convolutions and performs dynamic cross-scale aggregation by weighting different branches with adaptive coefficients.
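The dynamic kernel selection in HAConv can be approximated by a set of depthwise convolutions with different kernel sizes whose outputs are mixed by input-dependent weights, in the spirit of selective-kernel attention. The kernel sizes, the gating network, and the name HAConvSketch below are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class HAConvSketch(nn.Module):
    """Multi-branch depthwise convolutions mixed by adaptive, input-dependent weights."""
    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes])                       # one depthwise branch per kernel size
        self.gate = nn.Sequential(                        # produces one coefficient per branch
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, len(kernel_sizes), 1),
            nn.Softmax(dim=1))

    def forward(self, x):
        w = self.gate(x)                                  # (B, num_branches, 1, 1)
        y = sum(w[:, i:i + 1] * branch(x) for i, branch in enumerate(self.branches))
        return x + y                                      # residual aggregation of the weighted branches

out = HAConvSketch(256)(torch.randn(1, 256, 160, 160))
```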
Figure 5. The overall structure of FS2M: the input passes through a CBS block and is split across several FSBs, whose outputs are concatenated and passed through a final CBS block; each FSB consists of CBS and FDConv and includes cross-layer shortcut connections.
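The sketch below mirrors the split–process–concatenate flow of Figure 5 under a simplifying assumption: the FDConv of [47] is replaced by a learnable per-channel scaling of the 2D FFT spectrum, so only the overall structure and the idea of frequency-domain modulation are conveyed. The class names (CBS, FSB, FS2MSketch) follow the figure loosely and are hypothetical.

```python
import torch
import torch.nn as nn

class CBS(nn.Sequential):
    """Conv-BN-SiLU block."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__(nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU())

class FSB(nn.Module):
    """CBS followed by a simplified frequency-domain modulation, with a cross-layer shortcut."""
    def __init__(self, c):
        super().__init__()
        self.cbs = CBS(c, c, k=3)
        self.freq_scale = nn.Parameter(torch.ones(c, 1, 1))   # per-channel spectral gain (FDConv stand-in)

    def forward(self, x):
        y = self.cbs(x)
        spec = torch.fft.rfft2(y, norm="ortho")                # 2D FFT of the spatial feature map
        y = torch.fft.irfft2(spec * self.freq_scale, s=y.shape[-2:], norm="ortho")
        return x + y                                           # shortcut connection

class FS2MSketch(nn.Module):
    def __init__(self, c, n_blocks=2):
        super().__init__()
        self.stem = CBS(c, c)
        self.blocks = nn.ModuleList([FSB(c // 2) for _ in range(n_blocks)])
        self.out = CBS(c // 2 * (n_blocks + 1), c)

    def forward(self, x):
        a, b = self.stem(x).chunk(2, dim=1)        # split after the initial CBS
        outs = [a]
        for blk in self.blocks:
            b = blk(b)                             # pass one half through the stacked FSBs
            outs.append(b)
        return self.out(torch.cat(outs, dim=1))    # concatenate all branches and project out

y = FS2MSketch(256)(torch.randn(1, 256, 160, 160))
```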
Figure 6. Normalized object size distributions on the M3FD and VEDAI datasets.
Figure 7. Cross-modal detection visualization results on the M3FD dataset.
Figure 8. Comparison of heatmaps produced by different methods in multimodal image fusion. (a) Original RGB–IR image pairs; (b) heatmaps of the baseline method; (c) heatmaps of the proposed method.
Figure 9. Qualitative comparison with YOLOv11 on the VEDAI dataset. (a) Ground truth annotations; (b) predictions of YOLOv11; (c) predictions of the proposed method. Red triangles indicate false detections.
Table 1. Configuration of the experimental environment.
Environment | Parameter
Operating System | Red Hat 4.8.5-28
Programming Language | Python 3.11
Framework | PyTorch 2.1
CUDA | CUDA 12.1
GPU | NVIDIA A800
CPU | Intel Xeon 6338N
VRAM | 80 GB
Table 2. Configuration of HAM and FS2M in the backbone (input 640 × 640).
Module | Branch | Level | Feature Size | Channels
HAM | RGB | P2 (stride 4) | 160 × 160 | 256
HAM | RGB | P3 (stride 8) | 80 × 80 | 512
HAM | RGB | P4 (stride 16) | 40 × 40 | 512
HAM | RGB | P5 (stride 32) | 20 × 20 | 1024
FS2M | IR | P2 (stride 4) | 160 × 160 | 256
FS2M | IR | P3 (stride 8) | 80 × 80 | 512
FS2M | IR | P4 (stride 16) | 40 × 40 | 512
Table 3. Ablation study on the asymmetric architecture. The first metric block reports results on M3FD and the second on VEDAI; mAP values are on the validation sets.
RGB | IR | Params (M) | GFLOPs | mAP50 (%) | mAP50:95 (%) | Params (M) | GFLOPs | mAP50 (%) | mAP50:95 (%)
P5 | P5 | 60.0 | 11 | 78.9 | 53.1 | 140.5 | 11 | 46.6 | 27.4
P5 | P4 | 52.4 | 11 | 79.3 | 53.2 | 137.7 | 11 | 50.3 | 28.0
P5 | P3 | 30.6 | 8.8 | 68.5 | 43.5 | 72.6 | 11 | 42.6 | 26.0
P4 | P5 | 57.1 | 11 | 72.1 | 46.7 | 137.7 | 11 | 40.9 | 25.0
Table 4. Ablation experiments on the M3FD and VEDAI datasets. ✓/× indicate whether HAM, FS2M, and LGAF are enabled; complexity is reported as #P (M), GFLOPs, and FPS, and performance as validation mAP50 and mAP50:95 (%).
Dataset | HAM | FS2M | LGAF | #P (M) | GFLOPs | FPS | mAP50 (%) | mAP50:95 (%)
M3FD | × | × | × | 60.0 | 11.0 | 231.3 | 78.9 | 53.1
M3FD | ✓ | × | × | 32.4 | 14.9 | 174.0 | 84.8 | 56.3
M3FD | ✓ | ✓ | × | 57.7 | 15.1 | 151.0 | 85.4 | 57.9
M3FD | ✓ | ✓ | ✓ | 59.1 | 19.3 | 137.3 | 86.1 | 58.2
VEDAI | × | × | × | 140.5 | 11.0 | 121.3 | 46.6 | 27.4
VEDAI | ✓ | × | × | 90.7 | 15.4 | 105.4 | 50.1 | 31.2
VEDAI | ✓ | ✓ | × | 138.7 | 15.2 | 94.4 | 51.5 | 33.0
VEDAI | ✓ | ✓ | ✓ | 139.8 | 19.9 | 92.1 | 54.3 | 34.3
Table 5. Experimental results on the M3FD dataset.
Method | Modality | mAP50 (%) | mAP50:95 (%)
YOLOv8 | RGB | 73.8 | 47.4
YOLOv8 | IR | 71.0 | 46.7
YOLOv8 | Multi | 78.0 | 51.7
YOLOv10 [53] | RGB | 74.2 | 47.9
YOLOv10 [53] | IR | 75.5 | 47.5
YOLOv10 [53] | Multi | 78.6 | 51.6
YOLOv11 | RGB | 75.0 | 48.1
YOLOv11 | IR | 72.6 | 47.4
YOLOv11 | Multi | 77.5 | 51.8
Swin Transformer [54] | RGB | 75.2 | 43.3
Swin Transformer [54] | IR | 71.3 | 40.5
Swin Transformer [54] | Multi | 72.5 | 41.1
CenterNet2 [55] | RGB | 77.1 | 51.3
CenterNet2 [55] | IR | 64.2 | 41.1
CenterNet2 [55] | Multi | 70.1 | 44.8
Sparse R-CNN [56] | RGB | 80.2 | 47.5
Sparse R-CNN [56] | IR | 74.6 | 42.8
Sparse R-CNN [56] | Multi | 76.5 | 45.8
CDDFuse [57] | Multi | 81.5 | 46.3
RT-DETR [58] | Multi | 85.8 | 51.6
MM-DETR [59] | Multi | 73.39 | 42.6
AsyFusionNet | RGB | 82.0 | 53.4
AsyFusionNet | IR | 78.5 | 51.3
AsyFusionNet | Multi | 86.3 | 58.2
Table 6. Experimental results on the VEDAI dataset.
Method | Modality | Car | Pickup | Camping | Trucks | Others | Tractors | Boats | Vans | mAP50 (%) | mAP50:95 (%)
YOLOv8 | RGB | 80.9 | 64.9 | 65.3 | 48.3 | 20.6 | 47.0 | 29.9 | 43.9 | 50.1 | 31.6
YOLOv8 | IR | 83.2 | 72.9 | 55.7 | 33.3 | 23.1 | 29.8 | 24.4 | 44.9 | 45.9 | 28.9
YOLOv8 | Multi | 82.9 | 74.2 | 67.4 | 41.4 | 34.0 | 69.2 | 25.0 | 25.6 | 52.2 | 32.2
YOLOv10 [53] | RGB | 78.3 | 63.3 | 54.3 | 20.0 | 24.0 | 40.7 | 24.3 | 20.5 | 40.7 | 25.0
YOLOv10 [53] | IR | 76.0 | 60.1 | 52.5 | 25.2 | 15.2 | 22.3 | 14.0 | 54.8 | 40.0 | 24.8
YOLOv10 [53] | Multi | 82.3 | 66.6 | 60.5 | 26.1 | 18.1 | 51.5 | 37.6 | 32.9 | 46.9 | 29.6
YOLOv11 | RGB | 83.2 | 72.9 | 55.7 | 33.3 | 23.1 | 29.8 | 24.4 | 44.9 | 45.9 | 28.9
YOLOv11 | IR | 81.2 | 60.5 | 57.0 | 35.6 | 18.8 | 33.2 | 23.3 | 45.0 | 44.2 | 27.3
YOLOv11 | Multi | 76.9 | 66.2 | 65.8 | 47.7 | 32.6 | 46.9 | 22.9 | 20.3 | 47.7 | 27.5
AsyFusionNet | RGB | 82.4 | 72.5 | 54.3 | 34.7 | 22.8 | 29.6 | 23.5 | 46.5 | 45.2 | 29.8
AsyFusionNet | IR | 74.4 | 61.2 | 42.0 | 50.5 | 16.6 | 35.0 | 9.5 | 19.3 | 42.3 | 26.3
AsyFusionNet | Multi | 83.5 | 71.7 | 65.1 | 44.1 | 28.2 | 51.2 | 45.0 | 43.7 | 54.1 | 34.7