1. Introduction
Multimodal object detection plays an increasingly critical role in real-world applications such as surveillance [1,2], autonomous driving [3], and nighttime perception [4]. These scenarios are typically characterized by complex and dynamically changing visual conditions, including illumination variations, occlusion, and environmental interference such as smoke or fog. Under conditions of good visibility, visible images provide rich visual information such as texture, color, structure, and edge details, making them highly valuable for accurate object recognition. However, when exposed to adverse environments such as low-light scenes, severe occlusion, or strong backlighting, the representational capacity of RGB images degrades significantly, resulting in compromised detection performance. In contrast, infrared (IR) images capture thermal radiation emitted by objects and are independent of ambient light, allowing them to maintain stable perception under poor lighting conditions. Nevertheless, IR images often lack the detailed structural and textural cues available in RGB imagery. Therefore, relying solely on a single modality is often insufficient to address the diverse perceptual challenges in complex environments. By combining the fine-grained visual details of RGB with the illumination-invariant characteristics of IR, it is possible to exploit complementary information and enhance the reliability and adaptability of object detection systems across various environmental conditions.
Although multimodal fusion techniques have achieved considerable progress, current methods still suffer from the following three critical weaknesses:
Weakness 1: Lack of scene-aware modeling of modality reliability.
In practical applications, the perceptual effectiveness of each modality can fluctuate significantly with environmental factors such as illumination, occlusion, and thermal interference. However, most existing multimodal fusion methods—such as DenseFuse [5], CDDFuse [6], and CUFD [7]—rely on fixed weighting schemes or static heuristic rules. These methods lack the capacity to model the relationship between scene semantics and modality reliability, making it difficult to dynamically adjust modality contributions in response to changing conditions. As a result, models often misidentify the dominant modality, for instance, by over-relying on RGB images in nighttime scenes or amplifying low-quality IR features in thermally cluttered environments, ultimately leading to reduced detection accuracy and unstable performance.
As shown in Figure 1, RGB and IR modalities demonstrate distinct advantages under varying environmental conditions, highlighting their strong complementarity. In subfigure (a), under nighttime scenes with strong glare interference from vehicle headlights, the RGB image suffers from visual degradation and fails to reveal the pedestrian in the red-marked region, whereas the IR image clearly highlights the human target based on thermal radiation. Similarly, in subfigure (b), dense smoke causes severe occlusion in the RGB image, making it difficult to distinguish objects, while the IR modality still provides sufficient contrast to identify the pedestrian. In contrast, subfigures (c) and (d) depict well-lit daytime scenarios where the RGB modality excels. RGB images offer richer color, texture, and structure, allowing for more accurate recognition of dense crowds or small-scale pedestrians. In these cases, IR imagery may suffer from limited discriminability due to uniform thermal backgrounds or reduced temperature differences between targets and surroundings. These observations underscore the context-dependent superiority of each modality: IR is more effective in low-visibility or occluded environments, while RGB is preferable in scenes with sufficient illumination and complex structural information. They also reveal the limitations of static fusion strategies, which fail to fully exploit the complementary nature of RGB and IR under diverse conditions.
Therefore, it is crucial to design a scene-aware and dynamically adaptive fusion mechanism that can identify the dominant modality in real time, adjust modality contributions based on environmental semantics, and enable fine-grained cross-modal cooperation. This insight serves as a key motivation for the development of our proposed CLSANet framework.
Weakness 2: Static fusion strategies lack region-level adaptability.
Most existing multimodal fusion methods apply a uniform fusion rule across the entire image [5,8,9], without accounting for the fact that different semantic regions—such as foreground versus background or object boundaries versus interior areas—may exhibit distinct modality preferences. Specifically, foreground objects often rely on the rich texture, color, and shape cues provided by RGB images to support accurate localization and classification, while background or peripheral regions are typically better represented by the contour-preserving and illumination-insensitive characteristics of IR imagery, especially under challenging lighting conditions. Such spatially invariant fusion designs ignore the coupling between regional semantics and modality-specific advantages [7,10], leading to the inclusion of redundant or even conflicting features. This not only increases the burden of feature redundancy, but may also enhance background interference or blur object boundaries, ultimately limiting the model’s ability to fully exploit complementary multimodal information at a fine-grained spatial level.
Weakness 3: Lack of modality contribution modeling hinders interpretability of detection results.
Most existing multimodal object detection methods do not explicitly model the individual contributions of RGB and IR modalities to the final decision [11,12,13]. Instead, they typically generate highly entangled fused features by directly concatenating or weighting the modality-specific features. As a result, when the detector outputs bounding boxes or classification scores, it is difficult to determine which modality primarily influenced the decision. This lack of modality-level attribution limits the interpretability of the model, making it challenging to diagnose errors or performance fluctuations—particularly in failure cases. Moreover, current fusion strategies generally lack mechanisms to perceive and adapt to dynamic variations in modality quality. For example, under low-light conditions or when IR suffers from thermal interference, the model cannot selectively suppress noisy signals, which may degrade detection stability. In real-world applications such as nighttime surveillance or autonomous driving in adverse conditions, the inability to interpret and control modality behavior becomes a major obstacle to reliable deployment.
To address the above challenges, we propose a novel Cognitive Learning-based Self-Adaptive Feature Fusion Network (CLSANet). Inspired by the human perceptual ability to selectively focus on the most informative sensory input, CLSANet is designed to dynamically adjust modality contributions based on scene semantics and local visual complexity [14]. It adopts a modular and lightweight front-fusion architecture tailored to efficient RGB-T (RGB-IR) object detection and consists of the following three components:
Dominant Modality Identification (DMI): This module analyzes global scene context in real time and selects the most informative modality as the dominant one, guiding the subsequent fusion process.
Modality Enhancement (ME): An attention-driven feature disentanglement mechanism is introduced to explicitly decompose modality features into shared and differential components, enhancing semantic specificity and cross-modal complementarity.
Self-Adaptive Fusion (SAF): This module incorporates both global semantics and local region complexity to dynamically regulate modality-wise fusion weights, enabling fine-grained, interpretable multimodal integration.
We conduct extensive experiments on three public RGB-T detection benchmarks—M3FD, LLVIP, and MSRS—to validate the effectiveness of CLSANet under varying environmental conditions, including illumination changes and thermal noise.
The main contributions of this work are summarized as follows:
We propose a cognitive learning-based adaptive fusion strategy that explicitly models the relationship between scene semantics and modality reliability, enabling dynamic dominant modality selection.
We design a region-aware Modality Enhancement Module that disentangles and reinforces shared and differential features, improving the discriminative power and reliability of the fused representation.
We develop a lightweight, interpretable, and detector-friendly RGB-T fusion framework that achieves state-of-the-art performance on multiple benchmarks, with high inference efficiency and deployment flexibility.
The remainder of this paper is organized as follows: Section 2 reviews related work on multimodal fusion and object detection; Section 3 presents the CLSANet architecture and key modules; Section 4 details the experimental settings, datasets, and evaluation results; Section 5 provides an in-depth discussion of the findings, analyzes their implications, and addresses potential limitations; and Section 6 summarizes the main contributions and outlines future research directions.
3. The Proposed Method
Our proposed CLSANet method dynamically adapts to the varying strengths of infrared and visible images across different environmental conditions, addressing key challenges in multimodal object detection. The framework is composed of three primary modules: the Dominant Modality Identification Module, the Modality Enhancement Module, and the Self-Adaptive Fusion Module. These components work synergistically to determine the dominant modality, enhance modality-specific features, and adaptively balance modality contributions in the fusion process to improve overall detection accuracy.
3.1. Problem Formulation
Let $I_{\mathrm{RGB}} \in \mathbb{R}^{H \times W \times C}$ and $I_{\mathrm{IR}} \in \mathbb{R}^{H \times W \times C}$ denote the visible (RGB) and infrared (IR) images, where $H$, $W$, and $C$ represent the image height, width, and number of channels, respectively. In this work, both modalities are mapped to three-channel representations ($C = 3$) to ensure consistent dimensionality for fusion.
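As a minimal illustration of this input convention, a single-channel infrared frame can be expanded to match the RGB tensor shape before fusion. The replication scheme below is an assumption for the sketch; the paper only requires that both modalities share a three-channel format.

```python
# Illustrative sketch: align a single-channel IR frame with the 3-channel RGB
# input so both modalities share the same (B, 3, H, W) shape before fusion.
# Channel replication is an assumed mapping, not necessarily the authors' choice.
import torch

rgb = torch.rand(1, 3, 768, 1024)   # visible image, already three channels
ir = torch.rand(1, 1, 768, 1024)    # raw infrared image, single channel
ir3 = ir.repeat(1, 3, 1, 1)         # replicate to a three-channel representation
assert rgb.shape == ir3.shape       # consistent dimensionality for fusion
```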
The goal of CLSANet is to generate a fused feature representation $F_{\mathrm{fused}}$ that adaptively integrates complementary information from both modalities to enhance object detection performance under varying environmental conditions. As illustrated in Figure 2, the overall pipeline comprises three key stages:
Dominant Modality Identification dynamically identifies the dominant modality by analyzing real-time environmental cues (e.g., lighting conditions, scene complexity). The network selects the modality with the most informative features for each scene, enhancing detection accuracy across varying conditions [35].
Modality Enhancement refines modality-specific features by separating them into shared and differential components. This preserves unique modality characteristics while enhancing complementary information, improving feature quality and robustness for accurate multimodal detection.
Self-Adaptive Fusion integrates features from both modalities using an adaptive weighting mechanism that adjusts in real time based on scene complexity. This dynamic balancing ensures consistent performance across diverse conditions, maintaining detection reliability and accuracy in fluctuating scenarios.
3.2. Dominant Modality Identification Module
In multimodal object detection, the contribution of each modality to discriminative capability varies substantially across different scenes. To adaptively select the modality providing the most informative cues under varying conditions, we propose a Dominant Modality Identification (DMI) module that models global scene semantics to determine modality dominance.
As illustrated in Figure 3, the DMI module first concatenates the RGB image $I_{\mathrm{RGB}}$ and the infrared image $I_{\mathrm{IR}}$ along the channel dimension to form a joint feature map:
$$F_{\mathrm{cat}} = \mathrm{Concat}(I_{\mathrm{RGB}}, I_{\mathrm{IR}}) \in \mathbb{R}^{H \times W \times 6}.$$
Direct concatenation may introduce statistical inconsistencies due to differences in modality-specific distributions. Thus, we employ a lightweight convolutional encoder $E(\cdot)$ comprising multiple convolutional layers with a stride of 2, each followed by BatchNorm and LeakyReLU activation. The number of output channels progressively increases from 6 to 16, 32, 64, and 128. This design allows the encoder to extract cross-modal local structural information and project features into a unified representation space:
$$F_{\mathrm{enc}} = E(F_{\mathrm{cat}}).$$
We apply global average pooling to $F_{\mathrm{enc}}$ to obtain a scene-level semantic vector:
$$z = \mathrm{GAP}(F_{\mathrm{enc}}) \in \mathbb{R}^{128}.$$
This vector is then passed through a fully connected layer with learnable weights $W$ and bias $b$, followed by a sigmoid activation to produce the probability that the RGB modality is dominant under the current conditions:
$$p_{\mathrm{RGB}} = \sigma(W z + b).$$
Based on this probability, the dominant modality $M_{\mathrm{dom}}$ is selected as
$$M_{\mathrm{dom}} = \begin{cases} \mathrm{RGB}, & p_{\mathrm{RGB}} \geq 0.5, \\ \mathrm{IR}, & \text{otherwise}. \end{cases}$$
Here, $p_{\mathrm{RGB}}$ quantifies the likelihood that the RGB modality provides superior discriminative cues relative to the infrared modality in the current scene. The selected dominant modality guides the subsequent fusion stage, while the secondary modality complements the feature representation. Through end-to-end training, the model learns optimal decision boundaries, ensuring accurate dominant modality selection across diverse environments and enhancing the performance and generalization capability of the multimodal detection framework.
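A minimal PyTorch sketch of this module is given below. The layer composition follows the description above (stride-2 convolutions with BatchNorm and LeakyReLU, a channel progression of 6, 16, 32, 64, 128, global average pooling, and a sigmoid-activated fully connected layer); the kernel size, LeakyReLU slope, and the 0.5 decision threshold are assumptions rather than the authors' exact settings.

```python
import torch
import torch.nn as nn

class DominantModalityIdentification(nn.Module):
    """Sketch of the DMI module: predicts the probability that RGB is dominant."""

    def __init__(self):
        super().__init__()
        chans = [6, 16, 32, 64, 128]  # channel progression stated in the text
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [
                nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),  # assumed 3x3 kernels
                nn.BatchNorm2d(c_out),
                nn.LeakyReLU(0.1, inplace=True),
            ]
        self.encoder = nn.Sequential(*layers)
        self.fc = nn.Linear(chans[-1], 1)  # learnable weights W and bias b

    def forward(self, rgb, ir):
        f_cat = torch.cat([rgb, ir], dim=1)           # (B, 6, H, W) joint feature map
        f_enc = self.encoder(f_cat)                   # unified representation space
        z = f_enc.mean(dim=(2, 3))                    # global average pooling -> (B, 128)
        p_rgb = torch.sigmoid(self.fc(z)).squeeze(1)  # probability that RGB is dominant
        return p_rgb, p_rgb >= 0.5                    # assumed 0.5 decision threshold
```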
3.3. Modality Enhancement Module
To enrich modality-specific representations and improve the discriminative capacity of fused features, we propose a lightweight yet effective Modality Enhancement Module based on residual difference modeling with attention modulation [36]. This module is designed to enhance both strong and weak modality streams by injecting complementary residual cues derived from their mutual differences, guided by channel-level importance weights.
Let $F_s$ and $F_w$ denote the feature maps of the strong and weak modalities, respectively, as determined in the modality selection stage (Section 3.2). We compute bidirectional difference residuals together with their channel-level attention weights:
$$D_s = F_s - F_w, \qquad A_s = \sigma\!\left(\mathrm{GAP}(D_s)\right),$$
$$D_w = F_w - F_s, \qquad A_w = \sigma\!\left(\mathrm{GAP}(D_w)\right).$$
Here, $\mathrm{GAP}(\cdot)$ denotes global average pooling applied along the spatial dimensions to each channel independently, and $\sigma(\cdot)$ represents sigmoid activation. These operations produce soft attention weights that highlight semantically informative channels in the difference tensors.
The final enhanced feature maps are obtained by injecting the residuals back into the original modality streams:
$$\hat{F}_s = F_s + A_w \odot D_w, \qquad \hat{F}_w = F_w + A_s \odot D_s.$$
Through this mechanism, both modalities are simultaneously refined by the residual differences computed from the other, while the attention modulation ensures that only relevant, high-activation channels are emphasized.
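The following sketch expresses the enhancement step in PyTorch. The exact pairing of residuals and attention weights (which difference refines which stream) is inferred from the description above and should be read as an assumption.

```python
import torch
import torch.nn as nn

class ModalityEnhancement(nn.Module):
    """Sketch of the ME module: residual difference modeling with channel attention."""

    def forward(self, f_strong, f_weak):
        # Bidirectional difference residuals between the two streams
        d_s = f_strong - f_weak
        d_w = f_weak - f_strong

        # Channel-level soft attention: global average pooling + sigmoid
        a_s = torch.sigmoid(d_s.mean(dim=(2, 3), keepdim=True))  # (B, C, 1, 1)
        a_w = torch.sigmoid(d_w.mean(dim=(2, 3), keepdim=True))

        # Inject the attention-modulated residual from the other stream
        f_strong_hat = f_strong + a_w * d_w
        f_weak_hat = f_weak + a_s * d_s
        return f_strong_hat, f_weak_hat
```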
3.4. Self-Adaptive Fusion Module
After obtaining the enhanced features $\hat{F}_s$ and $\hat{F}_w$ from the dominant and subordinate modalities, respectively (as described in Section 3.3), we introduce a Self-Adaptive Fusion Module to dynamically integrate them into a unified representation for detection.
Instead of using fixed or equal weights, the module learns a spatially adaptive weight map to determine the contribution of each modality at every spatial location. This allows the fusion to flexibly emphasize one modality over the other depending on the scene context, such as occlusion, lighting, or texture complexity.
The fusion weights are computed by first concatenating $\hat{F}_s$ and $\hat{F}_w$ along the channel dimension and applying a convolution layer:
$$[W_s, W_w] = \mathrm{Conv}_{2C \to 2}\!\left(\mathrm{Concat}(\hat{F}_s, \hat{F}_w)\right).$$
Here, $\mathrm{Conv}_{2C \to 2}$ takes an input with $2C$ channels and outputs two channels, corresponding to the spatial weights $W_s$ and $W_w$. A softmax is applied along the channel dimension at each spatial location to ensure that
$$W_s(x, y) + W_w(x, y) = 1 \quad \text{for all } (x, y).$$
Using the learned weights, the fused feature map is calculated as
$$F_{\mathrm{fused}} = W_s \odot \hat{F}_s + W_w \odot \hat{F}_w.$$
To further refine the fused representation, we apply a non-linear transformation via a convolutional block:
$$F_{\mathrm{out}} = \mathrm{ReLU}\!\left(\mathrm{Conv}_{C \to C}(F_{\mathrm{fused}})\right),$$
where $\mathrm{Conv}_{C \to C}$ is a convolution layer with input and output channels equal to $C$, and ReLU introduces non-linearity for noise suppression and salient feature enhancement.
The final refined output serves as input to the detection head, providing a balanced, semantically rich multimodal representation.
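A compact sketch of this fusion step is shown below. The kernel sizes of the weight head and the refinement block are assumptions; the text only specifies a 2C-to-2 convolution with a per-pixel softmax, followed by a C-to-C convolution with ReLU.

```python
import torch
import torch.nn as nn

class SelfAdaptiveFusion(nn.Module):
    """Sketch of the SAF module: spatially adaptive weighting plus refinement."""

    def __init__(self, channels):
        super().__init__()
        self.weight_head = nn.Conv2d(2 * channels, 2, kernel_size=1)  # assumed 1x1 kernel
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),  # assumed 3x3 kernel
            nn.ReLU(inplace=True),
        )

    def forward(self, f_dom, f_sub):
        # Per-pixel softmax over the two modality weight maps
        w = torch.softmax(self.weight_head(torch.cat([f_dom, f_sub], dim=1)), dim=1)
        fused = w[:, 0:1] * f_dom + w[:, 1:2] * f_sub  # weights sum to 1 at every location
        return self.refine(fused)                       # refined input for the detection head
```

Chaining the three sketches (DMI to select the dominant stream, ME to enhance both streams, SAF to fuse them) mirrors the pipeline illustrated in Figure 2.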
4. Experiments
In this section, we present a comprehensive evaluation of our proposed CLSANet method. We introduce the benchmark datasets, detail the experimental setup, and compare CLSANet with existing state-of-the-art methods. Furthermore, we perform ablation studies to investigate the contributions of each core component in the CLSANet architecture.
4.1. Datasets
To assess the effectiveness and generalization ability of CLSANet, we conduct experiments on three widely used multimodal object detection benchmarks, M3FD [37], LLVIP [38], and MSRS [39], as summarized in Table 1. These datasets provide challenging and diverse scenarios in terms of illumination conditions, object categories, and environmental complexity, enabling fair and comprehensive comparisons with prior multimodal detection methods.
M3FD: This dataset comprises 4200 pairs of RGB-T images with a resolution of 1024 × 768. It covers diverse scenes, including daytime, nighttime, and smoke-covered environments. With six object categories, it serves as a strong benchmark for evaluating performance consistency under illumination variations and occlusions.
LLVIP: Specifically designed for low-light pedestrian detection, LLVIP contains 15,488 aligned RGB-T image pairs at 1280 × 1024 resolution. It focuses on challenging nighttime scenes with a single object class (pedestrian), making it ideal for validating performance under extremely poor visibility.
MSRS: The MSRS dataset includes 1569 high-quality, pixel-aligned image pairs collected under both day and night conditions. Each pair is annotated with bounding boxes from three object classes. Its balanced composition across modalities and scenes makes it suitable for evaluating detection precision and fusion generalization.
4.2. Experimental Setup
All experiments were conducted using the PyTorch (version 2.4.0) framework with CUDA 12.4, running on a system equipped with three NVIDIA A100 GPUs and one A6000 GPU. Our framework first applies the proposed CLSANet module to extract and fuse features from aligned RGB and infrared images. The fused representation is then passed to the YOLOv7 detector for object prediction. This design enables CLSANet to enhance cross-modal representation before entering the detection stage.
The network was trained using the Adam optimizer with an initial learning rate of 0.01, decayed via a cosine annealing schedule to facilitate convergence. Input images were resized to a fixed resolution, and the batch size was adjusted according to available GPU memory. Each model was trained for 300 epochs to ensure adequate convergence.
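A hedged sketch of this training configuration is given below. The model and data loader objects are assumed to be defined elsewhere (standing in for the CLSANet + YOLOv7 pipeline and the RGB-IR data pipeline), and the loss interface is illustrative rather than the authors' exact implementation.

```python
import torch

def train(model, train_loader, epochs=300, base_lr=0.01, device="cuda"):
    """Sketch of the optimization setup: Adam with cosine annealing over 300 epochs."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

    for epoch in range(epochs):
        for rgb, ir, targets in train_loader:                      # aligned RGB-IR pairs with labels
            loss = model(rgb.to(device), ir.to(device), targets)   # assumed detection-loss interface
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()                                           # cosine decay once per epoch
```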
For all datasets (M3FD, LLVIP, and MSRS), we adopted a random 80–20% split for training and testing. To enhance generalization, standard data augmentation techniques such as random flipping and color jittering were applied during training.
Evaluation followed the COCO object detection protocol using two standard metrics: mAP@50 and mAP@95. The mAP@50 measures mean average precision under a fixed IoU threshold of 0.5, indicating performance under lenient matching conditions. In contrast, mAP@95 averages precision over IoU thresholds from 0.5 to 0.95 (in steps of 0.05), reflecting the model’s ability to handle stricter localization requirements. All reported results are averaged over three independent runs to ensure statistical reliability.
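The two metrics differ only in how IoU thresholds are aggregated, as the sketch below illustrates; average_precision(iou_thr) is a hypothetical helper standing in for a full COCO-style evaluator.

```python
import numpy as np

def map_at_50(average_precision):
    """Lenient metric: average precision at a single IoU threshold of 0.5."""
    return average_precision(0.5)

def map_at_50_95(average_precision):
    """Stricter metric (reported as mAP@95): mean AP over IoU 0.50-0.95 in steps of 0.05."""
    thresholds = np.arange(0.5, 1.0, 0.05)
    return float(np.mean([average_precision(t) for t in thresholds]))
```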
4.3. Quantitative Analysis
We evaluate the performance of CLSANet on three widely used benchmark datasets: M3FD, LLVIP, and MSRS. The evaluation includes comparisons with several state-of-the-art multimodal fusion methods. Performance is measured using mean average precision at an IoU threshold of 50% (mAP@50) and averaged over IoU thresholds of 50–95% (mAP@95), which assess overall detection capability and localization precision under stricter conditions, respectively. This dual-metric setting provides a comprehensive view of the model’s accuracy and performance across diverse scenarios.
As shown in Table 2, CLSANet achieves the highest mAP@50 of 0.950 on the M3FD dataset, significantly outperforming recent state-of-the-art methods such as MFMGF-Net (0.930) and TarDAL (0.927). Under the more stringent mAP@95 criterion, CLSANet continues to exhibit superior performance with a score of 0.660. Furthermore, it achieves the best category-specific precision in challenging classes such as “Person” (0.924) and “Car” (0.964), demonstrating its strong capability to generalize across diverse and complex environments.
It is worth noting that several baseline methods, including DenseFuse and FusionGAN, adopt fixed-weight or simple concatenation strategies that fail to adaptively adjust to environmental changes. In contrast, models such as MFMGF-Net, TarDAL, and our CLSANet introduce scene-aware adaptive fusion mechanisms. Under identical evaluation settings, methods with adaptive fusion consistently outperform fixed-fusion baselines, providing quantitative evidence supporting the effectiveness and necessity of our proposed self-adaptive fusion module.
In addition to accuracy advantages, CLSANet exhibits excellent computational efficiency, requiring only 0.21M parameters and 1.19 GFLOPs—significantly less than heavier models like GANMcC (1002.56 GFLOPs) and MFMGF-Net (26.4 GFLOPs). This lightweight design substantially lowers deployment costs, making CLSANet particularly well-suited for real-time and resource-constrained application scenarios.
As shown in Table 3, CLSANet achieves the highest mAP@50 of 0.983 on the low-light LLVIP dataset, slightly surpassing MFMGF-Net (0.979), and maintains a strong mAP@95 of 0.675. The consistent performance improvement over traditional fusion-based detectors such as GAFF and Fusion-Mamba highlights the effectiveness of our self-adaptive fusion mechanism, which dynamically adjusts modality contributions in response to varying illumination conditions.
On the MSRS dataset, CLSANet achieves the best performance across all methods, reaching an mAP@50 of 0.943 and an mAP@95 of 0.896, as reported in Table 4. These results validate its ability to maintain high detection precision across daytime and nighttime conditions, primarily due to its cognitive learning-based self-adaptive fusion mechanism, which dynamically balances the contributions of RGB and infrared inputs.
Overall, CLSANet demonstrates superior performance and computational efficiency across all three benchmarks, confirming its strong generalization, precision, and suitability for real-world multimodal detection tasks.
4.4. Qualitative Analysis
To further understand the behavior of the proposed CLSANet in complex environments, we present qualitative comparisons across four representative scenes in Figure 4. These examples illustrate how the model adaptively integrates modality-specific cues based on scene characteristics.
In Figure 4a, captured under low-light nighttime conditions, the baseline YOLOv7 model fails to detect the pedestrian in the center and incorrectly classifies a bus on the right, likely due to limited visibility in the RGB channel. CLSANet, by assigning higher attention to the infrared (IR) modality, is able to localize both the pedestrian and the bus more accurately, highlighting the model’s ability to respond to insufficient visible illumination by emphasizing thermographic cues.
Figure 4b depicts a scenario involving partial occlusion, where a pedestrian is obstructed by a foreground object. The IR stream alone fails to provide sufficient distinction, leading to a missed detection. In contrast, CLSANet leverages the RGB modality as the dominant source in this case, successfully identifying all pedestrians, indicating that the model can modulate modality emphasis based on contextual visibility.
Figure 4c illustrates a scene with smoke interference, which introduces noise and suppresses features in the RGB image. While the baseline method underperforms due to reduced RGB reliability, CLSANet allocates more attention to the IR channel, which is less affected by atmospheric scattering. This enables the model to maintain detection capability under degraded visible conditions.
Finally, in Figure 4d, where strong glare leads to specular highlights and reflections in the visible image, both the RGB- and IR-only baselines incorrectly identify a non-existent motorcycle. CLSANet, through joint consideration of both modalities, is able to suppress such false positives and retain valid target detections, demonstrating its ability to reconcile conflicting modality cues.
These case studies confirm that CLSANet’s self-adaptive fusion strategy enables spatially varying modality selection and weighting, allowing the model to better preserve relevant semantic cues under varying environmental degradations.
4.5. Ablation Studies
To better understand the contribution of each component in CLSANet, we conduct comprehensive ablation experiments on the M3FD, LLVIP, and MSRS datasets. We investigate both the modular composition of the network and the effectiveness of the Dominant Modality Identification Module (DMIM) under varying scene conditions.
Component-wise ablation: We first evaluate the impact of three key modules: DMI (Dominant Modality Identification Module), ME (Modality Enhancement Module), and SAF (Self-Adaptive Fusion Module). Table 5 reports the results of models built by incrementally adding each component.
Model M1 (with only DMI) achieves better results than unimodal baselines but still underperforms compared to M2–M4, indicating that dominant modality selection alone is insufficient. Adding ME (M2) leads to consistent improvement across all datasets. Replacing it with SAF (M3) yields further gains, demonstrating that adaptive fusion improves stability to spatial variation. Model M4, which integrates all modules, delivers the best overall performance, validating the complementary nature of selection, enhancement, and fusion mechanisms.
Ablation under scene-specific conditions: To further evaluate the effectiveness of the Dominant Modality Identification Module (DMIM), we conduct fine-grained ablation under three typical challenging conditions from the M3FD dataset: low light, occlusion, and strong light. We compare three strategies—fixed IR dominance, fixed RGB dominance, and adaptive selection via DMIM—as summarized in Table 6.
In low-light scenarios, IR-Only performs better than RGB-Only (0.952 vs. 0.894 in mAP@50), confirming the utility of infrared features in visibility-limited environments. However, our adaptive DMIM strategy achieves a further improvement of +0.024 in mAP@50 and +0.044 in mAP@95, demonstrating its ability to flexibly prioritize modality based on the scene.
Under occlusion, RGB-Only slightly outperforms IR-Only due to stronger edge preservation, yet CLSANet still yields the highest detection performance (0.933/0.637), benefiting from its dynamic modality weighting and local enhancement.
In strong-light conditions, RGB features are usually dominant, but IR remains complementary in high-reflection areas. CLSANet again surpasses both baselines, reaching 0.945 (mAP@50) and 0.644 (mAP@95), showing resilience to brightness shifts.
Conclusions: These results confirm that DMIM’s scene-aware modality selection consistently improves detection performance across varied illumination and occlusion scenarios. It plays a crucial role in enabling CLSANet to dynamically balance visual cues for reliable perception in complex environments.
Module-wise ablation results and analysis. We further examine the contributions of three key components in CLSANet—DMI, the Modality Enhancement Module (ME), and Self-Adaptive Fusion (SAF)—through controlled ablation experiments under the same three scene types. For each run, one module is removed and the model is re-evaluated. The results are reported in Table 7.
All modules positively contribute to the overall performance. Notably, removing SAF results in the most significant drop under low-light conditions (a decrease of 0.045), indicating its effectiveness in spatially adapting to complex brightness patterns. In occlusion scenarios, ME proves essential for preserving semantic richness.
These combined ablation experiments demonstrate that each module—DMI, ME, and SAF—plays a complementary and indispensable role in enhancing CLSANet’s performance across varied environmental conditions. Together, they support the model’s adaptability and effectiveness for real-world multimodal detection.
5. Discussion
This study introduced CLSANet, a cognitively inspired multimodal object detection framework designed to adaptively adjust fusion strategies by leveraging global scene semantics and local visual complexity. Our experimental results, obtained across diverse and challenging conditions—including low illumination, partial occlusion, and complex backgrounds—provide empirical support for the initial hypothesis that context-aware, dynamic fusion mechanisms can substantially improve detection performance in multimodal settings.
The findings presented herein are consistent with prior studies that have highlighted the limitations of static- or heuristic-based fusion approaches, particularly under variable and unpredictable environmental conditions. Previous work has demonstrated that simple fusion schemes often fail to fully exploit the complementary strengths of heterogeneous modalities. In contrast, CLSANet extends this line of research by incorporating modules that explicitly model modality contributions based on both global semantic cues and local scene complexity, enabling the framework to selectively prioritize and integrate modality-specific information according to the contextual requirements of each scene.
These contributions have broader implications for the design of multimodal perception systems in safety-critical and resource-constrained domains, such as autonomous driving, robotics, and intelligent surveillance. Specifically, the results underscore the necessity of incorporating adaptive components capable of modulating information integration in response to dynamic scene characteristics. Such mechanisms may play a crucial role in enhancing system performance and reliability in real-world deployments characterized by unpredictable and complex conditions.
Despite these advances, certain limitations warrant further investigation. CLSANet assumes precise spatial alignment between modalities and relies on large-scale annotated datasets for supervised training. These requirements may hinder its practical deployment, particularly in scenarios where sensor calibration is imperfect or annotated data is limited. Future research should examine methods that reduce sensitivity to modality misalignment, including the development of alignment correction modules or architectures that are inherently tolerant of imperfect calibration. Additionally, exploring semi-supervised and unsupervised learning paradigms may mitigate dependency on extensive human annotation and improve scalability.
In conclusion, the results presented support the efficacy of context-aware, adaptive fusion mechanisms in enhancing multimodal detection capabilities. We recommend that subsequent research explore the integration of additional sensing modalities, such as depth and radar, and investigate temporal modeling strategies to extend these insights to video-based detection tasks and continuous perception scenarios.
6. Conclusions and Future Work
In this paper, we proposed CLSANet, a cognitively inspired multimodal object detection framework that adaptively adjusts fusion strategies based on global scene semantics and local visual complexity. Extensive experimental results demonstrate that CLSANet achieves significant improvements in detection accuracy across challenging conditions such as low illumination, partial occlusion, and complex backgrounds. Moreover, CLSANet maintains a favorable balance between performance and computational efficiency, making it suitable for real-time and resource-constrained applications. These findings validate the effectiveness of context-aware, dynamic fusion mechanisms in enhancing multimodal detection performance and provide a solid foundation for developing perception systems capable of operating reliably in complex, dynamic environments.
In our future work, we plan to further enhance CLSANet by improving its ability to handle modality misalignment and incorporating additional sensing modalities, such as depth and radar, to strengthen environmental understanding under more challenging conditions. We also intend to investigate semi-supervised and unsupervised learning strategies to reduce dependence on large-scale annotated datasets and improve the model’s generalization across diverse scenarios. Additionally, we will extend CLSANet to video and continuous perception tasks by introducing temporal modeling mechanisms, aiming to improve detection stability and long-term deployment performance in dynamic environments.