IAF-RTDETR: Illumination Evaluation-Driven Multimodal Object Detection Network for Infrared–Visible Dual-Source Fusion

Hu, Qi; Yu, Haiyan; Zhou, Zhiquan; Li, Simiao

doi:10.3390/electronics15061332

Open AccessArticle

IAF-RTDETR: Illumination Evaluation-Driven Multimodal Object Detection Network for Infrared–Visible Dual-Source Fusion

¹

Harbin Institute of Technology (Weihai) Qingdao Research Institute, Qingdao 266109, China

²

School of Information Science and Engineering, Harbin Institute of Technology, Weihai 264209, China

^*

Author to whom correspondence should be addressed.

Electronics 2026, 15(6), 1332; https://doi.org/10.3390/electronics15061332

Submission received: 25 January 2026 / Revised: 23 February 2026 / Accepted: 9 March 2026 / Published: 23 March 2026

Download

Browse Figures

Versions Notes

Abstract

Infrared–visible multimodal object detection has attracted increasing attention for its robustness under challenging conditions such as low illumination, occlusion, and complex backgrounds. However, existing fusion methods often suffer from coarse illumination modeling and insufficient cross-modal semantic alignment, leading to performance degradation in scenes with strong illumination variations or modality imbalance. To address these issues, this paper proposes IAF-RTDETR (Illumination-Aware Fusion RT-DETR), an illumination-aware fusion real-time detection network built upon the RT-DETR framework. The proposed method introduces a progressive fusion pipeline composed of four key modules: (1) a Modality-Specific Feature Enhancer to recalibrate modality-dependent representations and suppress low-quality feature interference; (2) a lightweight Global Light Estimator that learns a continuous illumination score via self-supervised proxy supervision derived from RGB image statistics; (3) a Light-Aware Fusion module that dynamically adjusts multi-scale fusion weights of infrared and visible features according to the estimated illumination; and (4) a Cross-Layer Dual-Branch Interaction Module that alleviates cross-modal semantic shift through bidirectional attention-guided interaction and channel reweighting. Extensive experiments on the M3FD dataset demonstrate that the proposed method achieves consistent performance improvements under diverse lighting conditions, outperforming RGB-only and IR-only baselines by 7.4% and 16.1% in mAP@50, respectively, while maintaining real-time inference speed (≈17.3 ms). Further evaluations on the LLVIP dataset validate the robustness and generalization ability of IAF-RTDETR in real low-illumination scenarios. Moreover, compared with representative multimodal fusion methods such as TFDet and TarDAL, the proposed method achieves superior detection accuracy. Visualization and quantitative semantic consistency analyses further confirm the effectiveness of the proposed illumination-aware fusion and cross-layer interaction mechanisms. These results indicate that IAF-RTDETR provides an effective and practical solution for real-time infrared–visible object detection under complex lighting environments.

Keywords:

multimodal fusion; transformer-based object detection; illumination perception

1. Introduction

With the widespread application of artificial intelligence technologies in fields such as security surveillance, autonomous driving, and industrial automation, there is an increasing demand for visual perception systems to adapt to complex environments. Single-modality perception methods (such as pure RGB or infrared) often suffer significant performance degradation under occlusion, low illumination, or strong light interference [1], failing to provide complete feature representation for the computer and unable to meet the needs of all-weather, all-scenario perception [2,3,4]. The combination of multispectral data can improve detection accuracy [5,6]. Visible light images offer clear texture details and strong color contrast but have poor resistance to interference, and the amount of information they can convey in foggy, rainy, and nighttime conditions is limited, often leading to substantial information loss [7]. On the other hand, infrared images are less affected by environmental changes and can effectively capture the target’s location and contour information, even in harsh conditions. However, infrared images suffer from severe loss of detail and low spatial resolution [8]. Therefore, multimodal object detection that fuses complementary information from infrared and visible light has gradually become an effective approach to enhancing robustness in complex scenarios [9].

Current research on visible–infrared object detection primarily focuses on feature-level fusion [10]. Cao et al. designed a novel framework based on box-level segmentation supervised learning to achieve high-precision and real-time detection of pedestrian instances in multispectral images [11]. Deng et al. constructed a new multi-layer fusion network (MLF-FRCNN), which extracts multi-scale feature maps from infrared and visible light channels using a residual network backbone and feature pyramid network for pedestrian detection in low-light environments [12]. Li et al. designed a dual-input object detection model based on the Faster RCNN framework. They achieved multimodal information complementarity by jointly using the RPN module and confidence score fusion [13]. Zhang et al. utilized the deep feature concatenation from both modalities to provide attention weights for shallow features [14]. Guan et al. proposed a dual-stream deep convolutional neural network and allocated the weight of dual-stream feature maps through FCN-perceived lighting [15]. Zhou et al. proposed a differential module to balance the two information modalities and a micro-network to perceive illumination, then weighted the confidence scores of infrared and visible light predictions based on the illumination information [16]. Although these methods have made certain progress in multimodal fusion detection, several common shortcomings remain: First, existing illumination perception mechanisms often use binary “day/night” classification or coarse-grained modeling, making it difficult to accurately represent complex scenarios such as shadow occlusion, local overexposure, strong reflections, and intermediate lighting transitions. Second, there is significant inconsistency in the responses of different modalities at different semantic levels, and simple feature concatenation or summation fails to fully leverage their deep complementary relationships [17]. This can lead to semantic shifts and information redundancy in the fused regions, affecting the final detection accuracy and robustness. Therefore, achieving efficient fusion with illumination adaptive control and cross-layer semantic consistency, while ensuring real-time performance, remains a key challenge in the field of infrared–visible multimodal object detection. Similar challenges also arise in other multi-source fusion tasks. For example, Dong et al. proposed LTS to improve robustness in diverse and complex scenes by jointly learning temporal distribution cues and spatial correlation for refinement, highlighting the importance of collaborative modeling and correlation-aware fusion across heterogeneous information sources [18]. Subsequent attention-related equations in this subsection follow the same SE formulation in [19].

Previous studies have shown that CNNs have strong information extraction capabilities for both infrared and visible light data, but CNNs primarily extract local features and struggle to capture global representations [20]. In recent years, Transformer-based object detection methods have demonstrated significant advantages in global modeling capabilities [21]. In 2020, Carion et al. first proposed the end-to-end Transformer-based object detection network DETR, showcasing the enormous potential of Transformers in the field of object detection [22]. However, the local information extraction ability of Transformer architectures is not as strong as that of CNNs. Recent work has combined attention mechanisms with CNNs. Ji proposed an effective and efficient spatial–spectral feature extraction network, PASSNet, for hyperspectral image classification [23], and designed convolutional blocks and Transformer blocks to simultaneously extract local and global features. RT-DETR [24], as the first end-to-end Transformer detection framework with real-time performance, achieves a good balance between accuracy and speed. RT-DETR adopts an end-to-end decoupled decoding structure, which, compared to existing object detection methods such as Faster R-CNN and YOLO series [25,26,27,28,29,30,31], offers advantages such as no need for anchors, fast training convergence, and quick inference speed, providing an ideal foundation for multimodal fusion detection. Therefore, this paper selects RT-DETR as the base framework to ensure stability and real-time performance, and designs a progressive fusion pipeline of “modality-specific enhancement → illumination estimation → illumination-aware multi-scale fusion → cross-layer bidirectional alignment”. Specifically, MFE first enhances modality-dependent representations to suppress noise and texture deficiency; GLE then learns a continuous illumination score via self-supervised proxy supervision to provide explicit lighting priors; LAF performs illumination-conditioned multi-scale fusion with complementary modality weights; finally, CL-DBIM reinforces cross-modal and cross-layer interaction to mitigate semantic shift and fusion imbalance.

The main contributions are as follows: (1) We propose a lightweight Global Light Estimator (GLE) that learns a continuous illumination score under a self-supervised proxy supervision constructed from RGB image statistics (mean brightness, contrast variation, and entropy), enabling fine-grained illumination perception beyond coarse day/night labels. (2) We propose a continuous illumination score-driven multi-scale fusion strategy (LAF), which dynamically assigns complementary fusion weights to RGB and IR features at each scale according to the estimated illumination, improving robustness under “dim–normal–bright” transitions and strong-light interference. (3) We design a Modality-Specific Feature Enhancer (MFE) and a Cross-Layer Dual-Branch Interaction Module (CL-DBIM) to recalibrate modality-specific responses and reduce cross-modal semantic shift via cross-layer attention and channel re-weighting, alleviating modality redundancy and fusion imbalance.

Extensive comparisons, ablation studies, visualization analyses, and computational cost evaluations on M3FD are provided to validate effectiveness, robustness, and real-time deploy ability. Standard formulations adopted from prior work are cited accordingly.

2. Materials and Methods

2.1. Network Structure

2.1.1. Overall Pipeline and Design Intuition

To enhance the robustness of infrared–visible multimodal object detection in complex lighting scenarios, we propose the IAF-RTDETR network framework based on RT-DETR, with its overall structure shown in Figure 1. The model takes visible light images and infrared images as dual inputs and uses a lightweight shared backbone network to extract multi-scale features, reducing the computational overhead caused by the dual-branch structure.

Design rationale: IAF-RTDETR adopts a progressive pipeline of “MFE → GLE → LAF → CL-DBIM”. MFE stabilizes modality-specific feature distributions before fusion; GLE provides an explicit illumination prior learned in a self-supervised manner; LAF uses the prior to adaptively modulate multi-scale fusion weights; and CL-DBIM further aligns cross-modal semantics across layers via bidirectional interaction, thereby improving robustness under diverse lighting conditions.

2.1.2. Network Modules

The network sequentially introduces four key modules:

MFE (Modality-Specific Feature Enhancer): This module enhances the internal representation ability of each modality, allowing for more balanced discriminability in the input fusion phase.

GLE (Global Light Estimator): This module generates an illumination score to provide environmental cues for subsequent dynamic fusion.

LAF (Light-Aware Fusion): This illumination-aware fusion module dynamically adjusts the fusion ratio of infrared and visible light features across multiple scales based on the illumination score, improving information utilization efficiency under complex lighting conditions.

CL-DBIM (Cross-Layer Dual-Branch Interaction Module): This module reduces modality discrepancies and addresses semantic shift by implementing cross-layer attention and channel recalibration.

Finally, the fused features are processed by the RT-DETR’s decoupled decoder to produce end-to-end detection results.

2.2. Module Function Description and Mathematical Modeling

In this section, we analyze the functions of the key modules in the network and provide their mathematical models. This will help clarify the input–output behaviors and working mechanisms of each module, aiding in the logical reproduction and structural explanation of the network.

2.2.1. Modality-Specific Feature Enhancer (MFE)

Infrared and visible light images differ significantly in terms of imaging mechanisms, texture distribution, and noise characteristics. Directly feeding them into the fusion module may result in increased discrepancies in the response amplitudes between the modalities, and low-quality modality interference may corrupt the higher-quality modality. To address this, we design the MFE to recalibrate the features within each modality. MFE employs a collaborative mechanism using channel attention and spatial attention to extract the importance distribution across channels and the salient responses in spatial regions. This enhances the texture salient regions in visible images and the thermal radiation salient regions in infrared images. In this way, MFE significantly improves the discriminability of each modality’s features before fusion, providing a more reliable input for subsequent illumination-aware fusion. The specific structure is shown in Figure 2.

The MFE module can adaptively highlight key information within each modality, helping to improve fusion quality and overall detection performance. The specific implementation for the RGB branch is as follows:

Let the multimodal feature map be:

F \in R^{B \times C \times H \times W}

(1)

where B, C, H, and W denote the batch size, channel number, height, and width of the feature map.

We introduce a learnable channel weight vector:

γ \in R^{C}

(2)

After broadcasting, the vector is used to apply row-wise and channel-wise weighting, followed by a 1 × 1 convolution for channel mapping as shown in Equation (3).

F^{'} = {C o n v}_{1 \times 1} (γ ⨀ F)

(3)

where

⊙

denotes element-wise multiplication, and

{C o n v}_{1 \times 1}

is the convolution used for channel mapping.

The specific implementation for the IR branch is as follows:

First, the feature map undergoes convolution and activation as shown in Equation (4).

F = R e L U (B N ({C o n v}_{1 \times 1} (F)))

(4)

Next, a spatial attention map is constructed as shown in Equation (5).

M_{s} = σ ({C o n v}_{7 \times 7} (A v g P o o l (F); M a x P o o l (F)))

(5)

Finally, the attention map is applied to the feature map as shown in Equation (6).

F^{'} = F ⊙ M_{s}

(6)

where

M_{s}

denotes the spatial attention map.

2.2.2. Global Light Estimator (GLE)

The main challenge of multimodal fusion under complex lighting conditions lies in the following: in low-light scenarios, the quality of visible light significantly decreases; in strong light or reflective areas, infrared images remain stable, but visible light becomes overexposed; and intermediate lighting transition areas are difficult to accurately describe using simple day/night labels. To address this issue, we propose the Global Light Estimator (GLE) module, which constructs an illumination feature vector based on statistical characteristics such as brightness mean, brightness standard deviation, and image entropy. The GLE then generates a normalized illumination score using a lightweight MLP. This score not only reflects global illumination intensity but also characterizes local brightness variation trends, providing continuous and fine-grained illumination representation for the fusion module. Unlike traditional coarse day/night classifications, GLE can express the continuous change in “dim—normal—bright” lighting in more detailed intervals, making the fusion strategy genuinely adaptive to environmental lighting. The illumination score is shown in Figure 3.

Considering that solely using neural networks to predict illumination may still encounter issues such as poor generalization, we adopt a self-supervised learning strategy to construct the training target for the illumination score. Specifically, we compute a proxy illumination score from simple RGB image statistics and use it as a self-supervised signal during network training. The mean squared error (MSE) is then used as the illumination loss to guide the network in estimating illumination.

The module takes RGB images as input, extracts global image features through lightweight convolution and pooling structures, and outputs an illumination score in the range of [0,1]. This score is passed as a control signal to the subsequent fusion module, indicating the reliability of the RGB information in the current scene. Specifically, the GLE module consists of two convolution layers and a global average pooling unit, providing a compact structure that can efficiently learn cross-scene illumination distribution differences, offering dynamic references for modality fusion. The GLE module predicts the illumination score

s

through the following process as shown in Equation (7).

s = (W_{2} \cdot G A P (R E L U (W_{1} * X_{r g b})))

(7)

where

W_{1}

and

W_{2}

are the convolution weights, and GAP represents global average pooling. This score represents the illumination score level of the current RGB image and is used to guide subsequent modality fusion.

This paper linearly combines the three metrics to form a unified illumination score as shown in Equation (8).

\hat{L} = c l a m p (μ + α \cdot σ + β \cdot H (Y), 0,1)

(8)

where

α = β = 0.3

are the weight coefficients. The Clamp operation normalizes the result to the [0,1] range to match the normalized illumination score in the GLE module.

In Equation (8), the mean brightness provides the primary cue of global exposure level, while the standard deviation and entropy serve as auxiliary cues capturing contrast dispersion and information complexity, respectively. The coefficients

a

and

b

control the strength of these auxiliary corrections. We set

a = b = 0.3

to keep the proxy illumination score mainly governed by the exposure level while allowing moderate refinement in ambiguous cases (e.g., strong shadows, local reflections, and over-exposure), where mean brightness alone is insufficient. The Clamp operation ensures a stable target range of [0,1] consistent with the normalized score predicted by GLE.

Here, the three image statistics—mean brightness, standard deviation, and entropy—describe global brightness, contrast distribution, and information complexity, respectively. They are complementary and can be computed directly from the RGB image, providing a lightweight yet effective supervisory signal for learning a continuous illumination score under complex lighting variations (e.g., shadows, reflections, and nighttime scenes).

The final illumination loss is defined as the mean squared error between the network output and the target score as shown in Equation (9).

L_{i l l u m} = \frac{1}{B} \sum_{b}^{B} ({\hat{L_{b}} - L_{b})}^{2}

(9)

The overall loss function is:

L = L_{t o t a l} + ω L_{i l l u m}

(10)

where

L_{t o t a l}

is the original loss function of RT-DETR, and

ω = 0.1

is the weight for the illumination loss.

The illumination loss is introduced as an auxiliary self-supervised objective to encourage illumination-discriminative representations. The weight

ω

balances the detection loss and the illumination regression loss. We set

ω = 0.1

to prevent the auxiliary illumination objective from dominating optimization, so that the learned illumination score primarily acts as a guidance signal for light-aware fusion rather than an end task.

Using proxy illumination has the following significant advantages:

(1): Trainability: It allows “illumination semantics” to be truly integrated into the backbone features.

If the mean, variance, and entropy are directly used as numerical inputs to the fusion layer, they have no gradient dependency on the backbone network features, and the network itself would never know how to improve brightness estimation. The self-supervised approach treats illumination estimation as a “soft label.” The MSE gradient is backpropagated to the GLE module, enabling the network’s early features to become more discriminative in terms of brightness and contrast.

(2): Semantic Flexibility: The network learns more appropriate representations.

A real nighttime scene is not simply equivalent to “low brightness.” High-contrast light spots and infrared high-heat targets can disrupt the mean value. The convolution estimator can automatically combine local textures, color saturation, noise textures, and other information, as long as it ultimately aligns with the illumination loss. This provides the model with the freedom to learn the illumination representations most helpful for detection, rather than rigidly relying on mathematical averages.

In summary, the network is able to learn more complex illumination patterns, thereby improving the model’s robustness.

2.2.3. Light-Aware Fusion (LAF)

Using the illumination scores obtained from the GLE module and the proxy illumination loss, the Light-Aware Fusion (LAF) module is designed for adaptive modality fusion. The core idea of LAF is to dynamically adjust the fusion ratio of RGB and IR features during the multi-scale fusion phase based on the illumination scores output by GLE. At each fusion scale, LAF applies weighting to the two modality features. To prevent one modality from being suppressed under extreme illumination scores, the illumination scores are smoothed in this approach.

The modality-weighted fusion process is as follows in Equation (11).

F_{f u s e d} = F u s i o n (σ (s - 0.5) \cdot α \cdot F_{r g b}, (1 - (σ (s - 0.5) \cdot α)) \cdot F_{i r})

(11)

where

σ

represents the Sigmoid function,

s

is the illumination score output by GLE, and

α

is the fusion sensitivity parameter. Fusion represents the combination of convolution, BatchNorm, and ReLU modules, which are used to further integrate the weighted modality features.

The weighted features are further integrated through concatenation and convolution, ultimately generating a more robust fused representation. This mechanism enhances the contribution of the infrared (IR) signal in low-light scenarios and retains the RGB details in bright environments, thereby improving the model’s consistency in perception and detection accuracy under different lighting conditions.

2.2.4. Cross-Layer Dual-Branch Interaction Module (CL-DBIM)

Due to the response differences between infrared and visible light images at different semantic levels, direct fusion often leads to semantic shifts. To address this, we design a Cross-Layer Dual-Branch Interaction Module (CL-DBIM), as shown in Figure 4. This module consists of four key subcomponents: channel alignment mechanism, cross-layer modality attention interaction module, SE-Attention, and weighted fusion mechanism.

Channel Alignment Mechanism:

For input feature maps

\{X_{1}, X_{2}\}

, their channel dimensions may not be consistent. To ensure the computability of subsequent fusion, the feature map with fewer channels needs to be adjusted to match the channels of the other feature map. The mapping formula is as follows in Equation (12).

{\hat{X}}_{0} = \{\begin{matrix} S i L U (B N (C o n v_{1 \times 1} (X_{0}))), C_{0} \neq C_{1} \\ X_{0}, C_{0} = C_{1} \end{matrix}

(12)

This step ensures the dimensional consistency of the input features by using a 1 × 1 convolution to avoid introducing too many parameters. Batch Normalization(BN) enhances training stability, while SiLU adds non-linearity to the expression.

Cross-Layer Modality Attention Interaction Module:

This module is designed to establish a complementary channel attention-guided interaction mechanism between modalities at different semantic levels, addressing the issue of semantic inconsistency between modalities at different layers. The input consists of two sets of modality feature maps

X_{1} \in R^{C_{1} \times H \times W}

and

X_{2} \in R^{C_{2} \times H \times W}

, which come from the infrared and visible light modalities, respectively. The interaction process is as follows:

(1): Feature Mapping: First, the features from both modalities are mapped to the same dimensional space. The purpose is to construct a semantically shared channel space, allowing effective information exchange and semantic alignment between the different modalities through the attention mechanism as shown in Equations (13) and (14).

X_{1 \to 2} = φ (X_{1}) = C o n v_{1 \times 1} (X_{1}) \in R^{C_{2} \times H \times W}

(13)

X_{2 \to 1} = ψ (X_{2}) = C o n v_{1 \times 1} (X_{2}) \in R^{C_{1} \times H \times W}

(14)

where

ϕ

and

ψ

represent learnable linear projection operators (realized via 1×1 convolutions, i.e., channel-wise linear layers) employed to align the respective channel spaces.

X_{1 \to 2}

is the feature map obtained by mapping

X_{1}

to the space of

X_{2}

. Similarly,

X_{2 \to 1}

is the feature map obtained by mapping

X_{2}

to the space of

X_{1}

.

(2): Attention-Guided Enhancement: The channel attention mechanism is employed to extract global channel weights from one set of feature maps, thereby guiding the enhancement of the features in the other modality. This process enables the model to selectively amplify the most informative channels, facilitating cross-modal alignment and improving the overall feature representation as shown in Equations (15) and (16).

A_{1} = σ (C_{2} \cdot R e L U (C_{1} \cdot A v g (X_{1})))

(15)

A_{2} = σ (C_{4} \cdot R e L U (C_{3} \cdot A v g (X_{2})))

(16)

where

C_{1}

and

C_{3}

represent the downsampling convolutions, while

C_{2}

and

C_{4}

are the upsampling convolutions.

σ

denotes the Sigmoid activation function.

A_{1}

and

A_{2}

correspond to the channel-wise attention weights, with the same shape as the projected features.

(3): Residual Enhancement: The projected features are multiplied by the attention map from the other modality, and the original modality features are enhanced through a residual learning mechanism. This process enables the model to retain essential information from the original features while integrating cross-modal enhancements as shown in Equations (17) and (18).

{\hat{X}}_{1} = X_{1} + X_{1 \to 2} ⊙ A_{2}

(17)

{\hat{X}}_{2} = X_{2} + X_{2 \to 1} ⊙ A_{1}

(18)

where

⊙

denotes element-wise multiplication across channels, which facilitates bidirectional alignment and complementarity of semantic information between modalities.

To further refine the key information and enhance the fusion accuracy, the SE (Squeeze-and-Excitation) channel attention mechanism is introduced. The features

{\hat{X}}_{1}

and

{\hat{X}}_{2}

after interaction are then concatenated.

X_{c o n c a t} = C o n c a t ({\hat{X}}_{1}, {\hat{X}}_{2}) \in R^{2 C \times H \times W}

(19)

The SE (Squeeze-and-Excitation) module is then applied to perform global channel-wise weighting.

z = A v g (X_{c o n c a t}) \in R^{2 C \times 1 \times 1}

(20)

w = σ (W_{2} \cdot R e L U (W_{2} \cdot z)) \in R^{2 C \times 1 \times 1}

(21)

X_{s e} = X_{c o n c a t} ⊙ w

(22)

W_{1}, W_{2} = S p l i t (X_{s e})

(23)

where

w

represents the channel-wise weight coefficient, which determines the extent to which channel information is preserved.

W_{1}

and

W_{2}

represent the re-weighted modality feature branches.

Finally, the weighted outputs are fused with the previously enhanced features through residual learning to obtain the final fused result:

F_{f i n a l} = C o n c a t ({\hat{X}}_{1} + W_{2}, {\hat{X}}_{2} + W_{1})

(24)

where

{\hat{X}}_{1}

and

{\hat{X}}_{2}

represent the modality features after interaction and enhancement.

W_{1}

and

W_{2}

denote the weighted residual information from the channel attention outputs. The final fused result is

F_{final} \in R^{2 C \times H \times W}

.

The CL-DBIM module achieves effective alignment and fusion of multimodal features across different semantic levels through a step-by-step structural design. First, the module uses a 1 × 1 convolution to automatically align the channels of the input modalities, ensuring uniform feature dimensions. Subsequently, the cross-modal interaction module, guided by a bidirectional attention mechanism, facilitates information complementarity, addressing the inconsistencies in semantic expression between modalities. Building on this, the module further utilizes SE (Squeeze-and-Excitation) channel attention to globally re-weight the concatenated features, enabling the model to automatically focus on the most semantically valuable feature channels. Finally, residual weighted fusion is applied to preserve the original information to the greatest extent and enhance semantic consistency.

3. Results and Discussion

3.1. Experimental Setup and Parameters

The entire algorithm is implemented in a Python 3.8 environment using the PyTorch framework. The operating system is Ubuntu 20.04, with an Intel^® Core i7-10700F CPU @ 2.90GHz × 16, and an NVIDIA GeForce RTX 3070 GPU, with CUDA 11.7 for acceleration. We set the learning rate to 0.0001, batch size to 4 and optimizer to AdamW. To improve the model deployment efficiency, a lightweight ResNet18 architecture is chosen for the backbone, with the total number of parameters controlled to approximately 37M. This represents an approximately 15% improvement over the standard RT-DETR model, while providing better detection accuracy without increasing memory consumption. The average inference time is 17.8 milliseconds, meeting the requirements for real-time object detection speed. Moreover, to explicitly report the computational cost, we added a module-wise complexity breakdown for the baseline and each added component in Table 1.

3.2. Main Performance Comparison

The model is validated on publicly available authoritative multimodal datasets:

M3FD [32]: Focused on multi-scale and multi-environment fusion in traffic scenarios, providing high-quality paired RGB and IR images.

As shown in Table 2, the RT-DETR architecture achieves an mAP@50 of 0.839, precision of 0.821, and recall of 0.766 in the RGB modality, all of which outperform the YOLO series methods, validating the superiority of the Transformer-based object detection framework. Compared to some RGB-T methods, it also has superiority.

After further incorporating the infrared modality, the fused model achieves significant improvements across all metrics, with mAP@50 increasing to 0.862, precision rising to 0.870, and recall improving to 0.811. This demonstrates that the inclusion of multimodal information effectively enhances detection robustness and accuracy.

The ablation experiment results are shown in Table 3.

3.3. Illumination Environment Adaptation Analysis

The accuracy comparison after adding illumination perception is shown in Table 4. It can be seen that the accuracy improved by 3.8% compared to the single-modality model, and by 1.5% compared to the dual-modality fusion without illumination perception.

To more thoroughly evaluate the model’s detection robustness under different lighting conditions, the M3FD dataset is divided into a daytime subset and a nighttime subset, corresponding to normal and low-light environments, respectively. By comparing the performance of different methods on these two subsets, the effectiveness of the proposed illumination perception mechanism can be effectively validated, along with the model’s adaptability in extreme environments.

As shown in Figure 5. The illumination score histogram shows a clear skewed distribution, with most scores concentrated between 0.75 and 0.9. This is mainly because the M3FD benchmark contains substantially more daytime/normal-illumination scenes than nighttime or extremely dark scenes, and the proxy illumination score is normalized/clamped to [0,1], which further compresses the high-illumination range. A portion of the images still fall into 0.3–0.7, corresponding to shadowed, dusk/dawn, or interference scenes. Considering this inherent dataset bias (and the resulting sample imbalance across illumination levels), we do not adopt equal-spacing bins. Instead, the dataset is adaptively divided into five subsets based on the actual distribution density of illumination scores: very dark (<0.4), dark (0.4–0.55), medium (0.55–0.7), bright (0.7–0.85), and very bright (0.85–1.0). After division, each subset retains paired RGB/IR images and labels, forming consistent test subsets with varying lighting conditions.

This division scheme not only facilitates group analysis of the model’s performance under different lighting scenarios but also helps identify the effectiveness and limitations of the illumination perception mechanism in specific regions, such as blurry mid-light areas or strong light interference zones.

As observed in Table 5, the fused model achieves the best or near-optimal detection performance across most lighting levels. It performs optimally in low-light, medium-light, and very bright scenarios, with accuracies of 0.898, 0.921, and 0.835, respectively, surpassing the single-modality models. In the very bright subset, the IR model shows a significant performance drop, with an accuracy of only 0.705, highlighting its sensitivity to overexposure or information degradation in infrared images. However, the fused model benefits from illumination perception adjustment, effectively suppressing the impact of redundant infrared information and enhancing the RGB features, ultimately outperforming the RGB model. This result demonstrates that the fused model possesses strong robustness and illumination adaptation capabilities, allowing it to dynamically adjust modality information based on different lighting conditions while consistently maintaining high detection accuracy. It also validates the practical effectiveness of the illumination perception mechanism in multimodal object detection. The detection results are shown in Figure 6.

3.4. Cross-Modal Branch Learning Analysis

To comprehensively evaluate the performance of the proposed fusion module CL-DBIM in multimodal feature expression and semantic alignment, a visualization mechanism for intermediate feature maps (P3) and semantic difference variations is designed. The image visualization is conducted from three aspects: input–output responses before and after fusion; semantic consistency measurement and change region analysis; and average cosine difference comparison.

3.4.1. Input–Output Responses Before and After Fusion

Figure 7 show the channel-wise average feature response heatmaps for RGB and IR modalities before fusion at the P3 layer. It can be observed that the RGB modality has a certain ability to express texture structures in the scene, with responses primarily concentrated in the lower and middle regions of the image. In contrast, the IR modality shows stable responses at the road edges and distant bright areas, but with lower overall energy, and the semantic boundaries of the targets are unclear.

After fusion through the CL-DBIM module, the responses of both modalities are significantly enhanced. The response range of the RGB feature map has been increased from [1.6, 2.6] to [2.2, 3.0], with a significant increase in semantic activity in the road center, target area, and background edges. Similarly, the response range of the IR modality has been improved from [1.2, 2.2] to [1.8, 2.8], especially in areas such as road boundaries, foreground objects, and distant bright spots, showing a stronger ability to perceive structures. Overall, after fusion, the responses of the RGB and IR branches in the semantic regions are closer, with the enhanced features concentrated in target-related areas and good suppression of background regions. This indicates that the CL-DBIM module effectively facilitates cross-modal information supplementation, improving semantic consistency and regional discrimination.

3.4.2. Semantic Consistency Measurement and Change Region Analysis

Figure 8a shows the semantic difference map between the RGB and IR modalities before and after fusion. The difference values, calculated by computing the cosine similarity and then inverting, reflect the level of semantic consistency between the two modalities in spatial terms. In the figure, many high-difference areas (shown in red) are concentrated in the lower, middle, and target boundary regions, indicating semantic discrepancies between the original modalities in these areas. Overall, compared to Figure 8a, the responses in Figure 8b are more uniform with smaller differences, suggesting that the fusion module has improved the cross-modal semantic alignment. Subsequent cosine-based metric equations in this subsection follow the standard definition in [34].

To further demonstrate the effectiveness of the fusion module, Figure 9 presents the semantic difference change map (Δ = After − Before), which quantifies the improvement in semantic consistency achieved by the fusion process. The blue regions in the figure indicate areas where the fusion has reduced semantic differences, while red represents increased differences. The overall trend is blue, indicating that the fusion process has enhanced semantic consistency in most regions.

3.4.3. Average Cosine Difference

To further validate the effectiveness of the fusion module in terms of semantic consistency from a quantitative perspective, we introduce the Average Cosine Difference (ACD) metric to measure the semantic consistency between the RGB and IR modality feature maps before and after fusion. The formula is as follows:

The cosine similarity for each pixel

(h, w)

between the two feature maps is denoted as

c o s_{(h, w)}

, and the average is computed across the entire feature map. The difference is then used as the final metric.

A C D = \frac{1}{H \cdot W} \sum_{h = 1}^{H} \sum_{w = 1}^{W} (1 - c o s_{(h, w)})

(25)

This study selected 50 pairs of infrared–visible images from the test set for cosine difference testing. The results show that the average cosine difference before fusion is 0.6621, while after fusion, the average cosine difference is reduced to 0.5713, representing a 13.72% decrease. This indicates that the CL-DBIM module effectively reduces the semantic difference between modalities, enhancing feature consistency.

This metric serves as a global quantitative evaluation of semantic alignment capabilities, and, in combination with the visualization results, further validates the effectiveness of the CL-DBIM module in modality adaptation and semantic consistency.

As shown in Table 6, the model performance improves significantly after the further introduction of the CL-DBIM module, with mAP@50 reaching 0.895, precision at 0.881, and recall rising to 0.869. Compared to the fusion model without CL-DBIM, mAP@50 improves by 3.3%, and recall increases by nearly 5.8%, indicating that the module has a significant impact on feature-level alignment and complementary expression.

In summary, the CL-DBIM module, through channel mapping, attention-guided interaction, and residual enhancement mechanisms, achieves bidirectional guidance and fine fusion between infrared and visible light modalities. This effectively alleviates the semantic inconsistency issues between modalities in deep-layer representations, thereby enhancing the model’s object detection capability.

3.5. Evaluation on LLVIP

To better evaluate the robustness of our method under real nighttime conditions and mitigate the limited sample size of the “very dark” subset in M3FD, we additionally conduct experiments on the LLVIP [35] infrared–visible paired dataset. A substantial portion of LLVIP samples are captured at night (approximately 18:00–22:00), providing more representative low-illumination scenarios. We follow the official training/testing split (or the commonly used split) and keep the same input resolution and training protocol as in M3FD for fair comparison. The results are shown in Table 7.

The ablation experiment results are shown in Table 8.

4. Conclusions

This paper addresses the challenges of infrared–visible multimodal object detection under complex and dynamically changing illumination conditions, particularly the issues of coarse illumination modeling, modality imbalance, and cross-modal semantic inconsistency. To this end, we propose IAF-RTDETR, an illumination-aware fusion detection framework built upon RT-DETR, which integrates modality-specific enhancement, continuous illumination estimation, illumination-guided multi-scale fusion, and cross-layer semantic alignment into a unified and efficient pipeline.

Specifically, the MFE module improves the discriminability and stability of single-modality features prior to fusion, mitigating degradation caused by noise and texture loss. The GLE module introduces a self-supervised illumination estimation strategy based on simple RGB image statistics, enabling fine-grained and continuous illumination perception beyond conventional day/night classification. Guided by the estimated illumination score, the LAF module dynamically adjusts the fusion weights of infrared and visible features across multiple scales, enhancing robustness under low-light, strong-light, and transitional illumination conditions. Furthermore, the CL-DBIM module strengthens cross-modal and cross-layer information interaction through bidirectional attention and channel recalibration, effectively reducing semantic shift and improving feature consistency.

Extensive experiments and ablation studies on the M3FD dataset demonstrate that each proposed module contributes stable and interpretable performance gains, particularly under challenging lighting scenarios. The proposed method achieves superior detection accuracy while maintaining real-time inference speed, meeting practical deployment requirements. Additional evaluations on the LLVIP dataset further confirm the effectiveness and generalization capability of the proposed approach in real environments.

Overall, IAF-RTDETR achieves a favorable balance between detection accuracy, illumination adaptability, cross-modal collaboration, and computational efficiency. The proposed framework provides a feasible and scalable solution for real-time infrared–visible object detection in complex environments and offers valuable insights for future research on illumination-aware multimodal perception.

Author Contributions

Conceptualization, H.Y. and Q.H.; methodology, Q.H.; software, Q.H.; validation, Q.H. and S.L.; writing—original draft preparation, Q.H.; review and editing, H.Y. and Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Key R&D Program of Shandong Province, China (Grant No. 2022ZLGX04).

Data Availability Statement

The data are available in the manuscript in the form of results. The source code for IAF-RTDETR will be made publicly available to facilitate reproducibility and future research. We will release the complete implementation in an online repository upon acceptance.

Conflicts of Interest

The authors declare no conflict of interest.

References

Cheng, H.; Long, W.; Li, Y.; Liu, H. Two Low Illuminance Image Enhancement Algorithms Based on Grey Level Mapping. Multimed. Tools Appl. 2020, 80, 7205–7228. [Google Scholar] [CrossRef]
Wang, X.; Sun, Z.; Chehri, A.; Jeon, G.; Song, Y. Deep Learning and Multi-Modal Fusion for Real-Time Multi-Object Tracking: Algorithms, Challenges, Datasets, and Comparative Study. Inf. Fusion 2024, 105, 102247. [Google Scholar] [CrossRef]
Jiao, T.; Guo, C.; Feng, X.; Chen, Y.; Song, J. A Comprehensive Survey on Deep Learning Multi-Modal Fusion: Methods, Technologies and Applications. Comput. Mater. Contin. 2024, 80, 1–35. [Google Scholar] [CrossRef]
Luo, Y.; Luo, Z. Infrared and Visible Image Fusion: Methods, Datasets, Applications, and Prospects. Appl. Sci. 2023, 13, 10891. [Google Scholar] [CrossRef]
González, A.; Fang, Z.; Socarras, Y.; Serrat, J.; Vázquez, D.; Xu, J.; López, A.M. Pedestrian Detection at Day/Night Time with Visible and FIR Cameras: A Comparison. Sensors 2026, 16, 820. [Google Scholar] [CrossRef]
Nataprawira, J.; Gu, Y.; Goncharenko, I.; Kamijo, S. Pedestrian Detection Using Multispectral Images and a Deep Neural Network. Sensors 2021, 21, 2536. [Google Scholar] [CrossRef] [PubMed]
Huang, N.; Liu, J.; Miao, Y.; Zhang, Q.; Han, J. Deep Learning for Visible-Infrared Cross-Modality Person Re-Identification: A Comprehensive Review. Inf. Fusion 2023, 91, 396–411. [Google Scholar] [CrossRef]
Liu, J.; Fan, X.; Jiang, J.; Liu, R.; Luo, Z. Learning a Deep Multi-Scale Feature Ensemble and an Edge-Attention Guidance for Image Fusion. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 105–119. [Google Scholar] [CrossRef]
Zhang, X.; Zhang, X.; Wang, J.; Ying, J.; Sheng, Z.; Yu, H.; Li, C.; Shen, H.-L. TFDet: Target Aware Fusion for RGB-T Pedestrian Detection. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 13276–13290. [Google Scholar] [CrossRef]
Wu, J.; Shen, T.; Wang, Q.; Tao, Z.; Zeng, K.; Song, J. Local Adaptive Illumination-Driven Input-Level Fusion for Infrared and Visible Object Detection. Remote Sens. 2023, 15, 660. [Google Scholar] [CrossRef]
Cao, Y.; Guan, D.; Wu, Y.; Yang, J.; Cao, Y.; Yang, M.Y. Box-Level Segmentation Supervised Deep Neural Networks for Accurate and Real-Time Multispectral Pedestrian Detection. ISPRS J. Photogramm. Remote Sens. 2019, 150, 70–79. [Google Scholar] [CrossRef]
Deng, Q.; Tian, W.; Huang, Y.; Xiong, L.; Bi, X. Pedestrian Detection by Fusion of RGB and Infrared Images in Low-Light Environment. In Proceedings of the IEEE International Conference on Information Fusion (FUSION), Rustenburg, South Africa, 1–4 November 2021; pp. 1–8. [Google Scholar]
Li, C.; Song, D.; Tong, R.; Tang, M. Illumination-Aware Faster R-CNN for Robust Multispectral Pedestrian Detection. Pattern Recognit. 2019, 85, 161–171. [Google Scholar] [CrossRef]
Zhang, L.; Liu, Z.; Zhang, S.; Yang, X.; Qiao, H.; Huang, K.; Hussain, A. Cross-Modality Interactive Attention Network for Multispectral Pedestrian Detection. Inf. Fusion 2019, 50, 20–29. [Google Scholar] [CrossRef]
Guan, D.; Cao, Y.; Yang, J.; Cao, Y.; Yang, M. Fusion of Multispectral Data Through Illumination-Aware Deep Neural Networks for Pedestrian Detection. Inf. Fusion 2019, 50, 148–157. [Google Scholar] [CrossRef]
Zhou, K.; Chen, L.; Cao, X. Improving Multispectral Pedestrian Detection by Addressing Modality Imbalance Problems. In Proceedings of the European Conference on Computer Vision, Online, 23–28 August 2020; pp. 787–803. [Google Scholar]
Gong, J.; Yuan, Z.; Li, W.; Li, W.; Guo, Y.; Guo, B. A Lightweight Upsampling and Cross-Modal Feature Fusion-Based Algorithm for Small-Object Detection in UAV Imagery. Electronics 2026, 15, 298. [Google Scholar] [CrossRef]
Dong, G.; Zhao, C.; Pan, X.; Basu, A. Learning Temporal Distribution and Spatial Correlation Toward Universal Moving Object Segmentation. IEEE Trans. Image Process. 2024, 33, 2447–2461. [Google Scholar] [CrossRef]
Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; IEEE: New York, NY, USA, 2014; pp. 580–587. [Google Scholar]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
Ji, R.; Tan, K.; Wang, X.; Pan, C.; Xin, L. PASSNet: A Spatial-Spectral Feature Extraction Network with Patch Attention Module for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-Time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; IEEE: New York, NY, USA, 2024; pp. 16965–16974. [Google Scholar]
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 779–788. [Google Scholar]
Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
Wu, T.H.; Wang, T.W.; Liu, Y.Q. Real-Time Vehicle and Distance Detection Based on Improved Yolo v5 Network. In Proceedings of the 2021 3rd World Symposium on Artificial Intelligence (WSAI), Guangzhou, China, 18–20 June 2021; pp. 24–28. [Google Scholar]
Reis, D.; Kupec, J.; Hong, J.; Daoudi, A. Real-Time Flying Object Detection with YOLOv8. arXiv 2023, arXiv:2305.09972. [Google Scholar] [CrossRef]
Tian, Y.; Ye, Q.; Doermann, D.J.A.P.A. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
Lei, M.; Li, S.; Wu, Y.; Hu, H.; Zhou, Y.; Zheng, X.; Ding, G.; Du, S.; Wu, Z.; Gao, Y. YOLOv13: Real-Time Object Detection with Hypergraph-Enhanced Adaptive Visual Perception. arXiv 2025, arXiv:2506.17733. [Google Scholar]
Liu, J.; Fan, X.; Huang, Z.; Wu, G.; Liu, R.; Zhong, W.; Luo, Z. M3FD: A Multi-Scale and Multi-Scenario Multimodal Dataset for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
Liu, J.; Fan, X.; Huang, Z.; Wu, G.; Liu, R.; Zhong, W.; Luo, Z. Target-Aware Dual Adversarial Learning and a Multi-Scenario Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5792–5801. [Google Scholar]
Ranjan, R.; Patel, V.M.; Chellappa, R. HyperFace: A Deep Multi-Task Learning Framework for Face Detection, Landmark Localization, Pose Estimation, and Gender Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 121–135. [Google Scholar] [CrossRef] [PubMed]
Jia, X.; Zhu, C.; Li, M.; Tang, W.; Zhou, W. LLVIP: A Visible-Infrared Paired Dataset for Low-Light Vision. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3496–3504. [Google Scholar]

Figure 1. Multimodal IAF-RTDETR Network Architecture Diagram.

Figure 2. MFE Architecture Diagram.

Figure 3. Multi-Indicator Illumination Score: (a) Natural illumination image; (b) Strong light interference image.

Figure 4. CL-DBIM Module Architecture Diagram.

Figure 5. Illumination Score Histogram.

Figure 6. Detection results under different lighting conditions. (a,b) are the RGB and IR detection result of “very bright”; (c,d) are the RGB and IR detection result of “bright”; (e,f) are the RGB and IR detection result of “medium”; (g,h) are the RGB and IR detection result of “dark”; (i,j) are the RGB and IR detection result of “very dark”.

Figure 7. Feature Heatmaps Before and After Fusion: (a) RGB feature heatmap before fusion; (b) RGB feature heatmap after fusion. (c) IR feature heatmap before fusion; (d) IR feature heatmap after fusion.

Figure 8. Semantic Difference Maps Before and After Fusion: (a) Semantic difference map before fusion; (b) Semantic difference map after fusion.

Figure 9. Before/After Fusion Semantic Difference Maps.

Table 1. Computational Cost Comparison.

Algorithm	Modality	GFLOPs
RT-DETR	RGB + IR	108.3
RT-DETR + MFE	RGB + IR	93.1
RTDETR+GLE + LAF	RGB + IR	93.5
RTDETR + CL-DBIM	RGB + IR	93.4
IAF-RTDETR	RGB + IR	94.1

Table 2. Detection Performance of the M3FD Dataset.

Algorithm	Modality	mAP@50	Precision	Recall
YOLOv11	RGB	0.802	0.841	0.73
YOLOv12	RGB	0.767	0.795	0.735
YOLOv13	RGB	0.813	0.832	0.746
Faster R-CNN	RGB	0.795	0.811	0.753
RT-DETR	IR	0.752	0.766	0.723
RT-DETR	RGB	0.839	0.821	0.766
RT-DETR	RGB+IR	0.862	0.87	0.811
MBNet [16]	RGB+IR	0.889	0.896	0.881
TarDAL [33]	RGB+IR	0.866	0.875	0.841
TFDet [9]	RGB+IR	0.891	0.885	0.874
IAF-RTDETR	RGB+IR	0.913	0.892	0.913

Table 3. Ablation Results on Each Module’s Accuracy in M3FD.

Algorithm	Modality	mAP@50	Precision	Recall
RT-DETR	RGB	0.839	0.821	0.766
RT-DETR	IR	0.752	0.766	0.723
RT-DETR	RGB + IR	0.862	0.87	0.811
RT-DETR + MFE	RGB + IR	0.873	0.889	0.805
RT-DETR + LAF	RGB + IR	0.877	0.861	0.843
RT-DETR + CL-DBIM	RGB + IR	0.895	0.881	0.869
RTDETR + MFE + LAF + CL-DBIM	RGB + IR	0.913	0.892	0.913

Table 4. Accuracy Comparison Table: With LAF Module Added.

Algorithm	Modality	mAP@50	Precision	Recall
RT-DETR	RGB	0.839	0.821	0.766
RT-DETR	RGB + IR	0.862	0.87	0.811
RTDETR + LAF	RGB + IR	0.877	0.861	0.843

Table 5. Per-Subset & Per-Model mAP50 Accuracy Table.

Illumination	Number	IR	RGB	RT-DETR + LAF
very dark	17	0.971	0.964	0.958
dark	46	0.887	0.838	0.898
medium	44	0.868	0.869	0.921
bright	154	0.816	0.862	0.859
very bright	159	0.705	0.823	0.835

Table 6. CL-DBIM Module Precision Comparison Table.

Algorithm	Modality	mAP@50	Precision	Recall
RT-DETR	RGB	0.839	0.821	0.766
RT-DETR	RGB + IR	0.862	0.87	0.811
RTDETR+CL-DBIM	RGB + IR	0.895	0.881	0.869

Table 7. Detection Performance of the LLVIP Dataset.

Algorithm	Modality	mAP@50	Precision	Recall
YOLOv11	RGB	0.882	0.897	0.891
YOLOv12	RGB	0.894	0.895	0.887
YOLOv13	RGB	0.903	0.912	0.896
RT-DETR	IR	0.915	0.924	0.902
RT-DETR	RGB	0.908	0.921	0.866
RT-DETR	RGB + IR	0.922	0.927	0.901
MBNet [16]	RGB + IR	0.919	0.916	0.907
TarDAL [33]	RGB + IR	0.926	0.935	0.901
TFDet [9]	RGB + IR	0.932	0.935	0.924
IAF-RTDETR	RGB+IR	0.941	0.942	0.936

Table 8. Ablation Results on Each Module’s Accuracy in LLVIP.

Algorithm	Modality	mAP@50	Precision	Recall
RT-DETR	RGB	0.908	0.921	0.886
RT-DETR	IR	0.915	0.924	0.902
RT-DETR	RGB + IR	0.927	0.927	0.911
RT-DETR+MFE	RGB + IR	0.932	0.940	0.917
RT-DETR+LAF	RGB + IR	0.934	0.936	0.927
RT-DETR+CL-DBIM	RGB + IR	0.936	0.935	0.924
RTDETR+MFE+LAF+CL-DBIM	RGB + IR	0.941	0.942	0.936

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Hu, Q.; Yu, H.; Zhou, Z.; Li, S. IAF-RTDETR: Illumination Evaluation-Driven Multimodal Object Detection Network for Infrared–Visible Dual-Source Fusion. Electronics 2026, 15, 1332. https://doi.org/10.3390/electronics15061332

AMA Style

Hu Q, Yu H, Zhou Z, Li S. IAF-RTDETR: Illumination Evaluation-Driven Multimodal Object Detection Network for Infrared–Visible Dual-Source Fusion. Electronics. 2026; 15(6):1332. https://doi.org/10.3390/electronics15061332

Chicago/Turabian Style

Hu, Qi, Haiyan Yu, Zhiquan Zhou, and Simiao Li. 2026. "IAF-RTDETR: Illumination Evaluation-Driven Multimodal Object Detection Network for Infrared–Visible Dual-Source Fusion" Electronics 15, no. 6: 1332. https://doi.org/10.3390/electronics15061332

APA Style

Hu, Q., Yu, H., Zhou, Z., & Li, S. (2026). IAF-RTDETR: Illumination Evaluation-Driven Multimodal Object Detection Network for Infrared–Visible Dual-Source Fusion. Electronics, 15(6), 1332. https://doi.org/10.3390/electronics15061332

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

IAF-RTDETR: Illumination Evaluation-Driven Multimodal Object Detection Network for Infrared–Visible Dual-Source Fusion

Abstract

1. Introduction

2. Materials and Methods

2.1. Network Structure

2.1.1. Overall Pipeline and Design Intuition

2.1.2. Network Modules

2.2. Module Function Description and Mathematical Modeling

2.2.1. Modality-Specific Feature Enhancer (MFE)

2.2.2. Global Light Estimator (GLE)

2.2.3. Light-Aware Fusion (LAF)

2.2.4. Cross-Layer Dual-Branch Interaction Module (CL-DBIM)

3. Results and Discussion

3.1. Experimental Setup and Parameters

3.2. Main Performance Comparison

3.3. Illumination Environment Adaptation Analysis

3.4. Cross-Modal Branch Learning Analysis

3.4.1. Input–Output Responses Before and After Fusion

3.4.2. Semantic Consistency Measurement and Change Region Analysis

3.4.3. Average Cosine Difference

3.5. Evaluation on LLVIP

4. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI