In this section, we analyze the functions of the key modules in the network and provide their mathematical models. This clarifies the input–output behavior and working mechanism of each module, facilitating reproduction and structural explanation of the network.
2.2.1. Modality-Specific Feature Enhancer (MFE)
Infrared and visible light images differ significantly in terms of imaging mechanisms, texture distribution, and noise characteristics. Directly feeding them into the fusion module may increase the discrepancies in response amplitude between the modalities, and interference from a low-quality modality may corrupt the higher-quality one. To address this, we design the MFE to recalibrate the features within each modality. MFE employs a collaborative mechanism using channel attention and spatial attention to extract the importance distribution across channels and the salient responses in spatial regions. This enhances the texture-salient regions in visible images and the thermal-radiation-salient regions in infrared images. In this way, MFE significantly improves the discriminability of each modality's features before fusion, providing a more reliable input for subsequent illumination-aware fusion. The specific structure is shown in Figure 2.
The MFE module can adaptively highlight key information within each modality, helping to improve fusion quality and overall detection performance. The specific implementation for the RGB branch is as follows:
Let the multimodal feature map be $F_{rgb} \in \mathbb{R}^{B \times C \times H \times W}$, where $B$, $C$, $H$, and $W$ denote the batch size, channel number, height, and width of the feature map, respectively.
We introduce a learnable channel weight vector $w \in \mathbb{R}^{C}$. After broadcasting, the vector applies channel-wise weighting, followed by a $1 \times 1$ convolution for channel mapping, as shown in Equation (3):

$$\hat{F}_{rgb} = \mathrm{Conv}_{1\times1}\big(w \odot F_{rgb}\big) \quad (3)$$

where $\odot$ denotes element-wise multiplication and $\mathrm{Conv}_{1\times1}$ is the convolution used for channel mapping.
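The RGB-branch recalibration described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's code: the weight shapes and the function name `mfe_rgb_branch` are assumptions, and the 1×1 convolution is written as a channel-mixing `einsum`.

```python
import numpy as np

def mfe_rgb_branch(F, w, W_map):
    """F: (B, C, H, W) feature map; w: (C,) learnable channel weights;
    W_map: (C_out, C) weight of a 1x1 convolution (channel mapping)."""
    weighted = F * w[None, :, None, None]            # broadcast channel-wise weighting
    # A 1x1 convolution is a linear map over the channel axis at every pixel.
    return np.einsum('oc,bchw->bohw', W_map, weighted)

rng = np.random.default_rng(0)
F = rng.standard_normal((2, 4, 8, 8))
out = mfe_rgb_branch(F, rng.standard_normal(4), rng.standard_normal((6, 4)))
assert out.shape == (2, 6, 8, 8)
```

The broadcast on `w` is what "after broadcasting" refers to: a per-channel scalar is expanded to the full spatial grid before multiplication.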
The specific implementation for the IR branch is as follows:
First, the feature map undergoes convolution and activation as shown in Equation (4):

$$F' = \delta\big(\mathrm{Conv}(F_{ir})\big) \quad (4)$$

Next, a spatial attention map is constructed as shown in Equation (5):

$$M_{s} = \sigma\big(\mathrm{Conv}(F')\big) \quad (5)$$

Finally, the attention map is applied to the feature map as shown in Equation (6):

$$\hat{F}_{ir} = M_{s} \odot F' \quad (6)$$

where $M_{s}$ denotes the spatial attention map, $\delta$ the activation function, and $\sigma$ the Sigmoid function.
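A minimal sketch of the IR-branch flow follows. It is an assumption-laden simplification: the first convolution of Equation (4) is elided (only the activation is kept), and the attention map is produced by a single channel-collapsing 1×1 convolution followed by a Sigmoid, which is one common way to build a spatial attention map.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mfe_ir_branch(F, W_att):
    """F: (B, C, H, W) IR features; W_att: (1, C) channel-collapsing 1x1-conv weight."""
    F_act = np.maximum(F, 0.0)                          # Eq. (4): activation (conv omitted)
    logits = np.einsum('oc,bchw->bohw', W_att, F_act)   # collapse channels to one map
    M = sigmoid(logits)                                 # Eq. (5): spatial attention map
    return F_act * M                                    # Eq. (6): broadcast over channels

rng = np.random.default_rng(0)
F = rng.standard_normal((1, 4, 6, 6))
out = mfe_ir_branch(F, rng.standard_normal((1, 4)))
assert out.shape == (1, 4, 6, 6)
```

Because `M` has a single channel, the multiplication in the last line broadcasts the same spatial weight to every channel, which is exactly what "applying the attention map to the feature map" means here.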
2.2.2. Global Light Estimator (GLE)
The main challenge of multimodal fusion under complex lighting conditions lies in the following: in low-light scenarios, the quality of visible light significantly decreases; in strong light or reflective areas, infrared images remain stable, but visible light becomes overexposed; and intermediate lighting transition areas are difficult to accurately describe using simple day/night labels. To address this issue, we propose the Global Light Estimator (GLE) module, which constructs an illumination feature vector based on statistical characteristics such as brightness mean, brightness standard deviation, and image entropy. The GLE then generates a normalized illumination score using a lightweight MLP. This score not only reflects global illumination intensity but also characterizes local brightness variation trends, providing a continuous and fine-grained illumination representation for the fusion module. Unlike traditional coarse day/night classifications, GLE can express the continuous change in "dim—normal—bright" lighting in finer intervals, making the fusion strategy genuinely adaptive to environmental lighting. The illumination score is shown in Figure 3.
Considering that solely using neural networks to predict illumination may still encounter issues such as poor generalization, we adopt a self-supervised learning strategy to construct the training target for the illumination score. Specifically, we compute a proxy illumination score from simple RGB image statistics and use it as a self-supervised signal during network training. The mean squared error (MSE) is then used as the illumination loss to guide the network in estimating illumination.
The module takes RGB images as input, extracts global image features through lightweight convolution and pooling structures, and outputs an illumination score in the range of [0,1]. This score is passed as a control signal to the subsequent fusion module, indicating the reliability of the RGB information in the current scene. Specifically, the GLE module consists of two convolution layers and a global average pooling unit, providing a compact structure that can efficiently learn cross-scene illumination distribution differences and offering dynamic references for modality fusion. The GLE module predicts the illumination score $s$ through the following process, as shown in Equation (7):

$$s = \sigma\Big(\mathrm{GAP}\big(W_{2} * \delta(W_{1} * I_{rgb})\big)\Big) \quad (7)$$

where $W_{1}$ and $W_{2}$ are the convolution weights, $\mathrm{GAP}$ represents global average pooling, and $\sigma$ is the Sigmoid function that normalizes the output to [0,1]. This score represents the illumination level of the current RGB image and is used to guide subsequent modality fusion.
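The GLE forward pass can be sketched as follows. The channel counts and the use of ReLU between the two convolutions are assumptions; the 1×1 convolutions are again written as channel-mixing `einsum` calls, so the sketch mirrors the conv → conv → GAP → Sigmoid flow rather than the exact layers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gle_score(I, W1, W2):
    """I: (3, H, W) RGB image; W1: (C, 3) and W2: (1, C) 1x1-conv weights."""
    h = np.maximum(np.einsum('oc,chw->ohw', W1, I), 0.0)  # conv 1 + ReLU
    h = np.einsum('oc,chw->ohw', W2, h)                   # conv 2
    return float(sigmoid(h.mean()))                       # GAP, then squash to [0, 1]

rng = np.random.default_rng(0)
s = gle_score(rng.random((3, 16, 16)),
              rng.standard_normal((8, 3)),
              rng.standard_normal((1, 8)))
assert 0.0 <= s <= 1.0
```

Applying GAP before the Sigmoid guarantees a single scalar per image, which is what makes the score usable as a global control signal for the fusion module.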
This paper linearly combines the three metrics to form a unified illumination score, as shown in Equation (8):

$$\tilde{s} = \mathrm{Clamp}\big(\mu + \beta\,\sigma_{b} + \gamma\,E,\ 0,\ 1\big) \quad (8)$$

where $\mu$, $\sigma_{b}$, and $E$ denote the brightness mean, brightness standard deviation, and image entropy, and $\beta$ and $\gamma$ are the weight coefficients. The Clamp operation normalizes the result to the [0,1] range to match the normalized illumination score in the GLE module.
In Equation (8), the mean brightness provides the primary cue of global exposure level, while the standard deviation and entropy serve as auxiliary cues capturing contrast dispersion and information complexity, respectively. The coefficients $\beta$ and $\gamma$ control the strength of these auxiliary corrections and are kept small so that the proxy illumination score remains mainly governed by the exposure level while allowing moderate refinement in ambiguous cases (e.g., strong shadows, local reflections, and over-exposure), where mean brightness alone is insufficient. The Clamp operation ensures a stable target range of [0,1] consistent with the normalized score predicted by GLE.
Here, the three image statistics—mean brightness, standard deviation, and entropy—describe global brightness, contrast distribution, and information complexity, respectively. They are complementary and can be computed directly from the RGB image, providing a lightweight yet effective supervisory signal for learning a continuous illumination score under complex lighting variations (e.g., shadows, reflections, and nighttime scenes).
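A proxy score of this form can be computed directly from image statistics, for example as below. The coefficient values (`beta`, `gamma`) and the entropy normalization by `log2(256)` are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def proxy_illumination(rgb, beta=0.1, gamma=0.1):
    """rgb: (H, W, 3) image with values in [0, 1]; returns a score in [0, 1]."""
    gray = rgb.mean(axis=2)
    mu = gray.mean()                                  # global exposure level
    sd = gray.std()                                   # contrast dispersion
    hist, _ = np.histogram(gray, bins=256, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    entropy = -(p * np.log2(p)).sum() / 8.0           # normalized by log2(256) bins
    return float(np.clip(mu + beta * sd + gamma * entropy, 0.0, 1.0))

dark = np.full((16, 16, 3), 0.05)
bright = np.full((16, 16, 3), 0.95)
assert proxy_illumination(dark) < proxy_illumination(bright)
```

Because every quantity is a cheap global statistic, the proxy can be recomputed per training image at negligible cost, which is what makes it practical as a self-supervised target.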
The final illumination loss is defined as the mean squared error between the network output and the target score, as shown in Equation (9):

$$\mathcal{L}_{light} = \big(s - \tilde{s}\big)^{2} \quad (9)$$

The overall loss function is shown in Equation (10):

$$\mathcal{L} = \mathcal{L}_{det} + \lambda\,\mathcal{L}_{light} \quad (10)$$

where $\mathcal{L}_{det}$ is the original loss function of RT-DETR and $\lambda$ is the weight for the illumination loss.
The illumination loss is introduced as an auxiliary self-supervised objective to encourage illumination-discriminative representations. The weight $\lambda$ balances the detection loss and the illumination regression loss and is kept small to prevent the auxiliary illumination objective from dominating optimization, so that the learned illumination score primarily acts as a guidance signal for light-aware fusion rather than an end task.
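The loss combination reduces to a one-liner; the value `lam=0.1` below is an illustrative assumption for the auxiliary weight, not the paper's setting.

```python
def total_loss(det_loss, s_pred, s_proxy, lam=0.1):
    """det_loss: detection loss value; s_pred: GLE output; s_proxy: proxy target."""
    light_loss = (s_pred - s_proxy) ** 2      # MSE for a scalar illumination score
    return det_loss + lam * light_loss        # weighted sum of the two objectives

# A perfect illumination prediction leaves the detection loss unchanged.
assert total_loss(1.0, 0.5, 0.5) == 1.0
assert abs(total_loss(1.0, 0.6, 0.5) - 1.001) < 1e-9
```

Keeping `lam` small means the gradient flowing back from the MSE term nudges the early features toward brightness discrimination without competing with the detection objective.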
Using proxy illumination has the following significant advantages:
- (1)
Trainability: It allows “illumination semantics” to be truly integrated into the backbone features.
If the mean, variance, and entropy are directly used as numerical inputs to the fusion layer, they have no gradient dependency on the backbone network features, and the network itself would never know how to improve brightness estimation. The self-supervised approach treats illumination estimation as a “soft label.” The MSE gradient is backpropagated to the GLE module, enabling the network’s early features to become more discriminative in terms of brightness and contrast.
- (2)
Semantic Flexibility: The network learns more appropriate representations.
A real nighttime scene is not simply equivalent to “low brightness.” High-contrast light spots and infrared high-heat targets can disrupt the mean value. The convolution estimator can automatically combine local textures, color saturation, noise textures, and other information, as long as it ultimately aligns with the illumination loss. This provides the model with the freedom to learn the illumination representations most helpful for detection, rather than rigidly relying on mathematical averages.
In summary, the network is able to learn more complex illumination patterns, thereby improving the model’s robustness.
2.2.3. Light-Aware Fusion (LAF)
Using the illumination scores obtained from the GLE module and the proxy illumination loss, the Light-Aware Fusion (LAF) module is designed for adaptive modality fusion. The core idea of LAF is to dynamically adjust the fusion ratio of RGB and IR features during the multi-scale fusion phase based on the illumination scores output by GLE. At each fusion scale, LAF applies weighting to the two modality features. To prevent one modality from being suppressed under extreme illumination scores, the illumination scores are smoothed in this approach.
The modality-weighted fusion process is defined in Equation (11):

$$w = \sigma\big(k\,(s - 0.5)\big), \qquad F_{fuse} = \mathrm{Fusion}\big(w \cdot F_{rgb},\ (1 - w)\cdot F_{ir}\big) \quad (11)$$

where $\sigma$ represents the Sigmoid function, $s$ is the illumination score output by GLE, and $k$ is the fusion sensitivity parameter. Fusion represents the combination of convolution, BatchNorm, and ReLU modules, which are used to further integrate the weighted modality features.
The weighted features are further integrated through concatenation and convolution, ultimately generating a more robust fused representation. This mechanism enhances the contribution of the infrared (IR) signal in low-light scenarios and retains the RGB details in bright environments, thereby improving the model’s consistency in perception and detection accuracy under different lighting conditions.
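The gating behavior can be sketched as below. The sigmoid centered at 0.5 and the sensitivity value `k=4.0` are assumptions about the exact smoothing form; the Conv-BN-ReLU "Fusion" block is replaced by a plain channel concatenation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def laf_weights(s, k=4.0):
    """s: GLE illumination score in [0, 1]; k: fusion sensitivity (assumed)."""
    w_rgb = sigmoid(k * (s - 0.5))    # smoothed RGB reliability weight
    return w_rgb, 1.0 - w_rgb         # IR takes the complementary weight

def laf_fuse(F_rgb, F_ir, s, k=4.0):
    w_rgb, w_ir = laf_weights(s, k)
    # Gate each branch, then concatenate; the Conv-BN-ReLU block is omitted here.
    return np.concatenate([w_rgb * F_rgb, w_ir * F_ir], axis=1)

fused = laf_fuse(np.ones((1, 2, 3, 3)), np.ones((1, 2, 3, 3)), 0.5)
assert fused.shape == (1, 4, 3, 3)
w_day, _ = laf_weights(0.9)
w_night, _ = laf_weights(0.1)
assert w_day > w_night                # brighter scenes trust RGB more
```

The sigmoid is what implements the smoothing mentioned above: even an extreme score like 0 or 1 yields a non-saturated weight, so neither modality is ever fully suppressed.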
2.2.4. Cross-Layer Dual-Branch Interaction Module (CL-DBIM)
Due to the response differences between infrared and visible light images at different semantic levels, direct fusion often leads to semantic shifts. To address this, we design a Cross-Layer Dual-Branch Interaction Module (CL-DBIM), as shown in Figure 4. This module consists of four key subcomponents: a channel alignment mechanism, a cross-layer modality attention interaction module, SE-Attention, and a weighted fusion mechanism.
Channel Alignment Mechanism:
For input feature maps $F_{1}$ and $F_{2}$, the channel dimensions may not be consistent. To ensure the computability of subsequent fusion, the feature map with fewer channels is adjusted to match the channels of the other feature map. The mapping formula is given in Equation (12):

$$F_{2}' = \mathrm{SiLU}\Big(\mathrm{BN}\big(\mathrm{Conv}_{1\times1}(F_{2})\big)\Big) \quad (12)$$

This step ensures the dimensional consistency of the input features: the $1 \times 1$ convolution avoids introducing too many parameters, Batch Normalization (BN) enhances training stability, and SiLU adds non-linearity to the expression.
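The Conv-BN-SiLU alignment step can be sketched as follows. The BatchNorm here uses batch statistics directly (no learned scale/shift), which is a simplification of the real layer.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def align_channels(F, W, eps=1e-5):
    """F: (B, C_in, H, W) -> (B, C_out, H, W), with C_out = W.shape[0]."""
    y = np.einsum('oc,bchw->bohw', W, F)               # 1x1 conv (channel mapping)
    mu = y.mean(axis=(0, 2, 3), keepdims=True)         # per-channel batch statistics
    var = y.var(axis=(0, 2, 3), keepdims=True)
    return silu((y - mu) / np.sqrt(var + eps))         # BN + SiLU

rng = np.random.default_rng(1)
F = rng.standard_normal((2, 3, 4, 4))
out = align_channels(F, rng.standard_normal((8, 3)))
assert out.shape == (2, 8, 4, 4)
```

After this step both branches share the same channel count, so every later element-wise operation in the module is well-defined.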
Cross-Layer Modality Attention Interaction Module:
This module is designed to establish a complementary channel attention-guided interaction mechanism between modalities at different semantic levels, addressing the issue of semantic inconsistency between modalities at different layers. The input consists of two sets of modality feature maps, $F_{ir}$ and $F_{rgb}$, which come from the infrared and visible light modalities, respectively. The interaction process is as follows:
- (1)
Feature Mapping: First, the features from both modalities are mapped to the same dimensional space. The purpose is to construct a semantically shared channel space, allowing effective information exchange and semantic alignment between the different modalities through the attention mechanism, as shown in Equations (13) and (14):

$$F_{ir}' = P_{ir}(F_{ir}) \quad (13)$$

$$F_{rgb}' = P_{rgb}(F_{rgb}) \quad (14)$$

where $P_{ir}$ and $P_{rgb}$ represent learnable linear projection operators (realized via 1×1 convolutions, i.e., channel-wise linear layers) employed to align the respective channel spaces. $F_{ir}'$ is the feature map obtained by mapping $F_{ir}$ to the space of $F_{rgb}$; similarly, $F_{rgb}'$ is the feature map obtained by mapping $F_{rgb}$ to the space of $F_{ir}$.
- (2)
Attention-Guided Enhancement: The channel attention mechanism is employed to extract global channel weights from one set of feature maps, thereby guiding the enhancement of the features in the other modality. This process enables the model to selectively amplify the most informative channels, facilitating cross-modal alignment and improving the overall feature representation, as shown in Equations (15) and (16):

$$A_{ir} = \sigma\Big(W_{up}^{ir}\,\delta\big(W_{down}^{ir}\,F_{ir}'\big)\Big) \quad (15)$$

$$A_{rgb} = \sigma\Big(W_{up}^{rgb}\,\delta\big(W_{down}^{rgb}\,F_{rgb}'\big)\Big) \quad (16)$$

where $W_{down}^{ir}$ and $W_{down}^{rgb}$ represent the downsampling convolutions, while $W_{up}^{ir}$ and $W_{up}^{rgb}$ are the upsampling convolutions. $\sigma$ denotes the Sigmoid activation function. $A_{ir}$ and $A_{rgb}$ correspond to the channel-wise attention weights, with the same shape as the projected features.
- (3)
Residual Enhancement: The projected features are multiplied by the attention map from the other modality, and the original modality features are enhanced through a residual learning mechanism. This process enables the model to retain essential information from the original features while integrating cross-modal enhancements, as shown in Equations (17) and (18):

$$\hat{F}_{ir} = F_{ir} + F_{ir}' \odot A_{rgb} \quad (17)$$

$$\hat{F}_{rgb} = F_{rgb} + F_{rgb}' \odot A_{ir} \quad (18)$$

where $\odot$ denotes element-wise multiplication across channels, which facilitates bidirectional alignment and complementarity of semantic information between modalities.
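The three interaction steps can be sketched end-to-end as below. This is a simplified reading of the module: the SE-style squeeze via global pooling inside the attention branch, the weight shapes, and the cross-application of the attention maps are all assumptions about the exact layer layout.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(W, F):
    return np.einsum('oc,bchw->bohw', W, F)

def channel_attention(F, W_down, W_up):
    """SE-style channel weights; W_down: (C//r, C), W_up: (C, C//r)."""
    g = F.mean(axis=(2, 3))                          # global pooling to (B, C)
    h = np.maximum(g @ W_down.T, 0.0)                # downsampling conv + ReLU
    return sigmoid(h @ W_up.T)[:, :, None, None]     # upsampling conv + Sigmoid

def interact(F_ir, F_rgb, P_ir, P_rgb, att_ir, att_rgb):
    Fi = conv1x1(P_ir, F_ir)                         # shared-space projections
    Fv = conv1x1(P_rgb, F_rgb)
    A_ir = channel_attention(Fi, *att_ir)            # channel weights per modality
    A_rgb = channel_attention(Fv, *att_rgb)
    out_ir = F_ir + Fi * A_rgb                       # cross-guided residual update
    out_rgb = F_rgb + Fv * A_ir
    return out_ir, out_rgb

rng = np.random.default_rng(0)
F_ir = rng.standard_normal((1, 4, 5, 5))
F_rgb = rng.standard_normal((1, 4, 5, 5))
proj = lambda: rng.standard_normal((4, 4))
att = lambda: (rng.standard_normal((2, 4)), rng.standard_normal((4, 2)))
o_ir, o_rgb = interact(F_ir, F_rgb, proj(), proj(), att(), att())
assert o_ir.shape == F_ir.shape and o_rgb.shape == F_rgb.shape
```

Note that each modality's residual update is gated by the *other* modality's attention weights, which is what makes the interaction bidirectional.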
To further refine the key information and enhance the fusion accuracy, the SE (Squeeze-and-Excitation) channel attention mechanism is introduced. The interacted features $\hat{F}_{ir}$ and $\hat{F}_{rgb}$ are first concatenated, and the SE module is then applied to perform global channel-wise weighting, where $\alpha$ represents the channel-wise weight coefficient, which determines the extent to which channel information is preserved, and $\tilde{F}_{ir}$ and $\tilde{F}_{rgb}$ represent the re-weighted modality feature branches.
Finally, the weighted outputs are fused with the previously enhanced features through residual learning to obtain the final fused result:

$$F_{fuse} = \big(\hat{F}_{ir} + \tilde{F}_{ir}\big) + \big(\hat{F}_{rgb} + \tilde{F}_{rgb}\big)$$

where $\hat{F}_{ir}$ and $\hat{F}_{rgb}$ represent the modality features after interaction and enhancement, $\tilde{F}_{ir}$ and $\tilde{F}_{rgb}$ denote the weighted residual information from the channel attention outputs, and $F_{fuse}$ is the final fused result.
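The SE re-weighting and final residual fusion can be sketched as follows. The reduction ratio and the fully connected weight shapes are assumptions; the point of the sketch is the concatenate → squeeze → excite → split → residual-add order of operations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_fuse(F_ir, F_rgb, W_down, W_up):
    """F_ir, F_rgb: (B, C, H, W); W_down: (2C//r, 2C), W_up: (2C, 2C//r)."""
    cat = np.concatenate([F_ir, F_rgb], axis=1)             # (B, 2C, H, W)
    g = cat.mean(axis=(2, 3))                               # squeeze to (B, 2C)
    a = sigmoid(np.maximum(g @ W_down.T, 0.0) @ W_up.T)     # excitation weights
    cat_w = cat * a[:, :, None, None]                       # channel re-weighting
    C = F_ir.shape[1]
    R_ir, R_rgb = cat_w[:, :C], cat_w[:, C:]                # split back per modality
    return (F_ir + R_ir) + (F_rgb + R_rgb)                  # residual weighted fusion

rng = np.random.default_rng(0)
F_ir = rng.standard_normal((1, 3, 4, 4))
F_rgb = rng.standard_normal((1, 3, 4, 4))
fused = se_fuse(F_ir, F_rgb, rng.standard_normal((2, 6)), rng.standard_normal((6, 2)))
assert fused.shape == (1, 3, 4, 4)
```

Because the excitation weights lie in (0, 1), the residual add guarantees the original interacted features are preserved even when a channel is heavily down-weighted.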
The CL-DBIM module achieves effective alignment and fusion of multimodal features across different semantic levels through a step-by-step structural design. First, the module uses a 1 × 1 convolution to automatically align the channels of the input modalities, ensuring uniform feature dimensions. Subsequently, the cross-modal interaction module, guided by a bidirectional attention mechanism, facilitates information complementarity, addressing the inconsistencies in semantic expression between modalities. Building on this, the module further utilizes SE (Squeeze-and-Excitation) channel attention to globally re-weight the concatenated features, enabling the model to automatically focus on the most semantically valuable feature channels. Finally, residual weighted fusion is applied to preserve the original information to the greatest extent and enhance semantic consistency.