This section presents the overall architecture and key modules of the proposed infrared and visible image object detection framework, which focuses on accurate localization and boundary-aware prediction of salient objects. As illustrated in Figure 1, the framework consists of the following core modules: the Frequency Residual Selective Transformer (FREFormer), the Homogeneous Frequency Refined Block (HFRB), the Heterogeneous Spatial-Channel Frequency Fusion Block (HSFB), and the Frequency Reconstruction Guided Module (FRGM), along with a joint loss function for optimization.
3.2. FREFormer Encoder of Infrared and Visible Image
The proposed method makes extensive use of the Fast Fourier Transform (FFT) and the Inverse Fast Fourier Transform (IFFT). Theoretically, point-wise multiplication in the frequency domain is equivalent to circular convolution in the spatial domain, so frequency-domain filtering inherits the translational equivariance of convolution. In practice, however, the FFT assumes that the signal is periodic, which does not hold for finite images: the implied wrap-around creates discontinuities at the image boundaries and introduces boundary artifacts. To address this issue, SCFusion adopts a hybrid design that integrates frequency-domain modules with MetaFormer-style blocks [30] and alternates them with standard convolutional layers (e.g., in the downsampling and encoder stages). These convolutional layers introduce strong local inductive biases and smooth local regions of the image, which mitigates the artifacts caused by boundary discontinuities. At the same time, this design enhances the network's ability to capture local details.
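The equivalence and the wrap-around behavior described above can be checked numerically. The following is a minimal NumPy sketch and not part of the proposed method; the function names and the toy kernel are illustrative assumptions.

```python
import numpy as np

def circular_conv2d(x, k):
    """Direct 2D circular convolution of x with a same-sized kernel k."""
    out = np.zeros(x.shape, dtype=float)
    for i in range(k.shape[0]):
        for j in range(k.shape[1]):
            if k[i, j] != 0.0:
                # np.roll wraps pixels around the border (periodic extension).
                out += k[i, j] * np.roll(x, shift=(i, j), axis=(0, 1))
    return out

def fft_conv2d(x, k):
    """Same operation via point-wise multiplication of the two spectra."""
    return np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(k)).real

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
kernel = np.zeros((8, 8))
kernel[0, 0], kernel[0, 1], kernel[1, 0] = 0.5, 0.25, 0.25

spatial = circular_conv2d(image, kernel)
spectral = fft_conv2d(image, kernel)

# Wrap-around: an impulse in the last column leaks into the first column
# when it is shifted one pixel to the right in the frequency domain.
impulse = np.zeros((8, 8))
impulse[0, 7] = 1.0
shift_right = np.zeros((8, 8))
shift_right[0, 1] = 1.0
leaked = fft_conv2d(impulse, shift_right)
```

Here `spatial` and `spectral` agree up to floating-point error, and `leaked` places the impulse at column 0, which is the kind of boundary artifact that the interleaved convolutional layers are intended to dampen.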
Based on this MetaFormer-style block structure, we design the encoder module, termed the FREFormer Encoder. As illustrated in Figure 2, the FREFormer Encoder adopts a two-stage hierarchical structure, with each stage containing two layers: a downsampling layer and a FREFormer Block. This hybrid approach leverages the local inductive bias of convolutions in the early stages and the global receptive field of FFT-based filters in the later stages. The downsampling layer reduces the spatial resolution of the input feature while increasing channel capacity, and it incorporates Layer Normalization.
Following downsampling, the network retains significant frequency information, comprising both target-related details essential for detection and irrelevant background noise. The amplification or suppression of these frequency signals is critical for detection performance, creating a pressing need for an effective frequency-processing filter.
Conventional spectral filtering approaches typically learn a static, global complex filter. Once trained, this filter remains fixed and is applied indiscriminately to all input images. This “content-agnostic” paradigm assumes that a single spectral modulation strategy can generalize across all scenarios. However, real-world scenes are non-stationary; the frequency distribution of a cluttered scene differs significantly from that of a clean, simple object. A static filter lacks the flexibility to adapt to these varying semantic contents, often leading to suboptimal feature extraction where noise is amplified or edges are blurred.
To address these limitations, we propose a dynamic filtering architecture: the learnable selective filter (LSF). The LSF serves as the overarching architectural unit designed to replace the multi-head self-attention mechanism in the transformer, defining the complete pipeline for frequency-domain modulation. Embedded within the LSF is its core functional module, the learnable selective filter generation (LSFG) module, which acts as the brain of the filter.
The primary motivation behind LSFG is to introduce adaptability into frequency-domain processing. Unlike static approaches that use a fixed weight matrix, LSFG functions as a dynamic parameter generator, synthesizing a unique filter weight tailored to each input instance. The mathematical formulation of LSFG is defined in Equation (2).
Specifically, the module first captures global semantics through a global average pooling operator, which compresses spatial information into a concise global descriptor representing the image's overall "energy." Building on this statistical profile, the generator synthesizes the dynamic filter weights via a normalization process. By explicitly conditioning the filter generation on these input-specific characteristics, the model achieves "instance-aware" processing. This ensures that the filter adapts to the data, distinguishing informative frequency components from noise for each specific instance.
With the dynamic weights generated by LSFG, the overall learnable selective filter operates as described in Equation (3).
The process involves two key steps: first, filter synthesis, where input features are processed via MLP and Softmax layers to generate instance-specific weights; and second, dynamic modulation, where the synthesized filter modulates the input spectrum via element-wise multiplication, followed by feature reconstruction using the inverse FFT (IFFT).
This design establishes a coherent mechanism in which the spatial content (captured by the global descriptor) explicitly governs the spectral bias of the filter. By adapting to the input, the model can preserve high-frequency components to maintain sharpness in textured regions or suppress specific bands to reduce noise in smooth backgrounds. Consequently, by combining the global receptive field of the FFT with adaptive, instance-specific weights, the learnable selective filter offers a compelling alternative to multi-head self-attention.
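The two steps above (filter synthesis, then dynamic modulation) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the two-layer MLP with ReLU, the bank of complex filters mixed by the Softmax output, and all names (`learnable_selective_filter`, `w1`, `w2`, `filter_bank`) are hypothetical choices made only to obtain a runnable example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1D vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def learnable_selective_filter(x, w1, w2, filter_bank):
    """Instance-aware spectral filtering of a real feature map x of shape (C, H, W).

    w1: (C, hidden) and w2: (hidden, K) are hypothetical MLP weights.
    filter_bank: (K, H, W) complex filters (an assumption of this sketch).
    """
    # Filter synthesis: global average pooling -> MLP -> Softmax.
    descriptor = x.mean(axis=(1, 2))              # (C,) global descriptor
    hidden = np.maximum(descriptor @ w1, 0.0)     # ReLU, assumed activation
    mix = softmax(hidden @ w2)                    # (K,) instance-specific weights
    dynamic_filter = np.tensordot(mix, filter_bank, axes=1)  # (H, W)

    # Dynamic modulation: element-wise product in the frequency domain, then IFFT.
    spectrum = np.fft.fft2(x, axes=(-2, -1))
    filtered = spectrum * dynamic_filter[None, :, :]
    # The real part is kept; a non-symmetric filter would leave an imaginary residue.
    return np.fft.ifft2(filtered, axes=(-2, -1)).real, mix
```

Because the mixing weights sum to one, a bank of all-ones filters yields an all-pass filter and returns the input unchanged, which is a convenient sanity check. Different inputs produce different `mix` vectors and therefore different filters, which is the instance-aware behavior the text describes.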
Finally, integrating this mechanism into the network, the overall FREFormer Block is formulated as:
3.3. Cross-Frequency Guided Interaction Module (CFGIM)
Building upon the frequency-enhanced representations extracted by the FREFormer, we introduce the Cross-Frequency Guided Interaction Module (CFGIM), as illustrated in Figure 3, to resolve inherent conflicts between infrared and visible modalities. Unlike traditional fusion methods that implicitly assume strict pixel-to-pixel alignment in the spatial domain, CFGIM reframes multi-modal interaction as a spectral decoupling and recombination process.
Our design is grounded in a fundamental property of Fourier analysis: phase encodes spatial structural information, while amplitude represents signal intensity. Motivated by this physical interpretability, the CFGIM employs a two-stage interaction mechanism:
(1) Homogeneous Frequency Refined Block (HFRB): This module serves as a content-aware active filter. Prior to fusion, it adaptively suppresses modality-specific noise (e.g., thermal grain or background clutter) within each branch, effectively preventing noise amplification during the interaction process.
(2) Heterogeneous Spatial-Channel Frequency Fusion Block (HSFB): This module performs the core structure–content decoupling. It explicitly aligns the structural edges of the two modalities via phase calibration—thereby addressing the spatial misalignment issue—while simultaneously fusing their salient intensities via amplitude interaction.
This divide-and-conquer strategy in the frequency domain provides a transparent mechanism for cross-modal integration, ensuring that the fused features inherit both the sharpest structures and the most salient semantics from the respective inputs.
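The Fourier property underlying this design can be illustrated with a short NumPy check: circularly shifting an image leaves the amplitude spectrum unchanged and alters only the phase, so the phase is what carries the spatial layout. The sketch below is illustrative only, and the function name is a hypothetical choice.

```python
import numpy as np

def shift_changes_only_phase(image, shift):
    """Compare the spectra of an image and a circularly shifted copy."""
    shifted = np.roll(image, shift=shift, axis=(0, 1))
    spec = np.fft.fft2(image)
    spec_shifted = np.fft.fft2(shifted)

    same_amplitude = np.allclose(np.abs(spec), np.abs(spec_shifted))
    different_spectrum = not np.allclose(spec, spec_shifted)

    # Original amplitude + phase of the shifted copy reproduces the shifted image:
    # the location of the content travels with the phase.
    recombined = np.abs(spec) * np.exp(1j * np.angle(spec_shifted))
    rebuilt = np.fft.ifft2(recombined).real
    return same_amplitude, different_spectrum, np.allclose(rebuilt, shifted)

rng = np.random.default_rng(0)
result = shift_changes_only_phase(rng.random((16, 16)), shift=(3, 5))
```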
3.3.1. Homogeneous Frequency Refined Block (HFRB)
We design the Homogeneous Frequency Refined Block (HFRB) as a dual-branch mechanism that predicts two dense refinement weights for element-wise adjustment. The infrared and visible features are treated uniformly as the input of this block. Within the module, features are processed through two branches: a global grouped aggregation branch and a local frequency calibration branch.
The upper branch is primarily responsible for capturing global context information and performing grouped weighted fusion, a strategy similar to Split-Attention. First, the input feature is spatially compressed via global average pooling (AvgPool) and then passed through a fully connected (FC) layer and a Sigmoid activation function to generate a global channel weight. This global channel weight is used to weight the frequency-domain feature.
The second branch is a frequency-guided mechanism that aims to enhance frequency feature expression by capturing local dependencies between channels and performing the final feature calibration. This branch also uses an average pooling layer, followed by a 1D convolution (Conv1D) and a fully connected (FC) layer.
This design effectively captures interaction information between adjacent frequency components (i.e., adjacent channels) in the frequency domain, avoiding the destruction of local frequency correlations caused by fully connected layers.
Simultaneously, the input feature is split along the channel dimension into K groups. These groups are individually weighted using the generated weights via element-wise multiplication (Hadamard product). Finally, the weighted grouped features are aggregated through element-wise addition (ADD) to obtain the intermediate feature.
This aggregation aims to adaptively retain salient global features from either the infrared or the visible modality through a gating mechanism. Through this dual-branch design (global grouped aggregation plus local frequency calibration), the model preserves both texture details and target features.
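The building blocks named in this subsection can be sketched in NumPy as below. This is a rough sketch under assumptions rather than the block as implemented by the authors: the Sigmoid at the end of the local branch, the Conv1D kernel, the square FC matrices, and the way the weights are applied before grouping are hypothetical, and how the two branches are finally combined is not specified here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def global_channel_weight(x, fc):
    """Global branch: AvgPool -> FC -> Sigmoid. x: (C, H, W), fc: (C, C)."""
    pooled = x.mean(axis=(1, 2))
    return sigmoid(fc @ pooled)

def local_channel_weight(x, kernel, fc):
    """Local branch: AvgPool -> Conv1D over neighbouring channels -> FC.

    The closing Sigmoid is an assumption of this sketch.
    """
    pooled = x.mean(axis=(1, 2))
    # Cross-correlation along the channel axis, output length equals C.
    mixed = np.correlate(pooled, kernel, mode="same")
    return sigmoid(fc @ mixed)

def grouped_aggregate(x, weight, k):
    """Weight each channel, split into k groups, and add the groups element-wise.

    The channel count must be divisible by k.
    """
    weighted = x * weight[:, None, None]
    groups = np.split(weighted, k, axis=0)
    return np.sum(groups, axis=0)          # (C // k, H, W)
```

The Conv1D step only mixes each channel descriptor with its immediate neighbours, which is how it keeps local correlations that a dense FC layer would blend across all channels.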
3.3.2. Heterogeneous Spatial-Channel Frequency Fusion Block (HSFB)
The detailed execution flow of the HSFB is summarized in Algorithm 1. Unlike standard CNN modules that operate on entangled features, the HSFB is designed to explicitly decouple the “where” (spatial structure) from the “what” (semantic intensity).
Algorithm 1. Heterogeneous Spatial-Channel Frequency Fusion Block (HSFB).
Require: Input features (IR and visible); learnable blocks.
Ensure: Fused feature map.
for each modality do (parallel processing for each modality)
    Projection to latent space
    Step 1: Spatial Decomposition (the spatial phase is saved for final reconstruction)
    Step 2: Channel Frequency Interaction (FFT along the channel dimension, followed by learnable modulation)
    Step 3: Dual-Domain Reconstruction (obtain the refined intensity)
    Step 4: Structure–Content Recombination (inject the refined intensity into the original skeleton)
end for
Element-wise fusion of the two modality outputs
return the fused feature map
Given the frequency-refined infrared and visible features generated by the preceding HFRB module, these features are first processed by a CBS block to project them into a shared embedding space. To initiate the decoupling, we transform the features into the spatial frequency domain via the 2D FFT. This step extracts two distinct physical components: the spatial phase, which represents the "skeleton" of the image by encoding structural edges, and the spatial amplitude, which represents the "muscle" or signal intensity. Both components are obtained from the spatial FFT of the projected features. By isolating the spatial phase, we preserve the precise location of objects, which helps prevent the boundary blurring often caused by spatial misalignment.
While the spatial phase is preserved to maintain structure, the amplitude (intensity) requires cross-channel calibration to highlight salient targets. We map the spatial amplitude into the channel-frequency domain using a channel-wise FFT (CFFT), treating the channel dimension as a temporal sequence to capture inter-channel dependencies.
In this domain, the CLC module (comprising convolutions) acts as a learnable frequency filter. It dynamically modulates the channel spectrum to suppress background noise and enhance target-related responses.
Structure-preserving reconstruction is the most critical step for addressing cross-modal interference. We first reconstruct the refined spatial amplitude using the inverse channel FFT.
Then, following the logic in Algorithm 1, we perform a Structure–Content Recombination: the calibrated intensity is combined with the original spatial phase. This keeps the enhanced semantic features aligned with the original object boundaries, which addresses the spatial misalignment issue.
Finally, the features from both branches are fused via element-wise addition to integrate the complementary information.
Through this mechanism, the HSFB achieves a physically interpretable fusion: it “transplants” the robust, noise-free semantic intensities onto the precise structural skeletons of the source images.
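The flow of the HSFB can be sketched in NumPy as follows. This is a simplified sketch under assumptions: the CBS projection is omitted, the CLC module is replaced by a hypothetical per-channel-frequency gain vector (`modulation`), the final inverse 2D FFT back to the spatial domain is assumed, and the function names are illustrative.

```python
import numpy as np

def hsfb_branch(x, modulation):
    """One modality branch. x: (C, H, W) real features, modulation: (C,) gains."""
    # Step 1: spatial decomposition into amplitude and phase.
    spectrum = np.fft.fft2(x, axes=(-2, -1))
    amplitude = np.abs(spectrum)
    phase = np.angle(spectrum)               # saved for the final reconstruction

    # Step 2: channel-frequency interaction (FFT along the channel axis,
    # then a stand-in for the learnable CLC modulation).
    channel_spectrum = np.fft.fft(amplitude, axis=0)
    channel_spectrum = channel_spectrum * modulation[:, None, None]

    # Step 3: inverse channel FFT gives the refined intensity.
    # Note: the refined amplitude is not constrained to stay non-negative here.
    refined_amplitude = np.fft.ifft(channel_spectrum, axis=0).real

    # Step 4: recombine the refined intensity with the original phase.
    recombined = refined_amplitude * np.exp(1j * phase)
    return np.fft.ifft2(recombined, axes=(-2, -1)).real

def hsfb(x_ir, x_vis, mod_ir, mod_vis):
    """Element-wise fusion of the two modality branches."""
    return hsfb_branch(x_ir, mod_ir) + hsfb_branch(x_vis, mod_vis)
```

With an all-ones `modulation`, each branch reduces to the identity, so the block returns the plain sum of the two inputs. Any change in the output therefore comes from the amplitude calibration alone, while the phase of each modality passes through untouched.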
3.4. Frequency Reconstruction Guided Module (FRGM) for Decoder
Transitioning from deep semantic features back to high-resolution detection masks presents a critical challenge: the “Information Attenuation” dilemma. Standard decoders typically rely on spatial upsampling, which tends to smooth out high-frequency details, resulting in the common “blurred boundary” problem where object edges become ambiguous. To overcome this limitation, we design the Frequency Reconstruction Guided Module (FRGM). As illustrated in the decoder architecture, the FRGM operates on the rigorous principle of Multi-resolution Analysis (MRA), fundamentally transforming the decoding process from simple interpolation into a spectral-aware reconstruction.
Specifically, the FRGM decomposes the fused features into distinct frequency bands, assigning explicit physical roles to each component: Low-frequency bands serve as the Semantic Anchor. They guide the coarse localization of salient objects, ensuring that the global shape and category information remain consistent during upsampling. High-frequency bands serve as the Boundary Sharpener. They are explicitly isolated and enhanced to regress sharp contours and fine-grained textures, which are typically the first to be lost in deep networks.
By explicitly recovering high-frequency components, this coarse-to-fine reconstruction strategy encourages the model to produce detection masks with precise edges and sharp target boundaries.
The FRGM decomposes the fused feature into multiple frequency bands to enhance target edge localization in the decoder. First, a binary mask is defined for each of four frequency bands, with each band delimited by its own frequency threshold. The feature of the bth frequency band is then obtained by multiplying the Fourier transform of the fused feature by the corresponding binary mask and applying the inverse Fourier transform.
A convolution layer followed by a Sigmoid activation generates a modulation map for the bth frequency band. The final enhanced feature is then obtained from the frequency band features and their modulation maps.
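The band decomposition described above can be sketched in NumPy as follows. This is an illustrative sketch under assumptions: the radial definition of frequency, the example thresholds, the gated sum used as the final aggregation, and the precomputed `gates` standing in for the Conv + Sigmoid modulation maps are all hypothetical choices, as are the function names.

```python
import numpy as np

def radial_band_masks(h, w, thresholds):
    """Binary masks that split the 2D spectrum into len(thresholds) + 1 bands.

    Frequencies are in cycles per pixel; thresholds must be increasing.
    """
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    edges = [0.0, *thresholds, np.inf]
    return [(radius >= lo) & (radius < hi) for lo, hi in zip(edges[:-1], edges[1:])]

def band_features(f, masks):
    """Mask the spectrum of f (C, H, W) band by band and transform back."""
    spectrum = np.fft.fft2(f, axes=(-2, -1))
    # Radially symmetric masks keep each band feature real-valued.
    return [np.fft.ifft2(spectrum * m, axes=(-2, -1)).real for m in masks]

def frgm_enhance(f, masks, gates):
    """Gated sum of the band features (assumed form of the final aggregation)."""
    return sum(g * fb for g, fb in zip(gates, band_features(f, masks)))

rng = np.random.default_rng(0)
feature = rng.standard_normal((3, 16, 16))
masks = radial_band_masks(16, 16, thresholds=(0.1, 0.2, 0.35))  # four bands
enhanced = frgm_enhance(feature, masks, gates=[1.0, 1.0, 1.0, 1.0])
```

Because the four masks partition the spectrum, setting every gate to one reproduces the input exactly. Gates below one for a given band attenuate that band, for example the lowest band for semantic context or the highest band for edges.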
The FRGM is embedded into the decoder for object detection. Each decoder block consists of a 3 × 3 convolution layer, batch normalization, and GELU activation. The decoder consists of four progressive stages, with the FRGM serving as the core enhancement module at each level. To leverage the multi-frequency information, the fused feature from the fusion bottleneck is downsampled or upsampled to match the spatial resolution of each decoder stage. The FRGM then processes this resized feature to generate the frequency-enhanced guidance feature for that stage.
Specifically, the lth decoder block integrates the upsampled feature from the previous stage with the frequency-guided feature of the current stage.
The FRGM acts as a multi-scale frequency filter across all four stages, ensuring that both high-frequency edge details and low-frequency semantic consistency are adaptively injected into the reconstruction process. This hierarchical integration allows the decoder to precisely localize targets by reconstructing sharp boundaries from the multi-frequency components.