1. Introduction
Thermal Power Plants (TPPs) play a vital role in national energy infrastructure and environmental monitoring, making their automatic detection in remote sensing imagery increasingly important. As complex industrial facilities composed of diverse sub-components such as cooling towers, chimneys, coal yards, and boiler houses, TPPs exhibit irregular spatial arrangements and diverse appearances, posing significant challenges for object detection models [1,2].
Figure 1 shows representative samples of TPPs with varied spatial configurations and separate, irregular components.
Conventional object detection methods in remote sensing have largely focused on compact, visually coherent targets such as vehicles or buildings [3,4,5]. Fan et al. [6] proposed a CSDP-enhanced YOLOv7 framework with MPDIoU loss to improve the detection of small, ambiguous ships under complex conditions. Wang et al. [7] introduced a general foundation model for building interpretation that unifies building extraction and change detection, demonstrating strong generalization and cross-task learning across diverse remote sensing datasets. These methods often rely on RGB imagery and are typically optimized for targets with consistent shape and texture. In contrast, TPPs are representative composite objects, whose spatial configuration and heterogeneous components must be jointly interpreted; moreover, obstacles such as the lack of public datasets and ambiguous object definitions remain. Several related works [8,9] have addressed composite or facility-scale targets. Liu et al. [8] proposed ABNet, an adaptive multiscale detector that is effective for airports, harbors, and train stations. Cai et al. [9] designed a weight-balanced loss function for hard example mining, making detection more robust to the few hard examples. Sun et al. [10] introduced an anchor-free network for irregularly shaped objects in everyday scenes. Yin et al. [11] introduced sal-MFN, a two-stage framework for TPP detection in remote sensing images, which incorporates a saliency-enhanced module to emphasize prominent regions and a multi-scale feature network (MFN) to adapt to components of varying scales. Yin et al. [12] proposed PCAN, a one-stage detection framework for TPPs in remote sensing imagery, which integrates context attention and multi-scale feature extraction via deformable convolutions and introduces a part-based attention module to enhance the recognition of structural components in facility-type objects. Recently, Yuan et al. [2] introduced REPAN, which combines part-level proposals with a Transformer-based global context modeling strategy, demonstrating strong capability in complex composite object detection. Most of the aforementioned methods rely on RGB imagery from Google Earth, while the use of other high-resolution multispectral satellite data remains limited. Overall, CNN-based detectors [13,14] struggle with such complex semantics due to their limited receptive fields and lack of explicit spatial reasoning capabilities. Part-based approaches [1,12,15] attempt to mitigate this by modeling constituent components, but frequently depend on unsupervised clustering methods such as K-means for part localization, leading to unstable learning and poor interpretability.
Figure 1. Representative TPPs featuring diverse and irregularly distributed components, including chimneys (green boxes), coal yards (red boxes), pools (blue boxes), and other industrial buildings. (a) Bathinda; (b) Ukai; (c) Korba [12].
Another limitation of most current TPP detection methods lies in their reliance on a single modality. RGB images alone may not sufficiently capture spectral diversity, especially in scenes with vegetation occlusion, thermal interference, or complex backgrounds [16]. Recent advances in multispectral imaging, particularly the use of near-infrared (NIR) bands, have shown that combining RGB and NIR modalities can enhance material and structural contrast [17,18], thus improving detection under challenging conditions. Compared with single-modal object detection methods (e.g., SSD [19], RetinaNet [20], Faster R-CNN [21], DETR [22]), multimodal approaches integrate complementary spectral information from different modalities (such as RGB, thermal, or depth data), leading to more robust and accurate object detection under challenging conditions. By leveraging the strengths of each modality, these methods can better handle issues such as low lighting, occlusion, or background clutter, which often hinder the performance of single-modal detectors. Consequently, multimodal object detection exhibits greater adaptability and fewer limitations, making it more suitable for real-world applications in complex environments. However, simply concatenating multispectral channels often fails to exploit the full potential of cross-modal interactions, which has motivated a growing body of research on more effective fusion strategies that fully leverage the complementary information between modalities. For example, to enhance multispectral object detection, Shen et al. [23] proposed a dual cross-attention fusion framework that models global feature interactions across RGB and thermal modalities, demonstrating that effective cross-modal integration significantly improves detection accuracy and efficiency. Fang et al. [24] introduced CMAFF, a lightweight cross-modality fusion module based on joint attention to common and differential features, enabling high-performance multispectral detection with minimal computational overhead.
Most commonly used attention mechanisms, such as SE [25], CBAM [26], ECA [27], CA [28], and the Swin Transformer [29], operate in the spatial or channel domain and typically rely on convolutional or MLP-based structures to process and enhance image features. In parallel, frequency-domain information has proven effective for enhancing visual representations. For example, Chi et al. [30] proposed Fast Fourier Convolution (FFC), which uses Fourier spectral theory to construct non-local receptive fields and perform cross-scale fusion efficiently. Suvorov et al. [31] proposed LaMa, a large-mask inpainting network that leverages FFCs to achieve a global receptive field even in early layers, enabling generalization to high-resolution images despite training on low-resolution data. Chaudhuri et al. [32] proposed a Fourier-Guided Attention (FGA) module that integrates FFC with attention mechanisms to capture both local and global context in crowd counting. Lyu et al. [33] introduced DFENet, an FFT-based supervised network for RGB-T salient object detection (SOD) that efficiently processes high-resolution bi-modal inputs; by leveraging frequency-domain modules such as FRCAB and CFL, DFENet achieves competitive performance with reduced memory usage. Furthermore, frequency-aware attention has shown promise in preserving both global and local information by analyzing feature maps in the frequency domain [34]: low-frequency components encode overall structure, while high-frequency details capture texture and edge information. Despite this, few works integrate frequency priors with multispectral data for detecting large, composite objects such as TPPs.
To tackle these limitations, we propose CFRANet (Cross-Modal Frequency-Responsive Attention Network), a novel dual-stream detection framework specifically designed for TPP detection in complex remote sensing scenes. CFRANet introduces two core modules: the Modality-Aware Fusion Block (MAFB), which selectively integrates RGB and NIR features to exploit modality-specific strengths, and the Frequency-Responsive Attention (FRA) module, which enhances feature representations via dual-branch attention in both spatial and frequency domains.
Our contributions are summarized as follows:
We propose CFRANet, a dual-stream multispectral detection framework that effectively leverages RGB and NIR modalities, tailored for detecting thermoelectric power plants with diverse and irregular structures.
We design a lightweight Modality-Aware Fusion Block (MAFB) to perform adaptive, hierarchical fusion of spectral features, enhancing cross-modal complementarity.
We introduce the Frequency-Responsive Attention (FRA) module, which integrates spatial and frequency-domain cues to simultaneously capture global structures and local textures.
We construct a new high-resolution multispectral dataset, AIR-MTPP, and demonstrate that CFRANet achieves state-of-the-art performance, with an average precision of 82.41%.
The remainder of this paper is organized as follows.
Section 2 describes the proposed CFRANet in detail.
Section 3 presents experiments and analysis.
Section 4 provides discussion, and
Section 5 concludes the paper.
2. Materials and Methods
2.1. Overview
To tackle the challenges of multispectral composite object detection, we propose CFRANet, a novel network designed to effectively exploit complementary information from RGB and NIR modalities. As shown in Figure 2, the architecture is composed of three main components: a dual-stream feature extraction backbone, a frequency-aware cross-modal attention mechanism, and a modality-aware loss. The overall pipeline consists of the backbone, cross-modal attention, a Feature Pyramid Network (FPN), and modality-specific detection heads.
In the beginning, separate ResNet branches are employed to extract modality-specific features, which are later enhanced and fused using the MAFB. To bridge the semantic gap between modalities and enable cross-modal feature interaction, MAFB uses an FRA module. This module combines spatial attention and frequency-domain decomposition to reinforce both global and local semantics. The fused multi-scale features are then processed by an FPN and finally passed through detection heads for object localization and classification. A modality-aware loss function guides the learning process by balancing performance on both RGB and NIR modalities. Although the network extracts modality-specific features using two separate ResNet branches and generates intermediate regression and classification predictions for both RGB and NIR modalities, it does not independently output two final detection results. Instead, modality-specific predictions are first decoded and filtered using Non-Maximum Suppression (NMS) individually. Subsequently, the retained detections from both modalities are merged, and a second round of NMS is applied to obtain a unified final detection output. This late fusion strategy leverages complementary information from both modalities while maintaining a single consistent output for downstream evaluation.
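The late-fusion step can be sketched in a few lines; the snippet below is a minimal illustration using torchvision's NMS, with function and variable names of our own choosing rather than the released implementation.

```python
import torch
from torchvision.ops import nms

def late_fusion_nms(rgb_boxes, rgb_scores, nir_boxes, nir_scores, iou_thr=0.5):
    """Per-modality NMS, then a second NMS on the merged detections.

    rgb_boxes / nir_boxes: (N, 4) tensors in (x1, y1, x2, y2) format;
    rgb_scores / nir_scores: (N,) confidence tensors.
    """
    # First round: suppress duplicates within each modality independently.
    keep_rgb = nms(rgb_boxes, rgb_scores, iou_thr)
    keep_nir = nms(nir_boxes, nir_scores, iou_thr)

    # Merge the retained detections from both modalities.
    boxes = torch.cat([rgb_boxes[keep_rgb], nir_boxes[keep_nir]], dim=0)
    scores = torch.cat([rgb_scores[keep_rgb], nir_scores[keep_nir]], dim=0)

    # Second round: cross-modal NMS yields a single unified output.
    keep = nms(boxes, scores, iou_thr)
    return boxes[keep], scores[keep]
```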
2.2. Dual-Stream Feature Extraction Backbone and Modality-Aware Fusion Block
To effectively capture complementary characteristics of visible and NIR imagery, we introduce a dual-stream backbone consisting of two independent ResNet branches: one dedicated to the RGB modality and another to the NIR modality. Both streams share the same architectural configuration (e.g., ResNet-50 or ResNet-101, depending on the chosen backbone), but maintain separate weights to preserve modality-specific representations. The four residual blocks, layer1 to layer4, correspond to the feature stages C2 to C5 in standard ResNet terminology. In this work, we focus on layer2, layer3, and layer4 (i.e., C3, C4, C5) for cross-modal feature fusion.
Each stream initially processes its input through a convolutional stem comprising a convolution, batch normalization, ReLU activation, and a max-pooling layer. This is followed by four sequential residual blocks, denoted as layer1 through layer4.
To facilitate semantic-level cross-modal interaction, we incorporate a MAFB module (Figure 2) after layer2, layer3, and layer4. At each of these levels, the corresponding RGB and NIR features are summed element-wise and passed through the FRA module to generate a shared cross-modal attention map. This attention map enhances both modality streams via channel-wise modulation.
Formally, let $F^{l}_{rgb}$ and $F^{l}_{nir}$ denote the RGB and NIR feature maps at level $l$. The shared fused feature is computed as:
$$F^{l}_{fuse} = F^{l}_{rgb} + F^{l}_{nir}.$$
This fused representation is then passed to the FRA module to generate the attention map:
$$A^{l} = \mathrm{FRA}\big(F^{l}_{fuse}\big),$$
which is used to modulate each modality-specific feature via:
$$\tilde{F}^{l}_{m} = F^{l}_{m} \odot \sigma\big(A^{l}\big), \quad m \in \{rgb, nir\},$$
where $\sigma$ denotes the sigmoid activation.
This shared attention mechanism allows the two modalities to benefit from common semantic cues while retaining modality-specific characteristics. The outputs of this dual-stream backbone are the modulated multi-scale features from both streams. These correspond to the C3, C4, and C5 stages in the traditional ResNet and are passed to a dual FPN for further refinement.
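A minimal PyTorch sketch of this shared-attention fusion at a single level is given below; the FRA module is passed in as a generic sub-module, and all names are illustrative rather than taken from the authors' code.

```python
import torch
import torch.nn as nn

class MAFB(nn.Module):
    """Sketch of the Modality-Aware Fusion Block at one feature level.

    `fra` is any module mapping a (B, C, H, W) feature to an attention map
    of the same shape (the FRA module in the paper); it is passed in as a
    constructor argument so the sketch stays self-contained.
    """
    def __init__(self, fra: nn.Module):
        super().__init__()
        self.fra = fra

    def forward(self, f_rgb, f_nir):
        f_fuse = f_rgb + f_nir                  # element-wise sum of the two streams
        attn = torch.sigmoid(self.fra(f_fuse))  # shared cross-modal attention map
        return f_rgb * attn, f_nir * attn       # modulate each modality-specific feature

# Usage with an identity placeholder standing in for FRA:
# mafb = MAFB(nn.Identity())
# f_rgb_hat, f_nir_hat = mafb(torch.randn(1, 256, 64, 64), torch.randn(1, 256, 64, 64))
```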
2.3. Frequency-Responsive Attention Module
To enhance cross-modal feature interaction in dual-stream networks and better integrate spatial-local and frequency-global information during feature extraction, we propose the FRA Module, as shown in Figure 3. The module incorporates both spatial convolution and frequency-domain processing in a dual-branch structure, effectively enhancing feature representation for downstream tasks. The full processing steps of the proposed FRA Module are summarized in Algorithm 1, which outlines the training procedure and feature fusion pipeline.
2.3.1. Dual-Branch Local and Global Feature Extraction
The FRA Module consists of two major branches: the Local Branch, which applies two convolutional layers to independently extract local spatial features and scales them by the complementary local weight, and the Global Branch, which employs one convolution followed by the LHF module to extract frequency-enhanced global features, scaled by the global weight.
Algorithm 1 FRA Module—Frequency-Responsive Attention Module Pipeline
FRA Module integrates frequency-enhanced global features and local CNN features to improve attention mechanisms.
Input: Feature map x, frequency ratio, global weight, max epochs T
Output: Trained FRA Module and fused feature map
1: Initialize parameters of the FRA Module
2: for epoch = 1 to T do
3: Extract local features from x with the local convolutional branch, scaled by the complementary local weight
4: Extract global convolutional features from x
5: Apply the LHF module to obtain frequency-enhanced global features, scaled by the global weight
6: Fuse the local and global convolutional features through the channel attention path
7: Fuse the local and frequency-enhanced global features through the spatial attention path
8: Concatenate both attention outputs and fuse them with a final convolution
9: end for
10: return trained FRA Module and fused feature map
2.3.2. Low-High Frequency Decomposition
To enhance global feature extraction, we incorporate a Fourier-based Low-High Frequency (LHF) module into the global attention branch. This module decomposes the input features in the frequency domain using a learnable frequency mask, explicitly separating low-frequency (global structure) and high-frequency (detail boundary) components. These frequency-specific features are then independently processed and adaptively fused. The motivation is that global semantic context and fine-grained texture cues often manifest in different frequency bands, and directly modeling this distinction facilitates more discriminative representation learning, especially for complex targets like TPPs. The core of the global branch is the LHF (
Figure 4), which performs low- and high-frequency decomposition in the frequency domain. Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, we apply a 2D Fourier Transform $\mathcal{F}(\cdot)$ to obtain complex-valued frequency features:
$$\hat{X} = \mathcal{F}(X).$$
A binary mask $M$ is constructed to separate the spectrum into low- and high-frequency components based on a ratio parameter:
$$\hat{X}_{low} = \hat{X} \odot M, \qquad \hat{X}_{high} = \hat{X} \odot (1 - M).$$
These components are then transformed back to the spatial domain via the inverse Fourier transform $\mathcal{F}^{-1}(\cdot)$, with only the real part retained:
$$X_{low} = \Re\big(\mathcal{F}^{-1}(\hat{X}_{low})\big), \qquad X_{high} = \Re\big(\mathcal{F}^{-1}(\hat{X}_{high})\big).$$
Each branch is followed by a convolutional block, and the outputs are concatenated and fused via a convolution to form the final output of the LHF module. In our experiments, the ratio is set to balance retaining essential low-frequency components, which capture overall structure and semantics, with preserving high-frequency details such as edges and textures.
The detailed process for low- and high-frequency decomposition in the frequency domain is described in Algorithm 2.
Algorithm 2 LHF—Low/High Frequency Separation
Input: Feature map x, frequency ratio
Output: Fused low/high frequency feature
1: Apply the 2D Fourier Transform to x
2: Create a binary mask M based on the frequency ratio
3: Obtain the low-frequency spectrum with M and the high-frequency spectrum with its complement
4: Apply the inverse Fourier Transform to each spectrum and retain the real part
5: Process each branch with a convolutional block, concatenate, and fuse via a convolution
6: return fused low/high frequency feature
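A compact PyTorch sketch of this decomposition is shown below. It assumes a centered low-frequency window whose relative size is given by the ratio and orthonormal FFT scaling; these details, along with the function name lhf_decompose, are our assumptions rather than the paper's exact specification.

```python
import torch

def lhf_decompose(x: torch.Tensor, ratio: float = 0.5):
    """Sketch of Fourier-based low/high-frequency separation.

    x: (B, C, H, W) real feature map. A centered binary mask keeps the
    lowest `ratio` fraction of frequencies as the low-frequency part;
    the complement gives the high-frequency part. Only the real part
    is kept after the inverse transform, as described in the text.
    """
    B, C, H, W = x.shape
    freq = torch.fft.fftshift(torch.fft.fft2(x, norm="ortho"), dim=(-2, -1))

    # Binary mask: ones inside a centered (ratio*H) x (ratio*W) window.
    mask = torch.zeros(H, W, device=x.device)
    h, w = int(H * ratio / 2), int(W * ratio / 2)
    mask[H // 2 - h:H // 2 + h, W // 2 - w:W // 2 + w] = 1.0

    low_freq = freq * mask
    high_freq = freq * (1.0 - mask)

    inv = lambda f: torch.fft.ifft2(torch.fft.ifftshift(f, dim=(-2, -1)), norm="ortho").real
    return inv(low_freq), inv(high_freq)
```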
2.3.3. Attention-Guided Feature Fusion
Two attention mechanisms guide the fusion of local and global features: Channel Attention (CA), where the local feature and the global convolutional feature are concatenated and processed through a convolution followed by a channel attention block, and Spatial Attention (SA), where the local feature and the frequency-enhanced global feature are concatenated and passed through another fusion block and a spatial attention module.
The outputs of both attention paths are concatenated and passed through a final convolution for fusion:
$$Y = \mathrm{Conv}\big(\mathrm{Concat}(Y_{CA}, Y_{SA})\big).$$
This architecture ensures that both frequency-aware global context and spatially detailed local patterns are effectively captured and integrated.
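The overall FRA computation can be sketched as follows, reusing the lhf_decompose function from the LHF sketch above. The SE-style channel attention, CBAM-style spatial attention, and the specific convolution choices are stand-ins we assume for illustration; the paper's exact blocks may differ.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention (stand-in for the paper's CA block)."""
    def __init__(self, ch, r=16):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(ch, ch // r), nn.ReLU(inplace=True),
                                nn.Linear(ch // r, ch), nn.Sigmoid())
    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))              # global average pool -> channel weights
        return x * w[:, :, None, None]

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention (stand-in for the paper's SA block)."""
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2)
    def forward(self, x):
        pooled = torch.cat([x.mean(1, keepdim=True), x.max(1, keepdim=True).values], dim=1)
        return x * torch.sigmoid(self.conv(pooled))

class FRA(nn.Module):
    """Sketch of the Frequency-Responsive Attention module (illustrative only)."""
    def __init__(self, ch, global_weight=0.5, ratio=0.5):
        super().__init__()
        self.gw, self.ratio = global_weight, ratio
        self.local = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                                   nn.Conv2d(ch, ch, 3, padding=1))
        self.global_conv = nn.Conv2d(ch, ch, 1)
        self.lhf_fuse = nn.Conv2d(2 * ch, ch, 1)     # fuse the low/high branches of LHF
        self.ca_fuse = nn.Conv2d(2 * ch, ch, 1)
        self.sa_fuse = nn.Conv2d(2 * ch, ch, 1)
        self.ca, self.sa = ChannelAttention(ch), SpatialAttention()
        self.out = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, x):
        local = (1.0 - self.gw) * self.local(x)                  # local spatial branch
        g = self.global_conv(x)                                  # global convolutional feature
        low, high = lhf_decompose(g, self.ratio)                 # LHF sketch defined above
        g_freq = self.gw * self.lhf_fuse(torch.cat([low, high], dim=1))
        ca_path = self.ca(self.ca_fuse(torch.cat([local, g], dim=1)))       # channel attention path
        sa_path = self.sa(self.sa_fuse(torch.cat([local, g_freq], dim=1)))  # spatial attention path
        return self.out(torch.cat([ca_path, sa_path], dim=1))
```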
2.4. Loss Function
To effectively leverage complementary information from both RGB and NIR modalities, we design a multi-component loss function that integrates modality-specific detection losses. This enables the model to optimize each modality independently while benefiting from joint training. Specifically, the total loss is defined as:
$$\mathcal{L}_{total} = (1-\lambda)\big(\mathcal{L}^{rgb}_{cls} + \mathcal{L}^{rgb}_{reg}\big) + \lambda\big(\mathcal{L}^{nir}_{cls} + \mathcal{L}^{nir}_{reg}\big),$$
where $\mathcal{L}_{cls}$ and $\mathcal{L}_{reg}$ denote the classification and regression losses used in RetinaNet [20], respectively. The hyperparameter $\lambda$ balances the contributions of the two modalities. In our experiments, $\lambda$ is set to emphasize RGB features while still leveraging the complementary information from NIR.
This loss formulation encourages the model to adaptively learn from both modalities and improves detection performance, particularly under challenging visual conditions.
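A minimal sketch of this modality-aware combination follows, assuming the weighted form reconstructed above; the default value of lam is illustrative only and does not reproduce the paper's chosen setting.

```python
def modality_aware_loss(cls_rgb, reg_rgb, cls_nir, reg_nir, lam=0.4):
    """Combine per-modality RetinaNet losses with a single weighting factor.

    cls_*/reg_* are the focal classification and box regression losses
    computed independently for each modality; `lam` scales the NIR
    contribution, and 1 - lam scales the RGB contribution.
    """
    loss_rgb = cls_rgb + reg_rgb
    loss_nir = cls_nir + reg_nir
    return (1.0 - lam) * loss_rgb + lam * loss_nir
```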
3. Results
3.1. Datasets
Since there are few open-source composite object detection datasets in RSIs, we constructed a high-resolution multispectral dataset named AIR-MTPP, which contains 481 images of TPPs. The images were carefully selected from 18 provinces and regions across China, including Xinjiang, Inner Mongolia, Ningxia, Heilongjiang, Jilin, Liaoning, Shaanxi, Shanxi, Hebei, Beijing, Tianjin, Shandong, Henan, Anhui, Jiangsu, Zhejiang, Guangdong, and Guizhou. These locations were chosen to ensure a diverse geographical distribution, covering plains, mountainous areas, and coastal zones. The primary data source is the Gaofen-6 (GF-6) satellite, and the acquisition period spans from January 2023 to March 2025. The dataset captures TPPs under various seasonal conditions (spring, summer, autumn, and winter), ensuring both spatial and temporal diversity. In total, 49 scenes were utilized, with 2024 as the main acquisition period. When image quality was insufficient (e.g., due to cloud cover or atmospheric distortion), imagery from 2023 or early 2025 was used as a supplement.
All raw scenes underwent a rigorous preprocessing pipeline including atmospheric correction, pansharpening fusion, and bit-depth adjustment. The resulting multispectral images have a 2-meter spatial resolution. Each sample was then cropped around the TPPs to a fixed image size covering a constant ground footprint. After strict quality control and filtering, we obtained 481 valid images. All objects were annotated with horizontal bounding boxes in VOC format, resulting in a well-structured, composite-object-focused dataset suitable for training and evaluating object detection models in complex remote sensing scenarios. The dataset was randomly divided into training, validation, and test sets with proportions of 81%, 9%, and 10%, respectively. Specifically, 90% of the data was allocated to the combined training and validation set, which was further split into 90% for training and 10% for validation. This ensures a balanced and representative partitioning of the dataset for model development and evaluation.
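The split procedure can be reproduced with a few lines of Python; the helper below is a sketch of the 90/10-then-90/10 protocol described above, with a seed and function name of our own choosing.

```python
import random

def split_dataset(image_ids, seed=0):
    """Hold out 10% for testing, then split the remaining 90% into
    90% training / 10% validation (i.e., 81% / 9% / 10% overall)."""
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    n_test = round(0.10 * len(ids))
    test, trainval = ids[:n_test], ids[n_test:]
    n_val = round(0.10 * len(trainval))
    val, train = trainval[:n_val], trainval[n_val:]
    return train, val, test

# For the 481 AIR-MTPP images this yields roughly 390 / 43 / 48 samples.
```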
3.2. Evaluation Metrics
To comprehensively evaluate our TPP detection method in practical engineering scenarios, we employ standard quantitative metrics, including precision, recall, average precision (AP), frames per second (FPS), and Giga floating-point operations (GFLOPs).
Precision and Recall: A predicted bounding box is considered a true positive (TP) if its intersection over union (IoU) with a ground-truth box exceeds a predefined threshold [35]; otherwise, it is a false positive (FP). Unmatched ground-truth boxes are false negatives (FN). The IoU is defined as:
$$\mathrm{IoU} = \frac{|G \cap D|}{|G \cup D|},$$
where G and D are the ground-truth and predicted boxes [36]. Precision and recall are then computed as:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}.$$
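For reference, these definitions translate directly into code; the small helpers below are a generic sketch, not tied to any particular evaluation toolkit.

```python
def box_iou(g, d):
    """IoU between a ground-truth box g and a detection d, both (x1, y1, x2, y2)."""
    ix1, iy1 = max(g[0], d[0]), max(g[1], d[1])
    ix2, iy2 = min(g[2], d[2]), min(g[3], d[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_g = (g[2] - g[0]) * (g[3] - g[1])
    area_d = (d[2] - d[0]) * (d[3] - d[1])
    return inter / (area_g + area_d - inter + 1e-9)

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    return tp / (tp + fp + 1e-9), tp / (tp + fn + 1e-9)
```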
AP and mAP: Average precision (AP) is the area under the precision–recall curve [35]. Mean average precision (mAP) averages AP across IoU thresholds from 0.5 to 0.95 in steps of 0.05 [37]. In this study, we adopt mAP50 (i.e., AP at an IoU threshold of 0.5) as the primary evaluation metric due to its widespread usage and interpretability in remote sensing object detection.
Efficiency Metrics: Frames per second (FPS) measures inference speed, GFLOPs (giga floating-point operations) reflects computational complexity, and the number of parameters (Params) indicates the model's size and memory requirements.
All models are evaluated under identical hardware settings for fair comparison.
3.3. Experimental Settings
All experiments are conducted using the PyTorch 1.11.0 deep learning framework on a workstation equipped with an NVIDIA RTX 3090 GPU (24 GB memory) and CUDA 11.3. To initialize the network, we adopt a ResNet-50 backbone pre-trained on the ImageNet dataset [38], which provides a good feature representation and accelerates convergence.
To balance the demands of large-scale scenes and the efficiency of deep network training, all input images are resized to a fixed resolution, with random cropping and horizontal flipping applied for data augmentation.
The model is trained using the Adam optimizer, with an initial learning rate of 0.001, a momentum term of 0.9, and a minimum learning rate of 0.00001. For the classification loss, we adopt the focal loss formulation following RetinaNet [20], with fixed values for the focusing parameter and the balancing factor.
To improve optimization, we adopt a cosine annealing learning rate schedule with a warm-up strategy. Specifically, the learning rate increases linearly to the initial value over the first 5% of total training iterations (warm-up phase), followed by a smooth cosine decay to the minimum learning rate. During the final 5% of iterations, the learning rate is fixed at the minimum value to ensure training stability. The learning rate is updated dynamically before each iteration according to this schedule.
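The schedule can be sketched as a simple per-iteration function; the warm-up starting value and interpolation details below are our assumptions, chosen to be consistent with the description above.

```python
import math

def lr_at_iter(it, total_iters, base_lr=1e-3, min_lr=1e-5,
               warmup_frac=0.05, hold_frac=0.05):
    """Linear warm-up over the first 5% of iterations, cosine decay to the
    minimum learning rate, then the minimum is held for the final 5%."""
    warmup = int(warmup_frac * total_iters)
    hold_start = int((1.0 - hold_frac) * total_iters)
    if it < warmup:                               # linear warm-up from ~0 to base_lr
        return base_lr * (it + 1) / max(1, warmup)
    if it >= hold_start:                          # final plateau at min_lr
        return min_lr
    # cosine decay between warm-up and the final plateau
    t = (it - warmup) / max(1, hold_start - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```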
Unless otherwise specified, the training is performed for a fixed number of epochs, and the learning rate follows the above schedule. All other hyperparameters are kept consistent across experiments for fair comparison.
3.4. Comparisons with State of the Art
To comprehensively evaluate the performance of our proposed CFRANet, we compare it with several state-of-the-art object detectors, covering both single-modality (RGB-only) and multi-modality (RGB+NIR) approaches. Among the single-modality methods, RetinaNet [20] is a one-stage detector that introduces Focal Loss to address the extreme foreground–background class imbalance, making it particularly suitable for dense object detection tasks; Faster R-CNN [21] is a classical two-stage detector that combines Region Proposal Networks (RPN) with Fast R-CNN, offering high accuracy at the cost of slower inference; SSD [19] is a fast one-stage detector that uses multiple feature maps for multi-scale detection, offering a good trade-off between speed and accuracy; and FCOS [39] is an anchor-free one-stage detector that predicts object centers and regresses boxes in a per-pixel fashion, effectively simplifying the detection pipeline. For multi-modality detectors, CSSA [40] utilizes channel switching and spatial attention mechanisms to fuse information from RGB and NIR modalities, aiming to improve cross-modal detection performance, while ICAFusion [23] employs a query-guided feature fusion framework, showing effectiveness in diverse multispectral scenarios. These two methods were chosen for their strong performance and widespread adoption as general multimodal detectors in remote sensing, providing meaningful baselines for comparison.
Table 1 presents a quantitative comparison of CFRANet with these representative object detectors on multispectral object detection tasks. Among all methods, our proposed CFRANet, which also falls into the multi-modality category, achieves the highest detection accuracy with an mAP50 of 82.41%, outperforming the second-best method, FCOS (73.36%), by a significant margin of 9.05%. This substantial improvement highlights the strong discriminative capability and cross-modal robustness of CFRANet, which benefits from its dual-stream architecture and the integration of the FRA module and MAFB for adaptive feature representation and fusion. While CFRANet incurs the highest computational complexity (384.12 GFLOPs) due to its dual-stream design and separate processing of RGB and NIR inputs, it still maintains a moderate inference speed of 30.96 FPS. Compared with other multi-modality methods such as ICAFusion (26.45 FPS), CFRANet offers better performance at comparable speed. Although single-modality detectors such as SSD achieve higher FPS (e.g., 176.77), they fall short in accuracy. Overall, the results demonstrate that CFRANet provides an effective trade-off between accuracy and efficiency, making it particularly suitable for remote sensing scenarios where robustness and precision in multispectral environments are crucial.
Figure 5 presents a qualitative comparison of detection results produced by different methods on six representative TPP images. These TPPs are located in diverse regions of China—specifically, Changchun (Jilin), Daqing (Heilongjiang), Dongguan (Guangdong), Guangzhou (Guangdong), and Huainan (Anhui)—covering a wide range of environmental and visual conditions, which enhances the credibility and representativeness of the comparison. Each column in the figure corresponds to one TPP image and includes the ground truth annotations alongside the detection results from RetinaNet, Faster R-CNN, SSD, FCOS, CSSA, ICAFusion, and the proposed CFRANet. As illustrated, CFRANet consistently delivers the most accurate and complete detections, demonstrating strong robustness and high precision across various complex scenes. In contrast, RetinaNet, SSD, FCOS, and ICAFusion exhibit varying degrees of missed detections, with SSD showing the highest miss rate. CSSA also presents a few missed and false detections. Meanwhile, Faster R-CNN tends to produce more false positives, leading to a higher false alarm rate. These qualitative results further validate the effectiveness and reliability of CFRANet in handling challenging multispectral remote sensing TPP detection tasks.
We also analyzed the correlation between the Params and FPS across the evaluated detectors. The Pearson correlation coefficient was −0.32 with a p-value of 0.4819, suggesting a weak and statistically insignificant negative correlation. This indicates that while model size may affect runtime to some extent, other architectural factors—such as model design, parallelizability, and feature fusion strategy—also play a crucial role in determining inference efficiency.
3.5. Ablation Study
3.5.1. Effect of MAFB
To verify the effectiveness of the proposed cross-modal fusion and attention modules, we conduct an ablation study based on RetinaNet using RGB and NIR modalities. The results are summarized in
Table 2, where all models are sorted by mAP50 in ascending order. We focus on analyzing the impact of different fusion locations and modality combinations on detection accuracy and inference speed.
Compared with the baseline RetinaNet (RGB), using only NIR images leads to a significant performance drop (69.63% vs. 72.13% mAP, −2.50%), indicating that NIR images alone lack sufficient discriminative information. Simply concatenating RGB and NIR modalities (concat fusion) improves mAP to 73.83% (+1.70%), demonstrating the potential of multi-modal fusion. However, this naive fusion strategy lacks fine-grained cross-modal interactions. To address this limitation, we progressively introduce our proposed MAFB into different feature levels (C3, C4, C5) of the RetinaNet backbone. The variants "MAFB@C5", "MAFB@C3,C5", "MAFB@C4,C5", and "MAFB@C3,C4" denote fusion at different layer combinations. We observe consistent improvements across all configurations, confirming that cross-modal interactions at multiple levels significantly boost detection performance. Notably, fusing only at C3 (MAFB@C3) achieves 79.74% mAP (+7.61%) with relatively low computational overhead, indicating that early-layer attention is particularly effective for fine-grained detail extraction. Using MAFB solely at C4 (MAFB@C4) also achieves strong performance (79.30% mAP), suggesting the importance of mid-level semantic features in modality fusion. In contrast, fusing only at C5 (MAFB@C5) yields a relatively modest improvement (75.50% mAP, +3.37%). This is primarily because high-level features in C5 are highly abstract and spatially coarse, which limits the effectiveness of cross-modal interactions; the NIR modality's strengths, such as preserving structural edges and enhancing contrast, are difficult to exploit at this stage due to the reduced spatial resolution and lack of fine detail. As a result, attention-based fusion at C5 alone cannot fully leverage the complementary advantages of RGB and NIR inputs. Interestingly, fusion at two levels does not always outperform single-level fusion at C3 or C4. For instance, MAFB@C3,C4 achieves 78.91% mAP, slightly lower than MAFB@C3. This may be due to redundancy or interference introduced when combining features from layers of different abstraction levels without sufficient semantic alignment: overlapping or conflicting attention from multiple fusion stages may disrupt the learning of distinctive cross-modal patterns, leading to sub-optimal performance. Moreover, mid- and high-level features may not capture complementary modality cues as effectively as lower-level features do, further limiting the benefit of multi-level fusion when it is not carefully designed.
Ultimately, fusion across all three levels (CFRANet) yields the best result of 82.41% mAP (+10.28%), validating the effectiveness of the proposed CFRANet, which fully exploits complementary information from both modalities across different semantic levels. In terms of inference speed, the introduction of attention modules does incur additional computational overhead, leading to a gradual reduction in FPS. Nevertheless, the accuracy gains outweigh this cost. CFRANet, despite being slightly slower (30.96 FPS), still maintains real-time performance on modern GPUs, making it a practical and effective solution for real-world multimodal detection tasks.
To provide further qualitative evidence of the effectiveness of MAFB, we visualize the Grad-CAM activation maps for two representative configurations: the baseline dual-stream model with simple RGB-NIR concatenation, and the full CFRANet model with MAFB applied across all three levels (C3, C4, and C5).
Figure 6 further demonstrates the effectiveness of the proposed MAFB fusion strategy through Grad-CAM visualizations on four representative TPPs located in Nanjing (Jiangsu), Zhengzhou (Henan), Shuozhou (Shanxi), and Jining (Shandong). Each row corresponds to one TPP sample, and the columns are arranged as follows: original RGB and NIR images (both with red bounding boxes indicating ground truth annotations), Grad-CAM heatmaps from the concat fusion model on the RGB stream and NIR stream, Grad-CAM heatmaps from CFRANet on the RGB stream and NIR stream. This comparison reveals that the concat-fusion model tends to activate broadly around background or irrelevant regions, whereas CFRANet produces more focused and discriminative responses, effectively highlighting the TPP structure in both modalities. This visual comparison further validates that MAFB enables more effective cross-modal feature interaction, facilitating precise localization and enhanced semantic understanding.
3.5.2. Effect of FRA Module
To further evaluate the effectiveness of the proposed cross-modal attention design, we replace the FRA module in the MAFB of CFRANet with one frequency-domain attention module and two widely used attention mechanisms, namely Fca (Frequency Channel Attention) [41], SE [25], and CBAM [26], and compare their performance within the same CFRANet architecture. Fca enhances channel attention by analyzing frequency-domain representations of feature maps: it applies the 2D Discrete Cosine Transform (DCT) to capture global dependencies and assigns higher weights to informative frequency components, thereby highlighting semantically important features. SE modules recalibrate channel-wise feature responses by explicitly modeling inter-channel dependencies through a lightweight gating mechanism, enhancing discriminative features while suppressing less informative ones. CBAM extends this idea by introducing both channel and spatial attention in a sequential manner, allowing the network to focus not only on what to emphasize (channel) but also on where to emphasize (spatial). As shown in Table 3, CFRANet with Fca replacing the FRA module yields 71.46% mAP50, indicating that although Fca effectively models frequency information, it may not fully capture the cross-modal interactions essential for the TPP detection task. CFRANet with SE replacing the FRA module achieves 76.25% mAP50, a modest improvement over the simple concat fusion. CFRANet with CBAM replacing the FRA module yields a higher mAP50 of 78.09%, thanks to its more comprehensive attention modeling. However, all three alternatives still fall short of the original CFRANet, which achieves the best performance of 82.41% mAP50. This result validates the advantage of CFRANet's modality-aware multi-scale fusion strategy over traditional intra-modal attention mechanisms. In terms of inference speed, the Fca, SE, and CBAM modules introduce only minor computational overhead, maintaining real-time performance (47.16, 41.59, and 41.00 FPS, respectively). Although CFRANet incurs a larger drop in FPS due to its cross-modal and multi-level attention design, it still operates at 30.96 FPS, which is adequate for real-time applications, while delivering significantly better detection accuracy. This demonstrates that CFRANet strikes an effective balance between performance and computational cost.
Figure 7 demonstrates the effectiveness of the proposed MAFB fusion strategy through Grad-CAM visualizations on four representative TPPs. Each row corresponds to one TPP and includes six columns: original RGB and NIR images (both with red bounding boxes indicating ground truth annotations), Grad-CAM heatmaps before fusion, and Grad-CAM heatmaps after applying MAFB fusion. It is evident that, after applying MAFB, both RGB and NIR modalities yield sharper and more localized activations around TPP structures. The post-fusion heatmaps exhibit higher concentration and clearer semantic focus, verifying the capability of MAFB to enhance cross-modal feature alignment and discriminative power.
Figure 8 illustrates the effect of substituting FRA with the Fca module. While the NIR stream shows relatively good attention concentration around TPP structures, the RGB stream exhibits minimal improvement, with Grad-CAM activations remaining diffuse and poorly localized. This indicates that Fca provides limited cross-modal enhancement, especially in the RGB pathway.
Figure 9 shows the visualizations when the FRA module in MAFB is replaced with the SE attention module. The results suggest that SE also improves the focus of Grad-CAM activation in many cases, especially in the NIR stream. However, the enhancement is generally less consistent compared to the original MAFB configuration. In contrast,
Figure 10 presents the results when replacing the FRA module with CBAM. The improvements in activation focus are less noticeable.
Overall, these observations highlight that the proposed FRA module, with its frequency-responsive design, provides more robust cross-modal enhancement compared to generic attention mechanisms such as Fca, SE and CBAM.
3.6. Hyperparameters
3.6.1. Effect of λ on Detection Performance
Figure 11 illustrates the impact of varying the hyperparameter λ on detection performance, as measured by mAP50, Precision, and Recall. As λ increases from 0 to 1, the contribution of the NIR modality to the total loss becomes more significant. When λ = 0, the model relies solely on RGB supervision, resulting in relatively high Recall but extremely low Precision, indicating a high false positive rate. Conversely, λ = 1, which corresponds to exclusive NIR supervision, also yields poor Precision despite maintaining full Recall.
The hyperparameter λ controls the weighting balance between the RGB loss and the NIR loss. Based on empirical ablation experiments, an intermediate value of λ was found to achieve the best overall performance. Specifically, RGB images typically contain richer spatial and structural cues, so slightly emphasizing the RGB loss helps improve detection accuracy, while retaining the contribution from the NIR modality supplements complementary spectral information. This setting realizes a reasonable fusion of complementary information from both modalities and effectively enhances overall model performance. At this intermediate value, the model achieves the maximum mAP50 together with balanced Precision and Recall, outperforming both single-modality baselines and demonstrating that moderate fusion of RGB and NIR supervision provides complementary benefits while striking an effective trade-off between modality-specific contributions. We therefore adopt this setting of λ in our experiments.
3.6.2. Effect of the Global-Branch Weight on Frequency-Responsive Attention
We evaluate the influence of the global-branch weighting parameter in the FRA Module by testing three configurations: local-only, balanced local–global fusion, and global-only. As illustrated in Figure 12, the best performance is achieved with the balanced setting, where the model attains the highest mAP50 along with balanced Precision and Recall. In the local-only configuration, the network relies solely on local spatial cues, which limits its ability to capture long-range dependencies and results in a lower mAP50. In contrast, the global-only configuration fully emphasizes the frequency-aware branch but omits crucial local details, leading to a further performance drop.
The FRA module consists of a local branch and a global frequency branch, with the weighting parameter balancing their respective contributions. Empirical results indicate that equal weighting between the local and global branches enables the model to better integrate local texture details and global structural information. From a theoretical perspective, this symmetric weighting simplifies optimization by avoiding bias toward either scale, thereby facilitating the complementary enhancement of multi-scale frequency features. This design aligns well with the module's objective of capturing complex multi-level features of the target objects. These results also confirm that both local spatial features and global frequency information are indispensable for robust representation learning. Accordingly, we adopt the balanced setting as the default in the FRA Module for our experiments.
3.6.3. Effect of the Low/High-Frequency Weighting on LHF
We further analyze the effect of the weighting factor in the LHF, which controls the relative importance of low- and high-frequency components during frequency-domain fusion. Specifically, we test three settings: low-frequency dominant, equal weighting, and high-frequency dominant. As illustrated in Figure 13, the best performance is obtained with the low-frequency-dominant setting, which achieves the maximum mAP50, Precision, and Recall. As greater emphasis is assigned to high-frequency features, a slight but consistent performance drop is observed. This trend suggests that while high-frequency components contain valuable texture and boundary information, overemphasizing them can lead to a loss of global structural cues. Conversely, prioritizing low-frequency information leads to more stable and discriminative representations.
These findings confirm that low-frequency global context plays a critical role in complementing local and high-frequency details within the LHF design. Therefore, we adopt the low-frequency-dominant setting as the default configuration in the LHF.
3.7. Cross-Domain Generalization Analysis
To further assess the generalization ability of the proposed CFRANet in diverse visual conditions, we also conduct experiments on the publicly available LLVIP dataset [42]. LLVIP is a standardized multi-modal benchmark specifically designed for various low-light visual tasks such as image fusion, pedestrian detection, and image-to-image translation. It contains 30,976 images (15,488 aligned visible–infrared pairs), most of which were captured in extremely dark environments. Pedestrian annotations are provided, making it a challenging yet representative benchmark for low-light pedestrian detection and multispectral fusion performance evaluation.
Table 4 compares the performance of CFRANet with two state-of-the-art multi-modality methods on the LLVIP dataset. Although our model was originally designed for TPP detection in complex multispectral remote sensing imagery, it still achieves competitive performance on LLVIP, with an mAP50 of 96.2%. This result demonstrates the robustness and adaptability of CFRANet to different cross-modal tasks. While ICAFusion achieves the highest mAP50 of 96.3% and the best overall mAP of 62.3%, CFRANet remains competitive with a comparable mAP50 and an overall mAP of 59.0%, outperforming CSSA. CFRANet's lower accuracy at stricter IoU thresholds (67.6% versus 71.7% for ICAFusion) suggests slightly less precise localization.
This marginal difference can be attributed to CFRANet’s architectural design, which is optimized for detecting structured composite objects such as TPPs in high-resolution satellite images. Despite not being tailored for pedestrian detection in low-light conditions, CFRANet’s strong performance on LLVIP validates the generalizability of its dual-stream cross-modal fusion framework.
4. Discussion
We propose a robust detection framework for TPPs with strong generalization across diverse and complex geographic environments. As shown in Figure 14, CFRANet accurately detects large, medium, and small-scale TPPs under challenging background conditions. In the spatial domain, TPPs typically occupy limited regions of an image. After Fourier transform, these localized features (e.g., smokestacks, cooling towers) are dispersed across the frequency spectrum, potentially causing aliasing when multiple TPPs are present. However, CFRANet remains effective in such cases due to two key factors: (1) deep networks can learn robust global frequency patterns associated with key TPP structures, even when spatially mixed; (2) the joint fusion of spatial and frequency-domain features helps resolve ambiguities by preserving local cues while leveraging global context. Furthermore, extensive experiments across 18 provinces and regions (Figure 15) confirm CFRANet's robustness and generalization ability. Notably, in the Guangdong example, two spatially separated TPPs within a single image are both correctly identified, demonstrating its capability to handle multi-instance and aliasing-prone scenarios.
These results underscore CFRANet’s effectiveness and reliability in real-world deployments across heterogeneous spatial and spectral environments.
Figure 16 illustrates common failure cases of our method, primarily involving TPPs with irregular shapes or dispersed internal components. These structures are often situated in dense urban areas, making them challenging to distinguish from surrounding buildings. Consequently, conventional rectangular annotations may not precisely outline the true extent of the plant, occasionally covering non-relevant structures. In future work, we aim to enhance the robustness of our framework by introducing component-level instance segmentation or implementing oriented bounding boxes to better capture object geometries. Furthermore, we also aim to enhance the robustness and efficiency of our framework by exploring model optimization strategies such as network pruning, quantization, or adopting lightweight backbone architectures (e.g., MobileNetV3, ShuffleNetV2). These approaches have the potential to reduce the computational footprint without significantly compromising detection accuracy.