MCC-Net: Efficient Dual-Attention Network for Infrared Small-Target Detection

Zhou, Xiaotian; Wang, Xin; Tian, Yan; Jiang, Kai; Guo, Min; Lian, Xuezheng; Ding, Lu; Zhang, Quanyu; Xue, Yaqi

doi:10.3390/rs18111858


by Xiaotian Zhou ¹,², Xin Wang ¹,², Yan Tian ¹,²,*, Kai Jiang ¹,², Min Guo ¹,², Xuezheng Lian ¹,², Lu Ding ¹,², Quanyu Zhang ¹,² and Yaqi Xue ¹,²

¹ Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an 710119, China
² University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(11), 1858; https://doi.org/10.3390/rs18111858

Submission received: 5 March 2026 / Revised: 3 May 2026 / Accepted: 3 June 2026 / Published: 5 June 2026

(This article belongs to the Special Issue New Insights in Remote Sensing Image Interpretation with Deep Learning)


Highlights

What are the main findings?

We propose MCC-Net, an innovative infrared small-target detection method featuring a streamlined architecture with a complementary spatial-channel dual-attention mechanism.
The proposed model integrates three innovative strategies, namely, Magnitude-Aware Linear Attention, Conditionally Parameterized Convolutions, and Conditional Cross-Channel Fusion, achieving superior detection performance while substantially reducing computational overhead.

What are the implications of the main findings?

The proposed method achieves state-of-the-art performance across multiple evaluation metrics and visual results on three public benchmark datasets.
MCC-Net demonstrates substantially lower computational complexity than state-of-the-art methods, enabling efficient deployment in resource-constrained scenarios.

Abstract

Recent years have witnessed the emergence of numerous U-shaped deep learning segmentation methods for infrared small-target detection (IRSTD). However, increasingly complex models still suffer from false and missed detections in challenging scenarios with cluttered backgrounds and weak targets while incurring escalating computational costs. To address these limitations, this paper proposes MCC-Net, a novel and efficient IRSTD framework that achieves superior detection performance with significantly reduced computational complexity. First, we integrate Magnitude-Aware Linear Attention (MALA) and Conditionally Parameterized Convolutions (CondConv) to replace conventional attention mechanisms in skip connections and standard convolutions, respectively, endowing the model with spatial contextual modeling and enhanced feature extraction capabilities at minimal computational overhead. Second, we design an innovative Conditional Cross-Channel Fusion (CondCCF) module that establishes a complementary spatial-channel dual-attention mechanism with MALA, enabling efficient multi-scale feature fusion. Extensive comparative and ablation experiments conducted on three public benchmarks—SIRST-v1, NUDT-SIRST, and IRSTD-1K—demonstrate that MCC-Net achieves state-of-the-art mIoU scores of 77.98%, 95.43%, and 70.46%, respectively, surpassing state-of-the-art methods by 1.07%, 1.95%, and 0.95%. MCC-Net also outperforms existing approaches across multiple evaluation metrics while maintaining substantially lower computational complexity.

Keywords:

IRSTD; feature fusion; attention mechanism; visualization; computational complexity

1. Introduction

Infrared small-target detection (IRSTD) has been widely deployed in critical applications such as border security [1,2], target surveillance [3,4,5], guidance systems, and missile interception [6,7]. Leveraging the all-weather operational capability and superior anti-interference characteristics of infrared imaging systems, IRSTD plays a pivotal role in detection tasks across diverse complex scenarios. However, infrared images inherently suffer from three fundamental challenges: low target signal-to-noise ratio, lack of texture information, and uncertain target shapes [8]. Consequently, accurate discrimination between the background and small targets in infrared images of complex environments is a challenge that has garnered significant research attention.

Early IRSTD research primarily employed model-driven traditional approaches, which encompass three distinct methodological categories. The first category comprises filtering and morphological techniques, exemplified by the top-hat transform [9] and max–median filtering [10], which demonstrate effectiveness primarily in high-contrast scenarios with simple backgrounds. The second category includes background suppression methods such as infrared patch-image (IPI) models [11] and the partial sum of singular values–based tensor nuclear norm (PSTNN) [12], which model the background and target as low-rank and sparse components, respectively, to suppress structured backgrounds. The third category consists of local-contrast-based approaches, including the local contrast measure (LCM) [13], the multiscale patch-based contrast measure (MPCM) [14], the weighted structured LCM (WSLCM) [15], and the temporal–local LCM (TLLCM) [16], which exploit intensity differences between targets and their local neighborhoods for efficient detection. However, these conventional methods suffer from inherent limitations—they rely heavily on hand-crafted features and prior assumptions, exhibit weak generalization capabilities, and demonstrate poor robustness. In practical applications, they often require time-consuming manual parameter tuning based on empirical knowledge.

Recent research in IRSTD has gradually shifted toward data-driven deep learning methods, leveraging their powerful feature extraction capabilities to enhance detection performance [17,18,19]. Considering the inherent characteristics of infrared images—particularly the lack of texture information—current deep learning approaches typically formulate IRSTD as a semantic segmentation task rather than conventional object detection, with detection results presented as binary mask images. Among these methods, U-shaped network architectures have demonstrated exceptional performance. Notably, the ACM [20] framework pioneered the integration of global channel attention mechanisms into feature fusion processes, achieving significant improvements and exerting substantial influence on subsequent research. Inspired by nested structures, UIU-Net [21] embeds smaller U-shaped structures within the backbone U-Net framework, while DNA-Net [22] proposes dense nested interaction modules. Both approaches incorporate attention mechanisms to enable efficient fusion and adaptive enhancement of shallow and deep features. MTU-Net [23] addresses the limited spatial correlation modeling capability of convolutional neural networks (CNNs) by introducing a multi-level vision transformer (ViT)-CNN hybrid encoder to enhance spatial contextual modeling.

As illustrated in Figure 1, while DNA-Net and UIU-Net employ sophisticated nested architectures to iteratively fuse and enhance multi-scale features, the repetitive feature extraction and fusion operations inherent in these nested structures often yield redundant feature representations. This redundancy not only obscures critical discriminative features but also introduces substantial computational overhead. MTU-Net routes features extracted from multiple decoder layers through hierarchical ViT-CNN modules for spatial context modeling. However, the effective features captured by these modules inevitably suffer from degradation or loss during propagation through the extended decoder pathways, while simultaneously incurring prohibitive computational costs [24,25,26,27]. Consequently, contemporary deep learning approaches for small-object detection continue to confront two fundamental challenges: (1) suboptimal detection performance for small objects against complex backgrounds and (2) escalating computational complexity driven by increasingly sophisticated model architectures.

Inspired by these three architectural paradigms, we propose MCC-Net, a streamlined yet high-performance model. Built upon the U-Net [28] baseline, our approach integrates attention mechanisms into skip connections, modifies feature fusion strategies, and employs novel convolutional methods. Compared to state-of-the-art deep learning methods, MCC-Net achieves superior detection performance while significantly simplifying model structure and reducing computational complexity. This study’s primary contributions to the field are fourfold:

(1): We propose MCC-Net, an efficient infrared small target detection framework incorporating a dual attention mechanism. The model employs deep supervision during training to accelerate convergence and mitigate the vanishing gradient problem.
(2): We integrate the MALA module into skip connections and introduce a unique combination of Local Enhanced Positional Encoding and Rotary Positional Embedding, with optimized computational pathways. This design enables the MALA module to simultaneously perceive global spatial context and capture local fine-grained features, demonstrating powerful spatial contextual modeling capabilities. The approach achieves high-quality feature extraction and precise localization while maintaining significantly lower computational complexity than conventional attention mechanisms.
(3): We design a novel Conditional Cross-Channel Fusion (CondCCF) module to replace traditional decoder architectures. This module fuses features extracted by each MALA layer—after channel attention processing—with deep-level features optimized through bilinear interpolation. This design establishes a complementary spatial-channel dual-attention mechanism in conjunction with the MALA module, facilitating efficient integration of deep and shallow features.
(4): We replace traditional convolutions with Conditionally Parameterized Convolutions (CondConv), effectively addressing the computational cost explosion that typically occurs when feature extraction capability is enhanced by increasing the size or number of kernels. This innovation enables MCC-Net to achieve superior target feature extraction with remarkably reduced computational complexity.

The remainder of this paper is organized as follows: Section 2 reviews related work in IRSTD, encompassing both traditional model-driven approaches and contemporary data-driven deep learning methods. Section 3 presents the MCC-Net architecture, first introducing the overall network structure and subsequently detailing the principles and functions of the three proposed enhancement strategies. Section 4 provides comprehensive experimental validation, including dataset descriptions, evaluation metrics, implementation details, comparative experiments, ablation studies, and in-depth analysis of results, which collectively demonstrate MCC-Net’s superiority over state-of-the-art methods. Finally, Section 5 concludes the paper with a detailed summary and discussion of future research directions.

2. Related Work

In this section, we categorize IRSTD methods into two main paradigms, namely, model-driven traditional approaches and data-driven deep learning methods, and we summarize the key advances in this field.

2.1. Model-Driven Traditional Methods

Early research on IRSTD predominantly employed model-driven traditional methods, which can be categorized into three distinct technical frameworks:

(1): Filtering-Based and Morphological Operation–Based Methods: These techniques leverage the characteristics of morphological operations to enhance the contrast between background and targets, thereby facilitating infrared small-target detection. Representative algorithms include the top-hat transform [9] and max–median filtering [10]. While these methods demonstrate satisfactory performance in high-contrast scenarios with simple backgrounds, they exhibit extreme sensitivity to noise and fail to achieve effective detection in complex background environments.
(2): Background Suppression Methods. Exemplified by algorithms such as infrared patch-image (IPI) models [11] and the partial sum of singular values–based tensor nuclear norm (PSTNN) [12], these approaches model the background and targets as low-rank and sparse components, respectively, and separate them through low-rank matrix recovery techniques. Although these methods effectively suppress structured backgrounds, they suffer from sensitivity to parameter configuration and high computational complexity, resulting in poor real-time performance.
(3): Local-Contrast-Based Methods. These methods, including the local contrast measure (LCM) [13], the multiscale patch-based contrast measure (MPCM) [14], the weighted structured LCM (WSLCM) [15], and the temporal–local LCM (TLLCM) [16], are inspired by human visual contrast mechanisms and detect small targets in infrared images by exploiting grayscale differences between targets and their local neighborhoods. Although these approaches feature intuitive algorithmic principles and simple structures, they also demonstrate weak background suppression capabilities, are prone to false alarms in edge regions owing to noise interference, and exhibit sensitivity to threshold and other parameter settings.

All three categories of traditional methods have certain limitations in common: First, they critically depend on hand-crafted features and prior assumptions, resulting in weak generalization capabilities and poor robustness. Second, in practical applications, they typically require extensive parameter tuning based on empirical knowledge, which is extremely time-consuming.

2.2. Data-Driven Deep Learning Methods

In contrast to traditional model-driven approaches—which rely on hand-crafted features and strong prior assumptions, thereby suffering from limited generalization and robustness—data-driven deep learning methods possess powerful feature extraction capabilities that automatically learn high-level semantic representations, enabling robust IRSTD in complex backgrounds. To address common challenges in infrared imagery such as a low target signal-to-noise ratio, lack of texture information, and uncertain target shapes, deep learning–based IRSTD methods typically formulate the detection task as semantic segmentation. These approaches commonly adopt U-shaped network structures enhanced with various novel attention mechanisms and feature fusion strategies to improve performance.

Among these methods, Asymmetric Contextual Modulation (ACM) [20] pioneered the integration of global channel attention into feature fusion mechanisms for IRSTD, enabling the modeling of channel-wise contextual information to suppress background clutter. ALCNet [29] extends channel attention with a bottom-up local attentional modulation (BLAM) module and combines traditional local contrast principles with deep learning, enhancing detection capability while preserving partial interpretability. RDIAN [30] employs a receptive field and a direction-induced attention mechanism that improves multi-scale target detection capabilities while addressing the class imbalance problem where background pixels significantly outnumber foreground pixels. ISTDU [17] enhances the U-Net structure by improving downsampling and skip connection mechanisms, integrating uncertainty estimation into IRSTD to improve reliability. MTU-Net [23] introduced a multi-level vision transformer (ViT)–CNN hybrid encoder to address CNN’s limitations in modeling spatial feature correlations, endowing the model with spatial contextual modeling capabilities and introducing multi-task learning to IRSTD for the first time. IAANet [31] and AGPCNet [32] not only effectively model long- and short-range dependencies between local details and global context through enhanced attention mechanisms but also achieve efficient integration of multi-scale features through improved fusion strategies, thereby enhancing the model’s ability to discriminate targets in complex backgrounds. DNA-Net [22] restructured skip connections into a dual-branch nested attention mechanism, significantly improving the model’s discriminative capability for multi-scale features and sensitivity to faint targets through iterative feature fusion and enhancement processes. UIU-Net [21] embeds smaller U-shaped structures within the backbone U-Net framework and uses attention mechanisms to fuse and adaptively enhance shallow and deep features.

Despite their promising performance, existing deep learning–based IRSTD methods still confront two fundamental challenges: their detection performance for small targets against complex backgrounds remains suboptimal and requires further enhancement, and the computational overhead continues to escalate because of increasingly sophisticated model architectures [33,34,35]. To address these limitations, we propose MCC-Net, an efficient framework that significantly reduces computational complexity while maintaining robust detection capabilities.

3. Method

3.1. Model Architecture

As illustrated in Figure 2, the structure of MCC-Net can be decomposed into three primary components: an Encoder, a Decoder, and Skip Connections.

The Encoder employs two specialized convolutional modules, CondCBL and Res-CondCBL, both incorporating Conditionally Parameterized Convolutions (CondConv). As depicted in Figure 3, these modules differ primarily in the number of convolutional blocks and the presence of residual connections. The Skip Connections utilize Magnitude-Aware Linear Attention (MALA), which provides powerful spatial contextual modeling capabilities at significantly reduced computational cost.

Within the Decoder section, we design an innovative Conditional Cross-Channel Fusion (CondCCF) module to effectively integrate and filter multi-scale features extracted from both the Skip Connections path and the Encoder–Decoder pathway. The detailed structure of CondCCF is shown in Figure 4. Its internal channelwise cross-attention (CCA) mechanism collaborates with the MALA used in Skip Connections to establish a complementary spatial-channel dual-attention framework. The convolutional module employed in CondCCF is CondCBR, which differs from the CondCBL in Figure 3 primarily in the activation function utilized.

To accelerate model convergence and mitigate gradient vanishing problems, we incorporate a deep supervision mechanism into MCC-Net. This mechanism transforms features from each decoder layer and the deepest encoder layer through the Deep Supervision Feature Extraction (DSFE) module shown in Figure 5, combining them to form multi-level prediction outputs. By computing losses between these multi-level predictions and ground truth labels and performing backpropagation, the model is forced to learn effective features at all hierarchical levels. During training, the deep supervision pathways in Figure 2 are activated to optimize training efficiency and performance. During inference, only the main inference pathway is utilized to reduce computational complexity and enhance detection speed.

3.2. Magnitude-Aware Linear Attention

Existing segmentation models often employ complex and diverse skip connection structures or integrate conventional attention mechanisms into skip pathways to enhance performance. However, the conventional Softmax Attention mechanism [36] suffers from quadratic computational complexity $O(N^{2})$, as formulated in Equation (1). This approach first generates query (Q), key (K), and value (V) representations through learnable linear projections, then computes pairwise attention weights between all query–key pairs to produce the output $Y_{i}$ for each token $X_{i}$.

$\begin{aligned} Q &= X W_{Q}, \quad K = X W_{K}, \quad V = X W_{V} \\ Y_{i} &= \sum_{j = 1}^{N} \frac{\mathrm{Sim}(Q_{i}, K_{j})}{\sum_{m = 1}^{N} \mathrm{Sim}(Q_{i}, K_{m})} V_{j} \end{aligned}$

(1)

where X denotes an input token sequence, and $W_{Q}$, $W_{K}$, and $W_{V}$ represent learnable weight matrices. The similarity function is defined as $\mathrm{Sim}(Q_{i}, K_{j}) = \exp(Q_{i} K_{j}^{T} / \sqrt{d})$, where d represents the feature dimension.
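As a minimal sketch of Equation (1), the following pure-Python function computes Softmax Attention for inputs that are assumed to be already projected into Q, K, and V (lists of row vectors). The function name `softmax_attention` is illustrative and not taken from the paper.

```python
import math

def softmax_attention(Q, K, V):
    """Equation (1) with Sim(q, k) = exp(q . k / sqrt(d)); cost grows as O(N^2)."""
    d = len(Q[0])
    out = []
    for q in Q:
        # One similarity per (query, key) pair: this nested loop is the quadratic term.
        sims = [math.exp(sum(a * b for a, b in zip(q, k)) / math.sqrt(d)) for k in K]
        total = sum(sims)
        weights = [s / total for s in sims]
        # Weighted sum of the value vectors, one output channel at a time.
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out
```

Every query visits every key, so doubling the number of tokens N roughly quadruples the work.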

To address the computational bottleneck, Linear Attention [37] replaces the Softmax function with a kernel-based decomposition, transforming the attention computation into a linear process with reduced complexity $O(N)$, as shown in Equation (2).

$Y_{i} = \frac{ϕ(Q_{i}) \left( \sum_{j = 1}^{N} ϕ(K_{j})^{T} V_{j} \right)}{ϕ(Q_{i}) \left( \sum_{m = 1}^{N} ϕ(K_{m})^{T} \right)}$

(2)

where $ϕ(\cdot)$ represents a kernel function designed to approximate the similarity function, i.e., $\mathrm{Sim}(Q_{i}, K_{j}) \approx ϕ(Q_{i}) ϕ(K_{j})^{T}$. Because the two sums over the keys do not depend on i, they are computed once and reused for every query.
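The reordering behind Equation (2) can be sketched as follows. The text does not fix a kernel, so the feature map `phi` below (elu(x) + 1) is an assumption chosen only because it is positive everywhere; `linear_attention` is likewise an illustrative name.

```python
import math

def phi(x):
    """Assumed feature map elu(x) + 1; the paper does not specify the kernel."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(Q, K, V):
    """Equation (2): key/value sums are accumulated once, so cost grows as O(N)."""
    d, dv = len(Q[0]), len(V[0])
    kv = [[0.0] * dv for _ in range(d)]  # sum_j phi(K_j)^T V_j, shape (d, dv)
    ksum = [0.0] * d                     # sum_m phi(K_m)^T, shape (d,)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            ksum[a] += fk[a]
            for c in range(dv):
                kv[a][c] += fk[a] * v[c]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * ksum[a] for a in range(d))
        out.append([sum(fq[a] * kv[a][c] for a in range(d)) / denom for c in range(dv)])
    return out
```

The single pass over the keys and the single pass over the queries replace the nested query-key loop of Softmax Attention.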

Nevertheless, Equation (3) reveals that this formulation disregards the magnitude information of $ϕ(Q_{i})$ after kernel mapping, consequently losing the amplitude-preserving characteristic inherent in Softmax. This limitation significantly compromises the local modeling capability of Linear Attention [38].

$Y_{i} = \frac{\vec{α_{i}} \left( \sum_{j = 1}^{N} ϕ(K_{j})^{T} V_{j} \right)}{\vec{α_{i}} \left( \sum_{m = 1}^{N} ϕ(K_{m})^{T} \right)}$

(3)

where $\vec{α_{i}}$ denotes the directional vector derived from the kernel-mapped query features $ϕ(Q_{i})$. Writing $ϕ(Q_{i}) = \| ϕ(Q_{i}) \| \, \vec{α_{i}}$ in Equation (2), the norm appears in both the numerator and the denominator and cancels, so only the direction of the query affects the output.

To enhance segmentation performance while optimizing computational efficiency, we introduce Magnitude-Aware Linear Attention (MALA) [39] as a replacement for conventional skip connection structures. MALA incorporates an amplitude-aware mechanism, mathematically expressed in Equation (4), which introduces two dynamically computed parameters: a scaling factor $β$ and an offset term $γ$. When the magnitude of $ϕ(Q_{i})$ increases, $β$ and $γ$ adaptively adjust to exponentially amplify the attention score ratio. This design not only resolves the magnitude neglect issue in Linear Attention but also mitigates the excessively sharp attention distribution characteristic of Softmax Attention, achieving more balanced attention allocation.

$\begin{aligned} Y_{i} &= β \, ϕ(Q_{i}) \sum_{j = 1}^{N} ϕ(K_{j})^{T} V_{j} - γ \sum_{j = 1}^{N} V_{j} \\ β &= 1 + \frac{1}{ϕ(Q_{i}) \sum_{m = 1}^{N} ϕ(K_{m})^{T}} \\ γ &= \frac{ϕ(Q_{i}) \sum_{m = 1}^{N} ϕ(K_{m})^{T}}{N} \end{aligned}$

(4)
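To make Equation (4) concrete, the sketch below expands it into per-key weights $β \, ϕ(Q_{i}) ϕ(K_{j})^{T} - γ$ for a single query. It operates on features that are assumed to be already kernel-mapped and positive; the names `mala_weights` and `mala_output` are illustrative, not the authors' implementation.

```python
def mala_weights(phi_q, phi_K):
    """Per-key weights implied by Equation (4) for one kernel-mapped query."""
    n = len(phi_K)
    # scores[j] = phi(Q_i) phi(K_j)^T
    scores = [sum(a * b for a, b in zip(phi_q, k)) for k in phi_K]
    s = sum(scores)          # phi(Q_i) sum_m phi(K_m)^T
    beta = 1.0 + 1.0 / s     # scaling factor
    gamma = s / n            # offset term
    return [beta * sc - gamma for sc in scores]

def mala_output(phi_q, phi_K, V):
    """Y_i = beta * phi(Q_i) sum_j phi(K_j)^T V_j - gamma * sum_j V_j."""
    weights = mala_weights(phi_q, phi_K)
    return [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
```

With $ϕ(Q_{i}) = (1, 1)$ and keys $(1, 0)$ and $(0, 3)$, the weights are $-0.75$ and $1.75$; doubling the query to $(2, 2)$ gives $-1.75$ and $2.75$. Both pairs sum to 1 by construction, yet they differ, whereas Equation (2) would assign $0.25$ and $0.75$ in both cases.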

Figure 6 illustrates the feature processing pipeline of the MALA module, whose design is based on the MALA mechanism. The module maintains consistent tensor dimensions between input and output feature maps, both shaped as $(B, C, H, W)$, where B denotes batch size; C represents channel dimension; and H and W correspond to the height and width of individual feature maps, respectively. For clarity in demonstrating dimensional relationships throughout the processing flow, the input tensor dimensions in Figure 6 are configured as $(1, 1, 3, 3)$, with the sole hyperparameter, the number of attention heads, set to 1.

The MALA module incorporates two critical positional encoding mechanisms, namely, Local Enhanced Positional Encoding and Rotary Positional Embedding, which inject local and global positional information, respectively, into the computation. The core computational process of MALA consists of Linear Attention Kernelized Computation and Amplitude-Aware Attention Aggregation. This design extends conventional linear attention by explicitly incorporating magnitude information Z into the attention mechanism. Furthermore, the Gated Modulation Projection stage introduces tensor O as a gating signal.

Compared to Softmax attention, MALA reduces the computational complexity from quadratic to linear in the number of tokens; compared to standard linear attention, it retains the query magnitude information and therefore stronger spatial modeling capabilities. When integrated with skip connections, the MALA module does more than enable spatial contextual modeling of features extracted by the encoder: its dual-path architecture, which combines local detail capture with global spatial awareness, facilitates the extraction of high-quality features at precise target locations, thereby achieving accurate background suppression and target extraction.

3.3. Conditionally Parameterized Convolutions

Traditional convolutional operations often face a trade-off between feature extraction capability and computational efficiency. To enhance representational power, conventional approaches typically increase kernel size or the number of kernels, which inevitably leads to substantial computational overhead. To address this limitation, this paper introduces Conditionally Parameterized Convolutions (CondConv) [40] as an efficient alternative to standard convolution. As illustrated in Figure 7, CondConv can be viewed as a linear Mixture of Experts model, even as it maintains significantly lower computational complexity than traditional Mixture of Experts approaches.

The core innovation of CondConv lies in its dynamic routing mechanism. For each input sample, the routing function generates a unique set of weighting coefficients that adaptively combine multiple convolutional kernels. The standard convolution operation is formulated in Equation (5), while the CondConv transformation is defined in Equation (6):

$Y = X * W$

(5)

$Y = \sum_{i = 1}^{M} r_{i}(X) \, (X * W_{i})$

(6)

where $*$ denotes convolution, M is the number of expert kernels, $W_{i}$ denotes the ith convolutional kernel in the ensemble, and $r_{i}(X)$ represents the corresponding adaptive weight generated by the routing function for input X. The routing weights are computed as follows:

$r_{i}(X) = σ(\mathrm{GAP}(X) \cdot R)$

(7)

where $σ$ denotes the sigmoid activation function, GAP represents global average pooling, and R is a learnable routing matrix optimized during training.
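Equations (5) to (7) can be illustrated with a deliberately simplified sketch: a single-channel 1-D signal, so that GAP(X) is a scalar and the routing matrix R reduces to one entry per expert. All function names are hypothetical. Because convolution is linear in the kernel, mixing the expert outputs as in Equation (6) gives the same result as mixing the kernels first and convolving once, which is what keeps the cost close to that of a single convolution.

```python
import math

def conv1d(x, w):
    """Valid 1-D cross-correlation of signal x with kernel w."""
    k = len(w)
    return [sum(x[i + t] * w[t] for t in range(k)) for i in range(len(x) - k + 1)]

def routing_weights(x, R):
    """Equation (7) for a single-channel signal: r_i = sigmoid(GAP(x) * R_i)."""
    gap = sum(x) / len(x)
    return [1.0 / (1.0 + math.exp(-gap * r)) for r in R]

def condconv_mix_outputs(x, experts, R):
    """Equation (6) as written: run every expert, then mix the outputs."""
    r = routing_weights(x, R)
    outs = [conv1d(x, w) for w in experts]
    return [sum(ri * o[p] for ri, o in zip(r, outs)) for p in range(len(outs[0]))]

def condconv_mix_kernels(x, experts, R):
    """Equivalent form: mix the kernels first, then run a single convolution."""
    r = routing_weights(x, R)
    mixed = [sum(ri * w[t] for ri, w in zip(r, experts)) for t in range(len(experts[0]))]
    return conv1d(x, mixed)
```

For any input, `condconv_mix_outputs` and `condconv_mix_kernels` agree up to floating-point rounding, but the second performs one convolution instead of M.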

In this work, we set the number of expert convolutional kernels to 4. Building upon the CondConv framework, we design three distinct convolutional modules, as depicted in Figure 8: CondCBL, Res-CondCBL, and CondCBR. The CondCBL and Res-CondCBL modules are deployed in the encoder pathway, while CondCBR operates in the decoder. The former two employ LeakyReLU activation functions, whereas CondCBR utilizes a ReLU function. Notably, the Res-CondCBL module, positioned in deeper network layers, incorporates residual connections and additional convolutional layers to enhance feature extraction capabilities, particularly for improving small target discrimination against complex backgrounds.

3.4. Conditional Cross-Channel Fusion Module

To enable efficient multi-scale feature fusion and extraction in the decoder pathway, we design a Conditional Cross-Channel Fusion (CondCCF) module, an architectural overview of which is depicted in Figure 4. The module employs CondCBR convolutional blocks as fundamental building blocks. Distinct input features undergo upsampling and channelwise cross-attention (CCA) [41] processing, respectively, followed by channelwise concatenation. The fused features are then processed through two consecutive CondCBR layers to generate the final decoder output features.

Considering the deepest decoder layer as an illustrative example, the encoder-derived feature map E5 is upsampled using bilinear interpolation to double its spatial resolution, ensuring dimensional consistency with the corresponding skip connection feature S4 for subsequent fusion operations. The skip connection feature S4 undergoes transformation via the CCA mechanism, whose computational principle is illustrated in Figure 9. This attention mechanism learns channelwise importance weights, enabling adaptive feature selection at the channel level. The processed features from both pathways are concatenated and fed into CondCBR layers, ultimately producing the decoder output feature D4.

The CondCCF module, when integrated with the MALA mechanism employed in skip connections, establishes a complementary spatial-channel dual-attention framework. This synergistic combination enhances the model’s capacity to prioritize small target features while effectively mitigating the class imbalance problem inherent in infrared small-target detection, where background pixels typically dominate foreground pixels by significant margins.

4. Experiments and Analysis

4.1. Dataset

All experiments in this paper were conducted on three publicly available benchmark datasets: SIRST-v1 [20], NUDT-SIRST [22], and IRSTD-1K [42]. Prior to training, all input images and their corresponding binary mask labels were zero-padded to square dimensions as illustrated in Figure 10. Table 1 summarizes the key statistical characteristics of each dataset, including the total number of images, the resolution after padding, and the proportion allocated to the training set.
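The padding step can be sketched as below. This is only an illustration under stated assumptions: the target side length is taken as max(H, W) and the zeros are appended at the bottom and right, whereas the placement actually used in Figure 10 is not specified in the text. The helper name `pad_to_square` is hypothetical.

```python
def pad_to_square(image, fill=0):
    """Zero-pad a 2-D list of rows to side max(H, W), padding bottom and right."""
    h, w = len(image), len(image[0])
    side = max(h, w)
    # Extend each existing row on the right, then append full rows at the bottom.
    padded = [list(row) + [fill] * (side - w) for row in image]
    padded += [[fill] * side for _ in range(side - h)]
    return padded
```

The same function would be applied to an image and to its binary mask so that the two stay pixel-aligned.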

4.2. Evaluation Indicators

To comprehensively evaluate the performance of our model, we employ a suite of standard metrics commonly used in computer vision, namely, mIoU, Pd, Fa, and F-measure, complemented by FPS for inference speed and AUC for ranking quality.

(1): mIoU (mean Intersection over Union): As a widely adopted pixel-level evaluation metric, mIoU quantifies the spatial agreement between the predicted segmentation mask and the ground truth by computing the average IoU across all semantic classes. For binary segmentation tasks, it is defined as in Equation (8):

$mIoU = \frac{| P \cap G |}{| P \cup G |}$

(8)

where P and G denote the predicted and ground truth binary masks, respectively.
(2): Pd (Probability of Detection): Pd measures the fraction of ground truth targets that are successfully detected. To account for minor localization inaccuracies—often induced by the point spread function (PSF)—a tolerance radius r is introduced around each ground truth target center. A detection is considered correct if it falls within this radius [22]. In this work, we set $r = 3$ pixels. Formally, Pd is computed as the ratio of true positives (TP) to the total number of ground truth targets (ALL), as shown in Equation (9):

$Pd = \frac{TP}{ALL}$

(9)

where TP denotes the number of correctly detected true targets, and ALL denotes the total number of ground truth targets.
(3): Fa (False-Alarm Rate): Fa characterizes the density of spurious detections, defined as the number of false-alarm pixels normalized by the total number of background pixels. Specifically, it is given by Equation (10):

$Fa = \frac{P_{fa}}{P_{all}}$

(10)

where $P_{fa}$ denotes the number of non-target pixels erroneously predicted as targets and $P_{all}$ represents the total number of background pixels in the image.
(4): F (F-measure): The F-measure provides a balanced assessment of a model’s precision and recall through their harmonic mean. It mitigates the bias that may arise when optimizing for either metric in isolation, e.g., overly conservative predictions (high precision, low recall) or overly aggressive ones (high recall, low precision). With $β = 1$ , the F1-score gives equal weight to precision and recall, as formulated in Equation (11).

$F\text{-measure} = \frac{(1 + β^{2}) \cdot Pr \cdot Re}{β^{2} \cdot Pr + Re} \quad (β = 1)$

(11)

where Pr denotes precision and Re denotes recall, with their computational formulas defined in Equation (12).

$\begin{aligned} Pr &= \frac{TP}{TP + FP} \\ Re &= \frac{TP}{ALL} \end{aligned}$

(12)

where TP denotes the number of correctly detected true targets, FP denotes the number of false positive detections, and ALL denotes the total number of ground truth targets.
(5): FPS (frames per second): FPS serves as a fundamental performance metric for evaluating model inference speed, defined as the number of image frames that an algorithm or system can process within a unit time interval of one second.
(6): AUC (area under the curve): The AUC represents the area under the ROC curve, with a value range of [0,1]. It serves as a scalar metric for evaluating the ranking quality and discriminative capability of binary classification models.
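The pixel-level and target-level metrics above can be sketched as follows. Masks are assumed to be flat lists of 0/1 values, and detections and ground truth targets are assumed to be represented by (row, column) centers. The greedy one-to-one matching in `match_targets` is an assumption made for this illustration, since the text only states the tolerance radius; all function names are hypothetical.

```python
import math

def iou(pred, gt):
    """Equation (8) on flat 0/1 masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0

def false_alarm_rate(pred, gt):
    """Equation (10): false-alarm pixels over background pixels."""
    p_fa = sum(1 for p, g in zip(pred, gt) if p and not g)
    p_all = sum(1 for g in gt if not g)
    return p_fa / p_all if p_all else 0.0

def match_targets(pred_centers, gt_centers, r=3.0):
    """Return (TP, FP): each ground truth center claims one unused detection within r."""
    unused = list(pred_centers)
    tp = 0
    for gy, gx in gt_centers:
        for c in unused:
            if math.hypot(c[0] - gy, c[1] - gx) <= r:
                unused.remove(c)
                tp += 1
                break
    return tp, len(unused)

def pd_and_f1(pred_centers, gt_centers, r=3.0):
    """Equations (9), (11), and (12) with beta = 1."""
    tp, fp = match_targets(pred_centers, gt_centers, r)
    all_targets = len(gt_centers)
    pd = tp / all_targets if all_targets else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = pd  # Re = TP / ALL, the same ratio as Pd
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return pd, f1
```

For example, predictions `[1, 1, 0, 0]` against ground truth `[0, 1, 1, 0]` share one pixel out of a union of three, giving an IoU of 1/3 and a false-alarm rate of 1/2.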

4.3. Experimental Details

All experiments were conducted on an NVIDIA GeForce RTX 4090 GPU with Python 3.8.20, PyTorch 1.13.1, and CUDA 11.7. The model was trained using binary cross-entropy (BCE) loss [43] as defined in Equation (13), optimized with the Adam optimizer [44] at an initial learning rate of $1 \times 10^{-3}$ following a cosine annealing schedule [45]. To prevent overfitting, we employed geometric transformations (including random flipping and rotation) and spatial transformations as data augmentation strategies during training. Model weights and biases were initialized using Kaiming initialization [46]. The maximum number of training epochs was set to 1000. For inference, a fixed threshold of 0.5 was applied to segment the saliency maps and generate binary detection results.

$BCE(y, \hat{y}) = -[y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y})]$

(13)

where $y \in \{0, 1\}$ represents the ground truth label for a pixel (with 1 indicating target pixels and 0 indicating background), while $\hat{y} \in [0, 1]$ denotes the predicted probability that the pixel belongs to the target class.
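The training and inference settings described in this subsection can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the clamping constant `eps`, the minimum learning rate `lr_min`, the per-epoch update without restarts, and the strict comparison against the threshold are all assumptions that the text does not specify.

```python
import math


def bce(y, y_hat, eps=1e-7):
    """Per-pixel binary cross-entropy, as in Equation (13).

    eps clamps y_hat away from 0 and 1 to avoid log(0); this clamp is an
    assumption of the sketch.
    """
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


def cosine_annealing_lr(epoch, total_epochs=1000, lr_init=1e-3, lr_min=0.0):
    """Cosine-annealed learning rate, updated per epoch (lr_min is assumed)."""
    cos_term = 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))
    return lr_min + (lr_init - lr_min) * cos_term


def binarize(saliency, threshold=0.5):
    """Turn a 2-D saliency map (nested lists of probabilities) into a 0/1 mask."""
    return [[1 if p > threshold else 0 for p in row] for row in saliency]


# Hypothetical 2x2 saliency map and ground truth, for illustration only.
saliency = [[0.9, 0.2], [0.4, 0.7]]
target = [[1, 0], [0, 1]]
loss = sum(
    bce(y, p) for row_y, row_p in zip(target, saliency) for y, p in zip(row_y, row_p)
) / 4
mask = binarize(saliency)
print(loss, mask, cosine_annealing_lr(0), cosine_annealing_lr(500))
```

In this sketch the learning rate starts at the stated $1 \times 10^{-3}$ and decays smoothly towards `lr_min` at epoch 1000. A PyTorch training loop would normally rely on the framework's built-in loss and scheduler instead.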

4.4. Comparative Experiments

The detection metrics of various algorithms are presented in Table 2. Overall, data-driven deep learning approaches perform significantly better than model-driven traditional methods. MCC-Net achieves the highest mIoU and F-measure on all three public datasets, which indicates that it both discriminates targets from background accurately at the pixel level and strikes a good balance between precision and recall. Furthermore, MCC-Net attains the highest probability of detection (Pd) on both the SIRST-v1 and NUDT-SIRST datasets, while maintaining false-alarm rates (Fa) that, although not the lowest, remain lower than those of most competing algorithms.

To further evaluate the performance of our proposed algorithm, we conducted a comprehensive comparison of ROC curves and AUC values across different deep learning methods on three benchmark datasets: SIRST-v1, NUDT-SIRST, and IRSTD-1K. As shown in Figure 11 and Table 3, MCC-Net consistently outperforms competing algorithms across all three datasets. The ROC curves of MCC-Net are positioned closest to the top left corner, and the method achieves the highest AUC values, demonstrating its exceptional discriminative capability in infrared small-target detection tasks.
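For readers who want to reproduce this kind of comparison, the sketch below shows one common way to obtain an ROC curve and its AUC from per-pixel scores: sweep a threshold over the sorted scores and integrate with the trapezoidal rule. The function names and the toy data are hypothetical, and the exact thresholding protocol used for Figure 11 and Table 3 is not stated in the text, so this should be read as a generic illustration only.

```python
def roc_curve(scores, labels):
    """Return (fpr, tpr) lists obtained by sweeping a threshold over the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for rank, i in enumerate(order):
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample in a group of tied scores.
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            fpr.append(fp / neg)
            tpr.append(tp / pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
        for k in range(1, len(fpr))
    )


# Hypothetical per-pixel scores and ground-truth labels.
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]
fpr, tpr = roc_curve(scores, labels)
print(auc(fpr, tpr))
```

A curve that hugs the top left corner corresponds to a high true-positive rate at a low false-positive rate, which is why it yields an AUC close to 1.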

We visualize the detection results of three representative algorithms alongside MCC-Net on the three public datasets, as illustrated in Figure 12. MCC-Net consistently outperforms other methods. As shown in Figure 12(1,4), when confronted with challenging scenarios featuring highly complex backgrounds and weak target–background contrast, MCC-Net successfully detects all targets without generating false alarms, unlike competing methods. Figure 12(2,3) reveal that in extremely challenging conditions with complex backgrounds, small targets located at image boundaries, or multiple closely spaced targets, other algorithms miss detections, whereas MCC-Net successfully identifies all targets. Moreover, across all detection results, MCC-Net’s output morphology most closely aligns with the ground truth annotations. These observations collectively demonstrate that our proposed method possesses robust global and local modeling capabilities, coupled with exceptional fine-grained segmentation performance.

To validate the computational efficiency of MCC-Net, we conducted a comprehensive comparison with state-of-the-art methods including UIU-Net, DNA-Net, SCTransNet, and HDNet across three benchmark datasets. The evaluation metrics encompass floating-point operations (FLOPs), parameter count (Params), and frames per second (FPS). As demonstrated in Table 4, MCC-Net achieves state-of-the-art performance across all metrics on the three public datasets while maintaining exceptionally low computational complexity. Specifically, MCC-Net exhibits substantially fewer FLOPs and parameters than existing methods even as it achieves significantly higher FPS. These results substantiate the effectiveness of our design strategy: replacing conventional attention mechanisms and standard convolutions with MALA and CondConv not only dramatically reduces computational overhead but also enhances the model’s ability to discriminate small targets from complex backgrounds.
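To make the size of the efficiency gap concrete, the short script below computes, for each competing method, how many times more parameters and FLOPs it uses than MCC-Net and how much faster MCC-Net runs. The numbers are copied from Table 4, with FPS taken from the SIRST-v1 column. The dictionary layout and variable names are illustrative choices only.

```python
# (Params in M, FLOPs in G, FPS on SIRST-v1), copied from Table 4.
table4 = {
    "DNA-Net": (4.70, 14.26, 12),
    "UIU-Net": (45.22, 39.32, 11),
    "SCTransNet": (11.69, 40.46, 16),
    "HDNet": (3.68, 22.73, 33),
    "MCC-Net": (0.64, 1.54, 63),
}

ours = table4["MCC-Net"]
ratios = {}
for name, (params, flops, fps) in table4.items():
    if name == "MCC-Net":
        continue
    # Parameter ratio, FLOPs ratio, and MCC-Net speed-up relative to this method.
    ratios[name] = (params / ours[0], flops / ours[1], ours[2] / fps)
    print(
        f"{name}: {ratios[name][0]:.1f}x params, {ratios[name][1]:.1f}x FLOPs, "
        f"MCC-Net is {ratios[name][2]:.1f}x faster"
    )
```

With these figures, every compared method has more than 5 times the parameters and more than 9 times the FLOPs of MCC-Net.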

From the comprehensive comparative experiments presented above, it is evident that among existing algorithms, DNA-Net, UIU-Net, and SCTransNet achieve the second best detection performance on different datasets, yet they still exhibit notable limitations. Specifically, while the complex nested architectures employed by DNA-Net and UIU-Net enhance effective feature extraction, their inherent repetitive feature extraction and fusion operations often generate redundant feature representations. Such redundancy not only tends to obscure critical discriminative features—leading to false positives and missed detections—but also neglects essential edge-related information, resulting in poor contour alignment between detected targets and ground truth boundaries. Furthermore, these nested structures incur substantial computational overhead. Although SCTransNet replaces complex nested architectures with multi-scale attention mechanisms, it fails to demonstrate significant performance improvements over DNA-Net and UIU-Net, while still maintaining prohibitively high computational complexity.

In contrast, MCC-Net not only surpasses all competing methods in detection performance but also achieves precise detection in complex scenarios where other algorithms frequently produce false positives or missed detections. Moreover, MCC-Net generates target contours that exhibit the closest alignment with ground truth boundaries. Crucially, its exceptionally streamlined architecture results in significantly reduced computational complexity, establishing a superior trade-off between performance and efficiency.

4.5. Ablation Experiments

Ablation studies were conducted on the SIRST-v1, NUDT-SIRST, and IRSTD-1K datasets to validate the effectiveness of individual improvement strategies. To accurately assess the contribution of each component, we first established a baseline using the U-Net structure, then incrementally incorporated different combinations of proposed modules to evaluate their impact on detection performance. The mean Intersection over Union (mIoU), floating-point operations (FLOPs), parameter count (Params), and frames per second (FPS) for all experimental configurations are summarized in Table 5.

Table 5 shows that every individual component and every combination of components improves mIoU over the baseline on all three datasets, validating the effectiveness of each proposed component. Notably, MCC-Net, which integrates all three improvement strategies, achieves the highest mIoU on all three datasets. Compared to the baseline U-Net model without any enhancements, MCC-Net improves the mean Intersection over Union (mIoU) by 6.47, 20.56, and 10.52 percentage points on the SIRST-v1, NUDT-SIRST, and IRSTD-1K datasets, respectively. Furthermore, MCC-Net outperforms the second-best configurations by 2.47, 2.52, and 2.40 percentage points in terms of mIoU on the corresponding datasets. These substantial improvements indicate that the combination of the three strategies significantly enhances the model’s ability to detect small targets under complex background conditions. Additionally, comparison of the computational metrics (FLOPs, Params, and FPS) shows that MALA and CondCCF add only modest computational overhead, whereas CondConv, a lightweight convolutional operation applied throughout the entire inference pipeline, substantially reduces the parameter count and FLOPs and thereby offsets the cost of both the MALA and CondCCF modules.
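The gains quoted above follow directly from the mIoU columns of Table 5, as the short check below illustrates. The values are copied from the table. The dictionary keyed by enabled modules is just a convenient, hypothetical way to organize them.

```python
# mIoU (%) on (SIRST-v1, NUDT-SIRST, IRSTD-1K), copied from Table 5.
# Keys list the enabled modules; the empty tuple is the U-Net baseline.
table5 = {
    (): (71.51, 74.87, 59.94),
    ("MALA",): (75.32, 90.42, 64.71),
    ("CondConv",): (74.82, 92.73, 65.71),
    ("CondCCF",): (72.00, 89.95, 64.80),
    ("MALA", "CondConv"): (72.40, 91.96, 66.96),
    ("CondConv", "CondCCF"): (73.52, 92.91, 68.06),
    ("MALA", "CondCCF"): (75.51, 90.39, 66.55),
    ("MALA", "CondConv", "CondCCF"): (77.98, 95.43, 70.46),
}

full = table5[("MALA", "CondConv", "CondCCF")]
base = table5[()]
others = [v for k, v in table5.items() if len(k) < 3]

# Gain of the full model over the baseline, per dataset.
gain_over_base = [round(f - b, 2) for f, b in zip(full, base)]
# Gain of the full model over the best of all other configurations, per dataset.
gain_over_second = [
    round(f - max(o[d] for o in others), 2) for d, f in enumerate(full)
]
print(gain_over_base)
print(gain_over_second)
```

Note that the second-best configuration differs by dataset: MALA with CondCCF on SIRST-v1, and CondConv with CondCCF on NUDT-SIRST and IRSTD-1K.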

4.6. MCC-Net Performance in Challenging Scenarios

To validate MCC-Net’s detection performance under challenging conditions, we selected multiple infrared images encompassing three representative difficult scenarios: poor target–background contrast, complex background textures, and bright interference regions. As demonstrated in Figure 13, MCC-Net successfully detected all targets across these challenging images without any false positives or missed detections. The model effectively discriminates targets from backgrounds even in highly complex scenarios such as forest environments and reflective sea surfaces, where target–background texture overlap is significant. Furthermore, MCC-Net maintains robust performance in scenes containing bright interfering structures such as buildings and signal towers, consistently avoiding both false alarms and missed detections. These comprehensive results substantiate MCC-Net’s superior detection capability and robustness in handling challenging infrared small-target detection scenarios.

5. Conclusions

In this paper, we propose MCC-Net, a novel framework for IRSTD. Building upon the U-Net baseline architecture, we integrate the MALA module into MCC-Net’s skip connections. Within the MALA module, we uniquely incorporate Local Enhanced Positional Encoding and Rotary Positional Embedding (RoPE) while optimizing the computational pathway. This design endows the model with dual spatial contextual modeling capabilities—simultaneously capturing local details and perceiving global spatial relationships. Such an approach significantly enhances the model’s ability to discriminate between background and targets while maintaining substantially lower computational complexity than conventional attention mechanisms. Furthermore, we introduce a novel Conditional Cross-Channel Fusion (CondCCF) module that fuses features extracted by each MALA layer—after channel attention processing—with deep-level features optimized through bilinear interpolation. This design establishes a complementary spatial-channel dual-attention mechanism in conjunction with the MALA module, facilitating efficient integration of deep and shallow features and enabling the model to focus on extracting salient small-target features. Finally, we incorporate Conditionally Parameterized Convolutions (CondConv) to replace traditional convolutional operations, enhancing feature extraction capabilities while reducing computational overhead.

We conduct comprehensive comparative experiments on three public benchmark datasets. The experimental results demonstrate that MCC-Net achieves state-of-the-art performance across multiple evaluation metrics and visual qualitative assessments. Furthermore, computational complexity analysis reveals that our proposed model achieves better detection performance with substantially lower computational requirements than existing state-of-the-art methods. Extensive ablation studies on different combinations of our proposed strategies confirm that all three components—MALA, CondConv, and CondCCF—are individually effective and that they collectively achieve optimal detection performance when integrated synergistically.

Author Contributions

Conceptualization, X.Z. and X.W.; Methodology, X.Z. and Q.Z.; Software, X.Z.; Supervision, K.J., L.D. and Y.T.; Validation, X.Z., X.L. and X.W.; Writing—Original Draft, X.Z.; Writing—Review and Editing, Q.Z., X.W., Y.X. and M.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences (grant number G25-042-II).

Data Availability Statement

The datasets presented in this study can be downloaded here: https://github.com/YimianDai/sirst (SIRST-v1, accessed on 21 January 2026). https://github.com/YeRen123455/Infrared-Small-Target-Detection (NUDT-SIRST, accessed on 21 January 2026). https://github.com/RuiZhang97/ISNet (IRSTD-1K, accessed on 21 January 2026).

Acknowledgments

We thank the editors and reviewers for their hard work and valuable advice.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Liu, Q.; Chen, H.; Zhao, F. Maritime target detection algorithm based on fusion of visible and infrared images. J. Supercomput. 2025, 81, 22.
2. Li, R.; An, W.; Wang, Y.; Ying, X.; Dai, Y.; Wang, L.; Li, M.; Guo, Y.; Liu, L. Probing Deep into Temporal Profile Makes the Infrared Small Target Detector Much Better. arXiv 2026, arXiv:2506.12766.
3. Zhang, Y.; Zhang, Y.; Shi, Z.; Zhang, J.; Wei, M. Design and training of deep CNN-based fast detector in infrared SUAV surveillance system. IEEE Access 2019, 7, 137365–137377.
4. Liu, L.; Zhang, S.; Hu, M. MFE-Net: A Novel Multiscale Feature Enhancement Network for SAR Ship Instance Segmentation. IEEE Geosci. Remote Sens. Lett. 2025, 22, 1–5.
5. Li, H.L.; Chen, S.W. Polyhedral Corner Reflectors Multidomain Joint Characterization With Fully Polarimetric Radar. IEEE Trans. Antennas Propag. 2025, 73, 10679–10693.
6. Cao, S.; Deng, J.; Luo, J.; Li, Z.; Hu, J.; Peng, Z. Local convergence index-based infrared small target detection against complex scenes. Remote Sens. 2023, 15, 1464.
7. Li, H.L.; Chen, S.W. General Polarimetric Correlation Pattern: A Visualization and Characterization Tool for Target Joint-Domain Scattering Mechanisms Investigation. IEEE Trans. Geosci. Remote Sens. 2026, 64, 1–17.
8. He, X.; Zhou, Y.; Zhao, J.; Zhang, D.; Yao, R.; Xue, Y. Swin transformer embedding UNet for remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15.
9. Bai, X.; Zhou, F. Analysis of new top-hat transformation and the application for infrared dim small target detection. Pattern Recognit. 2010, 43, 2145–2156.
10. Deshpande, S.D.; Er, M.H.; Venkateswarlu, R.; Chan, P. Max-mean and max-median filters for detection of small targets. In Proceedings of the SPIE; SPIE: Bellingham, WA, USA, 1999; Volume 3809, pp. 74–83.
11. Gao, C.; Meng, D.; Yang, Y.; Wang, Y.; Zhou, X.; Hauptmann, A.G. Infrared patch-image model for small target detection in a single image. IEEE Trans. Image Process. 2013, 22, 4996–5009.
12. Zhang, L.; Peng, Z. Infrared small target detection based on partial sum of the tensor nuclear norm. Remote Sens. 2019, 11, 382.
13. Chen, C.P.; Li, H.; Wei, Y.; Xia, T.; Tang, Y.Y. A local contrast method for small infrared target detection. IEEE Trans. Geosci. Remote Sens. 2013, 52, 574–581.
14. Wei, Y.; You, X.; Li, H. Multiscale patch-based contrast measure for small infrared target detection. Pattern Recognit. 2016, 58, 216–226.
15. Han, J.; Moradi, S.; Faramarzi, I.; Zhang, H.; Zhao, Q.; Zhang, X.; Li, N. Infrared small target detection based on the weighted strengthened local contrast measure. IEEE Geosci. Remote Sens. Lett. 2021, 18, 1670–1674.
16. Han, J.; Moradi, S.; Faramarzi, I.; Liu, C.; Zhang, H.; Zhao, Q. A local contrast method for infrared small-target detection utilizing a tri-layer window. IEEE Geosci. Remote Sens. Lett. 2020, 17, 1822–1826.
17. Hou, Q.; Zhang, L.; Tan, F.; Xi, Y.; Zheng, H.; Li, N. ISTDU-Net: Infrared Small-Target Detection U-Net. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5.
18. Hou, Q.; Wang, Z.; Tan, F.; Zhao, Y.; Zheng, H.; Zhang, W. RISTDnet: Robust Infrared Small Target Detection Network. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5.
19. Wang, H.; Zhou, L.; Wang, L. Miss detection vs. false alarm: Adversarial learning for small object segmentation in infrared images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8509–8518.
20. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Asymmetric contextual modulation for infrared small target detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Online, 5–9 January 2021; pp. 950–959.
21. Wu, X.; Hong, D.; Chanussot, J. UIU-Net: U-Net in U-Net for infrared small object detection. IEEE Trans. Image Process. 2023, 32, 364–376.
22. Li, B.; Cao, J.; Ning, Y.; Zhao, T.; Li, Z.; Wang, Z.; Zhang, L.; Hao, Q. Dense nested attention network for infrared small target detection. IEEE Trans. Image Process. 2023, 32, 1745–1758.
23. Wu, T.; Li, B.; Luo, Y.; Wang, Y.; Xiao, C.; Liu, T.; Yang, J.; An, W.; Guo, Y. MTU-Net: Multilevel TransUnet for space-based infrared tiny ship detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5601015.
24. Liu, S.; Chen, P.; Woźniak, M. Image enhancement-based detection with small infrared targets. Remote Sens. 2022, 14, 3232.
25. Zuo, Z.; Tong, X.; Wei, J.; Su, S.; Wu, P.; Guo, R.; Sun, B. AFFPN: Attention fusion feature pyramid network for small infrared target detection. Remote Sens. 2022, 14, 3412.
26. Yu, C.; Liu, Y.; Wu, S.; Xia, X.; Hu, Z.; Lan, D.; Liu, X. Pay attention to local contrast learning networks for infrared small target detection. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5.
27. Tong, X.; Sun, B.; Wei, J.; Zuo, Z.; Su, S. EAAU-Net: Enhanced asymmetric attention U-Net for infrared small target detection. Remote Sens. 2021, 13, 3200.
28. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
29. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Attentional local contrast networks for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9813–9824.
30. Sun, H.; Bai, J.; Yang, F.; Bai, X. Receptive-field and direction induced attention network for infrared dim small target detection with a large-scale dataset IRDST. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5000513.
31. Wang, K.; Du, S.; Liu, C.; Cao, Z. Interior attention-aware network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5002013.
32. Zhang, T.; Li, L.; Cao, S.; Pu, T.; Peng, Z. Attention-guided pyramid context networks for detecting infrared small target under complex background. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 4250–4261.
33. Zhang, M.; Bai, H.; Zhang, J.; Zhang, R.; Wang, C.; Guo, J.; Gao, X. RKformer: Runge-Kutta transformer with random-connection attention for infrared small target detection. In Proceedings of the 30th ACM International Conference on Multimedia, Lisbon, Portugal, 10–14 October 2022; pp. 1730–1738.
34. Pan, P.; Wang, H.; Wang, C.; Nie, C. ABC: Attention with bilinear correlation for infrared small target detection. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 2381–2386.
35. Liu, F.; Gao, C.; Chen, F.; Meng, D.; Zuo, W.; Gao, X. Infrared small and dim target detection with transformer under complex backgrounds. IEEE Trans. Image Process. 2023, 32, 5921–5932.
36. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11.
37. Han, D.; Pan, X.; Han, Y.; Song, S.; Huang, G. Flatten transformer: Vision transformer using focused linear attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 5961–5971.
38. Han, D.; Pu, Y.; Xia, Z.; Han, Y.; Pan, X.; Li, X.; Lu, J.; Song, S.; Huang, G. Bridging the divide: Reconsidering softmax and linear attention. Adv. Neural Inf. Process. Syst. 2024, 37, 79221–79245.
39. Fan, Q.; Huang, H.; Ai, Y.; He, R. Rectifying magnitude neglect in linear attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–23 October 2025; pp. 21505–21514.
40. Yang, B.; Bender, G.; Le, Q.V.; Ngiam, J. CondConv: Conditionally parameterized convolutions for efficient inference. Adv. Neural Inf. Process. Syst. 2019, 32, 1–12.
41. Wang, H.; Cao, P.; Wang, J.; Zaiane, O.R. UCTransNet: Rethinking the skip connections in U-Net from a channel-wise perspective with transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 2441–2449.
42. Zhang, M.; Zhang, R.; Yang, Y.; Bai, H.; Zhang, J.; Guo, J. ISNet: Shape matters for infrared small target detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 877–886.
43. Li, Q.; Jia, X.; Zhou, J.; Shen, L.; Duan, J. Rediscovering BCE loss for uniform classification. arXiv 2024, arXiv:2403.07289.
44. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980.
45. Loshchilov, I.; Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983.
46. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 13–16 December 2015; pp. 1026–1034.
47. Sun, Y.; Yang, J.; An, W. Infrared Dim and Small Target Detection via Multiple Subspace Learning and Spatial-Temporal Patch-Tensor Model. IEEE Trans. Geosci. Remote Sens. 2021, 59, 3737–3752.
48. Yuan, S.; Qin, H.; Yan, X.; Akhtar, N.; Mian, A. SCTransNet: Spatial-Channel Cross Transformer Network for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15.
49. Xu, M.; Yu, C.; Li, Z.; Tang, H.; Hu, Y.; Nie, L. HDNet: A Hybrid Domain Network With Multiscale High-Frequency Information Enhancement for Infrared Small-Target Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–15.

Figure 1. Architectures of three representative deep learning methods: (a) DNA-Net, (b) UIU-Net, and (c) MTU-Net.

Figure 2. The structure of MCC-Net.

Figure 3. The structure of CondCBL and Res_CondCBL.

Figure 4. The structure of CondCCF and CondCBR.

Figure 5. The structure of DSFE.

Figure 6. The processing pipeline of the MALA module.

Figure 7. Architectural comparison between (a) CondConv and (b) Mixture of Experts.

Figure 8. Architectural diagrams of the three proposed convolutional modules: (a) CondCBL, (b) Res-CondCBL, and (c) CondCBR.

Figure 9. Flowchart of CCA calculation.

Figure 10. Examples of padded images from the three datasets.

Figure 11. ROC curves of different methods on three datasets: (a) SIRST-v1, (b) NUDT-SIRST, and (c) IRSTD-1K.

Figure 12. Detection results of ACM, UIUNet, DNANet, and MCC-Net on the SIRST-v1, NUDT-SIRST, and IRSTD-1K datasets. Red, blue, and yellow circles indicate correctly detected targets, missed detections, and false alarms, respectively. GT denotes ground truth annotations.

Figure 13. Visualization of detection results from MCC-Net on challenging scenarios across three public datasets. Red circles indicate correctly detected targets.

Table 1. Comparison of infrared small-target detection datasets.

Dataset	Number of Images	Padded Resolution	Training Set Ratio
SIRST-v1	427	$512 \times 512$	50%
NUDT-SIRST	1327	$256 \times 256$	50%
IRSTD-1K	1001	$512 \times 512$	80%

Table 2. Performance comparison of different methods on the SIRST-v1, NUDT-SIRST, and IRSTD-1K datasets in terms of mIoU (%), F-measure (%), Pd (%), and Fa ($10^{-6}$). The best results are highlighted in red, and the second-best results are underlined.

Method	SIRST-v1 mIoU	SIRST-v1 F	SIRST-v1 Pd	SIRST-v1 Fa	NUDT-SIRST mIoU	NUDT-SIRST F	NUDT-SIRST Pd	NUDT-SIRST Fa	IRSTD-1K mIoU	IRSTD-1K F	IRSTD-1K Pd	IRSTD-1K Fa
Top-Hat [9]	7.13	14.62	79.84	1012	20.71	33.51	78.41	166	10.05	16.01	75.11	1432
Max–Median [10]	4.15	10.66	69.18	55.33	4.18	7.62	58.39	36.89	6.98	8.14	65.19	59.73
WSLCM [15]	1.15	4.80	77.93	5445	2.27	5.98	56.80	1309	3.44	2.12	72.42	6618
TLLCM [16]	1.02	4.98	79.07	5898	2.16	7.22	61.99	1607	3.30	2.18	77.37	6737
IPI [11]	25.67	43.64	84.62	16.65	17.76	26.93	74.48	41.21	27.92	35.67	81.36	16.16
PSTNN [12]	30.30	39.16	72.80	48.97	14.85	35.63	66.13	44.15	24.57	37.18	71.99	35.24
MSLSTIPT [47]	10.30	18.82	82.12	1130	8.34	18.25	47.39	888	11.43	12.22	79.02	1523
ACM [20]	68.91	80.86	91.62	15.21	61.10	75.86	93.11	55.20	59.21	74.37	93.26	65.26
ALCNet [29]	70.82	82.91	94.30	36.15	64.73	78.58	94.18	34.61	60.59	75.46	92.98	58.82
RDIAN [30]	68.71	81.44	93.53	43.29	76.27	86.52	95.76	34.56	56.44	72.12	88.54	26.63
ISTDU [17]	75.52	86.06	96.56	14.54	89.55	94.49	97.67	13.44	66.36	79.58	93.60	53.15
MTU-Net [23]	74.76	85.36	93.52	22.35	74.83	84.46	93.95	46.94	66.09	79.25	93.25	36.79
IAANet [31]	74.20	85.01	93.52	22.69	90.20	94.87	97.25	8.31	66.23	78.33	93.14	14.19
AGPCNet [32]	75.69	85.26	96.47	14.98	88.87	93.88	97.19	10.01	66.29	79.58	92.82	13.11
DNA-Net [22]	75.80	86.23	95.81	8.76	88.19	93.72	98.82	8.98	65.90	79.43	90.90	12.22
UIU-Net [21]	76.91	86.95	95.81	14.12	93.48	96.63	98.31	7.78	66.15	79.63	93.97	22.06
SCTransNet [48]	75.36	85.95	96.27	18	92.24	95.96	98.82	21	69.51	82.01	90.75	55
HDNet [49]	72.82	84.27	94.14	19	77.76	87.49	96.37	76	67.82	80.83	92.12	49
MCC-Net	77.98	87.62	96.58	16	95.43	97.66	98.94	11	70.46	82.67	90.64	51

Table 3. AUC values of different methods on the SIRST-v1, NUDT-SIRST, and IRSTD-1K datasets.

Method	SIRST-v1	NUDT-SIRST	IRSTD-1K
ACM [20]	0.8127	0.5970	0.7144
ALCNet [29]	0.8826	0.7748	0.7997
AGPCNet [32]	0.8174	0.7415	0.7772
DNA-Net [22]	0.7950	0.8035	0.7625
HDNet [49]	0.8919	0.8937	0.8790
IAANet [31]	0.8579	0.8370	0.8533
ISTDU [17]	0.8228	0.9125	0.7519
MTU-Net [23]	0.8288	0.6228	0.7322
RDIAN [30]	0.7320	0.6675	0.6198
SCTransNet [48]	0.8062	0.9351	0.8388
UIU-Net [21]	0.7194	0.9015	0.7227
MCC-Net	0.9364	0.9816	0.8918

Table 4. Comparison of computational complexity and detection performance across deep learning methods.

Model	Params (M)	FLOPs (G)	SIRST-v1 mIoU (%)	SIRST-v1 FPS	NUDT-SIRST mIoU (%)	NUDT-SIRST FPS	IRSTD-1K mIoU (%)	IRSTD-1K FPS
DNA-Net [22]	4.70	14.26	75.80	12	88.19	33	65.90	12
UIU-Net [21]	45.22	39.32	76.91	11	93.48	32	66.15	11
SCTransNet [48]	11.69	40.46	75.36	16	92.24	37	69.51	15
HDNet [49]	3.68	22.73	72.82	33	77.76	47	67.82	33
MCC-Net	0.64	1.54	77.98	63	95.43	84	70.46	63

Table 5. Ablation experiments for different components on three infrared small-target detection datasets.

MALA	CondConv	CondCCF	Params (M)	FLOPs (G)	FPS	SIRST-v1 mIoU (%)	NUDT-SIRST mIoU (%)	IRSTD-1K mIoU (%)
			3.38	5.82	47	71.51	74.87	59.94
✓			3.83	7.01	44	75.32	90.42	64.71
	✓		0.19	0.81	72	74.82	92.73	65.71
		✓	3.56	5.85	45	72.00	89.95	64.80
✓	✓		0.46	1.23	67	72.40	91.96	66.96
	✓	✓	0.37	1.11	69	73.52	92.91	68.06
✓		✓	4.01	7.36	42	75.51	90.39	66.55
✓	✓	✓	0.64	1.54	63	77.98	95.43	70.46

