Article

MEFA-Net: Multilevel Feature Extraction and Fusion Attention Network for Infrared Small-Target Detection

1 National Laboratory on Adaptive Optics, Chengdu 610209, China
2 Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu 610209, China
3 University of Chinese Academy of Sciences, Beijing 101408, China
4 AVIC Chengdu Aircraft Design & Research Institute, Chengdu 610091, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(14), 2502; https://doi.org/10.3390/rs17142502
Submission received: 20 May 2025 / Revised: 2 July 2025 / Accepted: 14 July 2025 / Published: 18 July 2025

Abstract

Infrared small-target detection encounters significant challenges due to a low image signal-to-noise ratio, limited target size, and complex background noise. To address the issues of sparse feature loss for small targets during the down-sampling phase of the traditional U-Net network and the semantic gap in the feature fusion process, a multilevel feature extraction and fusion attention network (MEFA-Net) is designed. Specifically, the dilated direction-sensitive convolution block (DDCB) is devised to collaboratively extract local detail features, contextual features, and Gaussian salient features via ordinary convolution, dilated convolution, and parallel strip convolution. Furthermore, the encoder attention fusion module (EAF) is employed, where spatial and channel attention weights are generated using dual-path pooling to achieve the adaptive fusion of deep and shallow layer features. Lastly, an efficient up-sampling block (EUB) is constructed, integrating a hybrid up-sampling strategy with multi-scale dilated convolution to refine the localization of small targets. The experimental results confirm that the proposed model surpasses most existing recent methods. Compared with the baseline, the intersection over union (IoU) and probability of detection ($P_d$) of MEFA-Net on the IRSTD-1k dataset are increased by 2.25% and 3.05%, respectively, achieving better detection performance and a lower false alarm rate in complex scenarios.

1. Introduction

Infrared small-target detection has been extensively utilized in various domains, including military security, maritime search and rescue, and forest fire prevention. In practical scenarios, infrared small targets typically exhibit the following attributes: the target size is relatively small, often spanning merely a few pixels, and the image signal-to-noise ratio (SNR) is low, making small targets susceptible to interference from the background environment. Additionally, target features are often incomplete, and many associated features fail to provide adequate information for effective detection and processing. These characteristics pose significant challenges for deep learning methods when addressing infrared small targets. First, the limited spatial extent of small targets results in restricted information content, which can be easily overshadowed by complex backgrounds during conventional convolution operations in deep learning models, leading to the loss of discriminative information or even the incorporation of noise into the learned features. Second, pooling operations, while reducing computational complexity, also diminish the spatial resolution of feature maps and weaken the representation of spatial information. For extremely small targets, their features are often concealed within strong semantic information embedded deep within the backbone network, making direct extraction particularly challenging.
To tackle these challenges, infrared small-target segmentation methods need to ensure robust feature representation capabilities while avoiding the information loss typically associated with deep networks. Furthermore, the effective integration and utilization of low-level spatial details with high-level semantic cues is critical to enhancing the model's robustness in environments dominated by complex backgrounds and noise, leading to more accurate detection and segmentation.
The existing infrared small-target detection techniques are mainly classified into classical algorithms and deep learning algorithms. Classical algorithms can be further classified based on their design principles into filter-based methods [1,2,3,4,5,6,7,8], local information-based methods [9,10,11,12,13,14], and data structure-based methods [15,16,17,18,19]. Filter-based methods were among the earliest approaches employed for small-target detection, exploiting differences between the small targets and cluttered backgrounds in spatial–frequency domains to design specific filters. Local information-based methods identify targets by leveraging abrupt changes in gray levels and brightness within the target and its immediate surrounding region. Data structure-based methods distinguish targets from the background by exploiting inherent structural attributes, including target sparsity and background low-rankness. However, classical methods often excessively depend on prior knowledge, resulting in relatively poor robustness when significant variations occur in the target or scene [20].
Compared to traditional algorithms, data-driven deep learning methods can autonomously capture discriminative features from data without relying on predefined assumptions or handcrafted features, demonstrating superior detection performance. Owing to the tiny size of infrared targets, minor shifts in bounding box annotations during training can cause significant fluctuations in the Intersection over Union (IoU). Consequently, most infrared small-target detection tasks utilize image segmentation approaches. Currently, many deep learning methods [21,22,23,24,25,26,27,28,29] leverage the traditional U-Net framework [30], which features an encoder–decoder architecture. The role of the encoder involves extracting hierarchical features while gradually decreasing the spatial resolution of the feature representations; meanwhile, the decoder aims to fuse features across layers to recover the spatial information lost during encoding. Although CNN-based methods have demonstrated notable performance in the detection of infrared small targets, some issues, such as the loss of deep features of small targets, the lack of effective fusion and utilization of high- and low-level features, and the limitation of traditional up-sampling methods, have not been effectively solved. Specifically, the feature extraction stages often lack designs specifically optimized for the inherent attributes of infrared small targets. This results in insufficient discrimination between targets and complex backgrounds, and leads to target information being lost during down-sampling. In the feature fusion process, the relationships between semantic features across different layers are not effectively modeled. The mismatch between features of different layers instead has a negative impact on the detection of small targets. In addition, the smoothing effect and checkerboard effect caused by traditional sampling methods such as bilinear interpolation and transposed convolution fail to perform more accurate decoding.
To address the above-mentioned issues, the MEFA-Net network is proposed. To mitigate the challenge of the feature degradation of small targets through successive pooling, a dilated direction-sensitive convolution block (DDCB) is designed. Via the parallel convolution strategy, this module effectively captures local details, contextual information, and the Gaussian salient features of targets. Rich pixel-level feature representations enhance segmentation accuracy and improve the sparse feature extraction of small targets. Furthermore, to enhance the correlation between semantic information at different levels, an encoder attention fusion (EAF) module [31] is adopted. This module leverages spatial and channel attention mechanisms to produce more comprehensive feature representations, achieving the more precise segmentation of small targets under background noise and complex interference. The output of the EAF module is transmitted to the decoder through an enhanced skip connection, which fuses attention-enhanced features from both encoder layers. In addition, a straightforward skip connection directly transfers fine-grained spatial information to the decoder. Finally, to compensate for the limitations of traditional up-sampling methods, an efficient up-sampling block integrates PixelShuffle and bilinear interpolation for high-efficiency decoding. This module also employs dilated convolutions with diverse dilation factors to execute multi-scale contextual enhancement during the iterative reconstruction of spatial information. The synergistic application of these mechanisms enables the model to effectively retain the contextual details of small targets within infrared images, facilitating more accurate segmentation results. Empirical results on the public datasets NUAA-SIRST and IRSTD-1k demonstrate the efficacy of our approach, showing stronger robustness in small-target detection under complex background conditions and achieving superior segmentation performance.
The primary contributions of our work are summarized as follows:
(1)
Collaborative feature extraction strategy: The DDCB module focuses on the intrinsic attributes of the target and fully considers the Gaussian characteristics of the gray-scale distribution of infrared small targets. Synergistically combining context features and Gaussian salient features enhances the expression of small-target sparse features in the encoding stage.
(2)
Attention-guided hierarchical fusion method: Inspired by medical image segmentation, the EAF module uses attention mechanisms to guide the fusion of output features from adjacent encoder layers and transmits the fused features to the corresponding decoder layers by utilizing enhanced skip connections. This process significantly strengthens the semantic correlation between different feature layers, facilitating more precise segmentation in complex backgrounds.
(3)
Efficient up-sampling mechanism: The EUB module uses PixelShuffle to rearrange a portion of the feature map's channel information into spatial information for up-sampling, which retains the rich information in the original features and enhances flexibility and feature preservation during decoding.
The subsequent sections are structured as follows. Section 2 reviews the related literature in this field. Section 3 elaborates the architecture of MEFA-Net and the specific details of each module. Section 4 outlines the experimental methodology and comprehensively analyzes the results. Section 5 discusses the strengths and limitations of the proposed method. Finally, Section 6 concludes this work.

2. Related Works

2.1. Infrared Small-Target Detection

Traditional algorithms based on handcrafted features primarily include filter-based methods, local information-based methods, and data structure-based methods. Filter-based approaches utilize specially designed image filters to suppress background clutter, including spatial domain filtering [1,2], frequency domain filtering [3,4,5], and morphological filtering techniques [6,7,8]. However, the efficacy of these methods is typically limited to applications exhibiting simple and uniform backgrounds. Local information-based methods exploit the saliency of small targets within their local backgrounds, employing techniques such as local contrast mechanisms [9,10,11,12] and local entropy methods [13,14]. The local contrast approach simulates human visual perception, using contrast as the primary criterion for target extraction. The local entropy technique, based on the maximum entropy principle, assigns higher entropy values to homogeneous regions compared to heterogeneous ones, thereby facilitating the separation of small targets from the background. These methods are effective when targets exhibit high local contrast; however, they encounter limitations when targets are dark or blended into the background. Data structure-based methods [15,16,17,18,19] recast the task of small-target detection as a mathematical problem involving low-rank and sparse matrix decomposition. By exploiting the sparse nature of targets and the low-rank structural properties of the background regions, these methods aim to differentiate targets from clutter. Nonetheless, they tend to have higher computational complexity, and background edges or noise are often mistakenly decomposed as sparse components, leading to false alarms.
Driven by rapid advancements in deep learning paradigms, CNN-based approaches have demonstrated superior performance in infrared small-target detection. Wang et al. [21] introduced generative adversarial networks, leveraging the adversarial training mechanism between generative and discriminative modules to compensate for the significant disparity between positive and negative samples in infrared small-target detection. ACM [22] facilitated cross-scale feature interaction by integrating local attention refinement and global attention calibration mechanisms, achieving effective asymmetric feature fusion. ALCNet [23] restructured conventional local contrast analysis techniques into a comprehensive neural network integrated with attention-driven modulation mechanisms. By designing densely nested interaction modules, DNANet [24] facilitated layer-by-layer interaction between high-level semantic representations and low-level detailed features. UIUNet [25] embedded a sub-U-Net architecture within a larger U-Net framework, promoting multi-level and multi-scale feature extraction and fusion. Won et al. [26] designed a multi-scale feature fusion network based on UNet3+ with residual attention blocks, utilizing full-scale skip connections to fuse features across different network stages. SpirDet [27] proposed a dual-branch sparse decoder, where a fast branch predicts coarse target locations and a slow branch refines the positions within the coarse regions, enabling precise small-target detection. DMFNet [28] employed a dual-encoder architecture to effectively extract the features of small targets; however, this design also results in significantly increased computational complexity. Quan et al. [29] first introduced the hint mechanism, which utilizes prior knowledge of the positions of small targets to highlight key local features.

2.2. Image Segmentation Structure Based on U-Net

The U-Net architecture, a foundational framework for many infrared small-target segmentation algorithms, was initially developed for medical image segmentation. The U-Net network comprises two key components: a contracting path consisting of several convolutional layers for down-sampling to extract semantic and contextual features, and an expansive path with a series of convolutional blocks employing up-sampling operations to progressively increase the spatial resolution of the feature maps [32]. As the network deepens, it learns advanced semantic features but gradually loses the location information of target features, which is crucial for reconstructing the segmentation results. To mitigate this loss of information and make network reconstruction more accurate, U-Net employs skip connections between the encoder and decoder paths at corresponding levels. The skip connections transfer the semantic and location information extracted by the encoder to the decoder at the same stage, combining high-resolution spatial detail with high-level semantic features to facilitate more accurate localization.
In 2020, U-Net++ [33] introduced a more flexible feature fusion approach by integrating multi-scale semantic features through dense skip connections. Additionally, Huang et al. [34] incorporated full-scale skip connections within the UNet3+ architecture, enabling multi-scale feature fusion through the deep supervision of high-level semantic information and comprehensive skip pathways. Within the domain of infrared small-target detection, subsequent studies [24,25,26] proposed enhanced methods based on U-Net, U-Net++, and UNet3+, which utilize multi-scale feature fusion and attention mechanisms to improve the feature representation and boundary localization of infrared small targets. Moreover, DCANet [35] employs dense networks combined with the SimAM attention module to accurately capture small-target features and address issues related to target occlusion in deeper network layers. These advancements underscore the significant and widespread application of UNet-based architectures in small-target detection tasks.

2.3. Discussion

Although these existing methods have made significant progress in performance, they still face several challenges. Firstly, during the feature extraction stage, repeated down-sampling operations gradually result in the loss of small-target information. Although some existing models [21,22] address this problem by fusing small-target features from different depths, they perform poorly in complex environments because they rely only on the original sparse characteristics of small targets.
Secondly, during feature fusion, there exists a semantic gap between high-level and low-level features. The traditional UNet network directly transmits the high-resolution features extracted by the encoder to the decoder through skip connections, overlooking the relationships between semantic features across different layers. Although some methods [24,33,35] aggregate features of different semantic scales through dense skip connections, this also significantly increases model complexity [26,34]. When each decoder level is connected with all encoder levels and all preceding decoder levels, the excessive information redundancy and the mismatch of semantic information between different feature layers instead interfere with the detection of small targets.
Furthermore, in the decoding stage, most existing methods primarily employ two traditional up-sampling methods: bilinear interpolation and transposed convolution. Bilinear interpolation calculates new pixel values as weighted averages of the surrounding pixels, tending to smooth high-frequency details such as edges and textures. Since small targets themselves have few pixels and low contrast, this smoothing effect can diminish their distinctly salient features and sharp edges, thereby increasing the difficulty of detection. Transposed convolution employs learnable convolutional kernels to perform up-sampling, but the uneven activation of the convolution kernels in the overlapping regions leads to checkerboard-like artifacts in the output feature map. These checkerboard effects can generate false textures around small targets, leading to missed detections or false positives.
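The uneven kernel overlap behind the checkerboard effect can be demonstrated in a few lines of PyTorch. The snippet below is purely illustrative (it is not part of the proposed method): with an all-ones kernel and a uniform input, the output of a stride-2, kernel-3 transposed convolution directly displays how many kernel elements cover each output pixel, and the resulting alternating 1/2/4-fold coverage is exactly the checkerboard pattern discussed above.

```python
import torch
import torch.nn as nn

# Illustration of checkerboard artifacts: count how many kernel elements
# contribute to each output pixel of a stride-2, kernel-3 transposed conv.
tconv = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2, bias=False)
with torch.no_grad():
    tconv.weight.fill_(1.0)       # all-ones kernel -> output = coverage count
    coverage = tconv(torch.ones(1, 1, 4, 4))[0, 0]
print(coverage)                    # alternating 1/2/4 values: a checkerboard
```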
These issues motivate us to design targeted, fine-grained encoding processes and to explore more efficient feature fusion and decoding strategies.

3. Methods

3.1. Overall Architecture

The UNet network is a segmentation model that classifies targets pixel by pixel, offering more accurate localization of infrared small targets. The proposed MEFA-Net architecture is built upon the basic encoder–decoder structure of UNet and contains three key modules: the DDCB, EAF, and EUB modules. The overall structure of MEFA-Net is illustrated in Figure 1. The real infrared small targets are marked with red boxes in both the input and output images.
Initially, the input image undergoes feature extraction via the encoder, generating a multi-scale representation comprising four distinct feature maps. The encoder consists of four DDCB modules, producing four sets of feature maps: $F_{E1}$, $F_{E2}$, $F_{E3}$, and $F_{E4}$. The DDCB modules replace the standard convolutional blocks found in U-Net, with the intention of enhancing infrared small-target feature extraction and integrating richer contextual information. The EAF module adaptively fuses the output feature maps from two adjacent DDCB modules, guided by attention mechanisms. The fused result is then transferred to the corresponding decoder layers via enhanced skip connections. This combination of enhanced skip connections with simple skip connections aids the model in achieving more precise segmentation. The decoder consists of three EUB modules and three residual blocks, resulting in three sets of feature maps: $F_{D3}$, $F_{D2}$, and $F_{D1}$. The EUB module implements efficient decoding with PixelShuffle operations combined with bilinear interpolation up-sampling, in place of traditional up-sampling methods. The residual block further fuses and optimizes the outputs of the DDCB, EAF and EUB modules to ensure the integrity and consistency of the features. In addition, a feature pyramid fusion module at the end of the decoder uses multi-scale fusion to generate robust feature maps by effectively integrating details and semantic information. Finally, the fused decoder features are further refined through the residual block and fed into the detection head to generate the final prediction. MEFA-Net optimizes the feature extraction and segmentation of infrared small targets, addresses challenges related to information loss, and efficiently enhances semantic clarity. Table 1 shows the output size of each layer in MEFA-Net.
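For concreteness, the wiring just described can be summarized in a structural PyTorch sketch. This is our own illustrative reading of Figure 1, not the authors' released implementation: `DDCB`, `EAF`, and `EUB` refer to the sketches given in Sections 3.2, 3.3 and 3.4 below, the residual block is assumed to be a standard two-convolution design, the channel widths are illustrative, and the final feature pyramid fusion is omitted for brevity.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Assumed form: two 3x3 convolutions with a 1x1 projection shortcut."""
    def __init__(self, cin, cout):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(),
            nn.Conv2d(cout, cout, 3, padding=1), nn.BatchNorm2d(cout))
        self.skip = nn.Conv2d(cin, cout, 1)

    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))

class MEFANetSketch(nn.Module):
    """Structural sketch of MEFA-Net (Figure 1); pyramid fusion omitted."""
    def __init__(self, ch=(16, 32, 64, 128)):
        super().__init__()
        ins = (1,) + ch[:-1]                    # single-channel infrared input
        self.enc = nn.ModuleList(DDCB(ci, co) for ci, co in zip(ins, ch))
        self.pool = nn.MaxPool2d(2)
        self.eaf = nn.ModuleList(EAF(ch[i], ch[i + 1]) for i in range(3))
        self.eub = nn.ModuleList(EUB(ch[i + 1], ch[i]) for i in range(3))
        self.res = nn.ModuleList(ResidualBlock(3 * ch[i], ch[i]) for i in range(3))
        self.head = nn.Conv2d(ch[0], 1, 1)

    def forward(self, x):
        feats = []
        for i, enc in enumerate(self.enc):
            x = enc(x if i == 0 else self.pool(x))
            feats.append(x)                                   # F_E1 .. F_E4
        d = feats[3]
        for i in (2, 1, 0):                                   # F_D3 .. F_D1
            up = self.eub[i](d)                               # efficient up-sampling
            fused = self.eaf[i](feats[i], feats[i + 1])       # enhanced skip
            d = self.res[i](torch.cat([up, fused, feats[i]], dim=1))
        return torch.sigmoid(self.head(d))                    # pixel-wise prediction
```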

3.2. Dilated Direction-Sensitive Convolution Block

Infrared small targets are characterized by their extremely small size, low signal-to-noise ratio, and lack of texture features. As the network depth increases, the sparse features caused by the lack of shape and fine-grained features tend to be lost during standard convolution operations. To address this issue, the dilated direction-sensitive convolution block (DDCB) is designed, which integrates three parallel convolutional branches. Through this parallel branching strategy, the DDCB module collaboratively extracts multi-level features, effectively capturing diverse contextual information. Compared to traditional target detection networks that rely solely on raw features, the DDCB module combines contextual cues with mechanisms inspired by human visual perception to extract the fused contextual features and Gaussian saliency features of small targets, enhancing the capture of sparse features. The specific structure of the DDCB module is illustrated in Figure 2.
The standard convolutional branch employs standard convolution to extract raw, fine-grained features such as shape, edge, and texture, preserving local spatial consistency and ensuring that essential information within the immediate vicinity of the small target is retained [36]. Due to the low spatial resolution and limited number of pixels in small targets, relying solely on local target information often proves insufficient to distinguish them from visually similar false alarms. Therefore, contextual information is crucial for the detection of small targets. Dilated convolution is used to expand the receptive field, extracting contextual information from a broader spatial extent. This approach compensates for semantic deficiencies caused by the sparse pixels of small targets, enriching feature representation. Drawing inspiration from saliency enhancement methods in traditional detection, pinwheel-shaped convolution [37] is adopted, which is specifically tailored for infrared small-target detection. Pinwheel-shaped convolution utilizes an asymmetric padding strategy to generate horizontal and vertical directional kernel groups. Its receptive field exhibits a radial diffusion pattern, with dense kernels in the central region and sparse kernels in the periphery, simulating a Gaussian weight distribution. This design conforms to the Gaussian statistical properties inherent in the grayscale intensity distributions of infrared dim targets, enabling the model to more effectively prioritize the extraction of the salient features of the target region while attenuating background clutter responses.
The detailed architecture of the DDCB module is as follows. An input feature map with $C$ channels and a spatial resolution of $H \times W$ is processed through three parallel convolutional operations: a standard convolution with a filter size of 3 × 3, a dilated convolution with a kernel size of 3 and a dilation rate of 2, and a pinwheel-shaped convolution with a kernel size of 4. Batch normalization (BN) and rectified linear unit (ReLU) activation are applied after each convolution operation, yielding feature maps $F_c$, $F_d$, and $F_p$, respectively. Subsequently, these feature maps are concatenated and input into a channel–spatial attention module to refine the features at both the channel and pixel levels, emphasizing potentially lost small-target information. Specifically, a CBR (Convolution–BatchNorm–ReLU) block, comprising a 3 × 3 convolution, a batch normalization layer, and a ReLU activation function, is used to integrate and refine the concatenated multi-level features. Subsequently, a channel attention mechanism is adopted to highlight target-relevant channels. A CB (Convolution–BatchNorm) module, consisting of a 1 × 1 convolutional kernel and batch normalization, is applied to adjust the dimensionality of the feature channels. Finally, spatial attention focuses on the target region. The specific formulas representing these operations are as follows:
$F_{out} = \mathrm{ReLU}(SA(CB(CA(CBR(F_c \oplus F_d \oplus F_p)))))$, (1)
where $\oplus$ denotes the channel concatenation operation, $F_{out}$ is the output feature map of the DDCB module, and $CA$ and $SA$ represent the channel attention process and the spatial attention process, respectively.
The channel attention process $CA$ can be summarized as Formulas (2) and (3) [34]:
$CA = \sigma(\mathrm{MLP}(P_{AvgPool}(X)) + \mathrm{MLP}(P_{MaxPool}(X)))$, (2)
$CA_{out} = CA \otimes X$, (3)
where $\otimes$ denotes element-wise multiplication, $\sigma$ denotes the sigmoid function, $P_{AvgPool}$ and $P_{MaxPool}$ denote the average pooling and max pooling operations, $X$ is the input of the channel attention, and $CA_{out}$ is the output of the channel attention. Formula (4) summarizes the specific calculation process of $X$:
$X = CBR(F_c \oplus F_d \oplus F_p)$, (4)
The spatial attention process $SA$ can be summarized as Formulas (5) and (6) [34]:
$SA = \sigma(f^{7 \times 7}([P_{AvgPool}(Y), P_{MaxPool}(Y)]))$, (5)
$SA_{out} = SA \otimes Y$, (6)
where $f^{7 \times 7}$ denotes a convolutional operation with a kernel size of 7, $[A, B]$ denotes the concatenation of $A$ and $B$, $Y$ is the input of the spatial attention, and $SA_{out}$ is the output of the spatial attention. Formula (7) summarizes the specific calculation process of $Y$:
$Y = CB(CA(CBR(F_c \oplus F_d \oplus F_p)))$, (7)
Compared to traditional target detection methods that rely solely on raw target features, the DDCB module combines contextual features and visual attention mechanisms. By integrating the extraction of contextual information and Gaussian salient features, it enhances the extraction of sparse infrared small-target features. The introduction of pinwheel-shaped convolution, specifically designed to capture the Gaussian characteristics of infrared small targets, improves the reliability of small-target feature extraction. This is further reinforced through convolutional fusion operations, which enhance the module’s ability to discriminate small targets, outperforming conventional convolution techniques. Additionally, the channel and spatial attention refinement steps highlight target-related channels via channel weighting and focus on salient regions through spatial attention, effectively suppressing background noise in complex scenes and thereby improving detection performance.
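To make the above pipeline concrete, the following PyTorch sketch assembles the DDCB as we read Figure 2 and Equations (1)–(7). It is an illustrative reading rather than the authors' code: in particular, the pinwheel-shaped convolution of [37] is approximated here by four asymmetrically padded directional strip kernels whose outputs are concatenated, and the channel attention reduction ratio is an assumption.

```python
import torch
import torch.nn as nn

def cbr(cin, cout, k=3, d=1):
    """Conv-BatchNorm-ReLU block; padding preserves the spatial size."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, padding=d * (k - 1) // 2, dilation=d),
        nn.BatchNorm2d(cout), nn.ReLU())

class ChannelAttention(nn.Module):
    """Eqs. (2)-(3): sigmoid(MLP(avg-pool) + MLP(max-pool)), applied to X."""
    def __init__(self, c, r=4):                  # reduction ratio r is assumed
        super().__init__()
        self.mlp = nn.Sequential(nn.Conv2d(c, max(c // r, 1), 1), nn.ReLU(),
                                 nn.Conv2d(max(c // r, 1), c, 1))

    def forward(self, x):
        w = torch.sigmoid(self.mlp(x.mean((2, 3), keepdim=True)) +
                          self.mlp(x.amax((2, 3), keepdim=True)))
        return w * x                              # CA_out = CA (x) X

class SpatialAttention(nn.Module):
    """Eqs. (5)-(6): sigmoid(f7x7([avg-pool, max-pool])), applied to Y."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, 7, padding=3)

    def forward(self, y):
        pooled = torch.cat([y.mean(1, keepdim=True),
                            y.amax(1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(pooled)) * y   # SA_out = SA (x) Y

class PinwheelApprox(nn.Module):
    """Stand-in for pinwheel-shaped convolution [37]: four size-4 strip
    kernels with asymmetric zero padding, one per direction (approximation)."""
    def __init__(self, cin, cout):
        super().__init__()
        c = cout // 4
        chans = [c, c, c, cout - 3 * c]
        pads = [(3, 0, 0, 0), (0, 3, 0, 0), (0, 0, 3, 0), (0, 0, 0, 3)]
        kers = [(1, 4), (1, 4), (4, 1), (4, 1)]
        self.branches = nn.ModuleList(
            nn.Sequential(nn.ZeroPad2d(p), nn.Conv2d(cin, co, k))
            for p, k, co in zip(pads, kers, chans))

    def forward(self, x):
        return torch.cat([b(x) for b in self.branches], dim=1)

class DDCB(nn.Module):
    """Sketch of the dilated direction-sensitive convolution block, Eq. (1)."""
    def __init__(self, cin, cout):
        super().__init__()
        self.std = cbr(cin, cout)                              # F_c
        self.dil = cbr(cin, cout, k=3, d=2)                    # F_d
        self.pin = nn.Sequential(PinwheelApprox(cin, cout),    # F_p
                                 nn.BatchNorm2d(cout), nn.ReLU())
        self.cbr = cbr(3 * cout, cout)                         # CBR
        self.ca = ChannelAttention(cout)                       # CA
        self.cb = nn.Sequential(nn.Conv2d(cout, cout, 1),      # CB
                                nn.BatchNorm2d(cout))
        self.sa = SpatialAttention()                           # SA

    def forward(self, x):
        f = torch.cat([self.std(x), self.dil(x), self.pin(x)], dim=1)
        return torch.relu(self.sa(self.cb(self.ca(self.cbr(f)))))
```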

3.3. Encoder Attention Fusion Module

Inspired by the encoder attention fusion mechanism in medical image segmentation [31], the encoder attention fusion (EAF) module is introduced. The output feature maps of two consecutive DDCB modules are used as the input of the EAF module. Figure 3 shows a detailed description of the EAF module.
In this module, the global spatial pooling is defined as Formula (8) [31]:
$g_s(F) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} (F)_{:,i,j}$, (8)
and the global channel pooling is defined as Formula (9) [31]:
$g_c(F) = \frac{1}{C} \sum_{c=1}^{C} (F)_{c,:,:}$, (9)
Pooling mechanisms are applied to the feature maps derived from the DDCB module to compute the spatial attention weights ($\omega_s$) and channel attention weights ($\omega_c$). The two attention weights are then multiplied to capture the global contextual relationships across features and between different channels. This process emphasizes salient features while suppressing less relevant information, thereby enhancing the model's discriminative capability. After element-wise multiplication, the resulting feature weights are fused through element-wise addition to incorporate multi-scale feature information. To facilitate this fusion, the deeper features $F_{E2}$ are upsampled via transposed convolution (TConv) to match the shallower features $F_{E1}$. In the U-Net architecture, each encoder layer produces features at different scales. $F_{E1}$ provides fine-grained, low-level information, while $F_{E2}$ offers high-level semantic features. Combining the outputs of two consecutive DDCB modules ensures a more comprehensive representation of features. The fused features are then processed through subsequent convolutional and activation layers, resulting in feature maps that retain both spatial and channel attention weights. Convolutional operations maintain the identical channel and spatial dimensions of the inputs, increasing the network depth and refining features without introducing additional parameters. Preserving spatial dimensions facilitates the capture of local spatial information, which is crucial for segmenting small infrared targets against complex backgrounds. Finally, the generated attention weights are multiplied with the feature $F_{E1}$, enabling the EAF module to transmit hierarchically enhanced features to the corresponding decoder modules.
Simple skip connections in MEFA-Net directly transfer fine-grained spatial information from the encoder to the decoder. The EAF module adaptively fuses outputs from two adjacent DDCB modules via an attention mechanism. By leveraging these enhanced skip connections, multi-scale features are effectively integrated. In the decoding stage, EAF is combined with an efficient up-sampling module to guide the model in comprehensively utilizing multi-level feature information, enabling a more complete understanding of the small target.
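A compact PyTorch sketch of the EAF module, based on our reading of Figure 3 and Equations (8) and (9), is given below. The exact convolution/activation stack and the placement of the sigmoid are assumptions inferred from [31], not the authors' released code.

```python
import torch
import torch.nn as nn

class EAF(nn.Module):
    """Sketch of encoder attention fusion (Figure 3, Eqs. (8)-(9)).

    Fuses a shallow encoder feature (e.g. F_E1) with the adjacent deeper
    one (F_E2); the refinement stack is our reading of [31].
    """
    def __init__(self, c_shallow, c_deep):
        super().__init__()
        # TConv brings the deeper feature to the shallower scale
        self.up = nn.ConvTranspose2d(c_deep, c_shallow, 2, stride=2)
        self.refine = nn.Sequential(
            nn.Conv2d(c_shallow, c_shallow, 3, padding=1),
            nn.BatchNorm2d(c_shallow), nn.ReLU(),
            nn.Conv2d(c_shallow, c_shallow, 3, padding=1))

    @staticmethod
    def _attn(f):
        w_c = f.mean(dim=(2, 3), keepdim=True)   # global spatial pooling, Eq. (8)
        w_s = f.mean(dim=1, keepdim=True)        # global channel pooling, Eq. (9)
        return w_c * w_s                         # broadcast product: C x H x W

    def forward(self, f_shallow, f_deep):
        f_deep = self.up(f_deep)                         # match spatial size
        w = self._attn(f_shallow) + self._attn(f_deep)   # element-wise addition
        w = torch.sigmoid(self.refine(w))                # conv + activation
        return w * f_shallow                             # weights applied to F_E1
```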

3.4. Efficient Up-Sampling Block

To address the limitations of traditional up-sampling methods in recovering feature map resolution during the decoding stage, an efficient up-sampling module that integrates a hybrid up-sampling strategy with a multi-scale context-aware mechanism is designed. By combining PixelShuffle with bilinear interpolation, the proposed approach enhances the flexibility of up-sampling and improves feature preservation, thereby reducing information loss. Additionally, dilated convolutions are introduced after the up-sampling operation to strengthen the module's contextual perception capabilities and to utilize multi-scale information for optimizing small-target boundaries. To keep the module lightweight, depthwise separable convolution is employed. Finally, the SiLU nonlinear activation function is incorporated to enhance feature representation. The detailed structure of the EUB module is shown in Figure 4a.
PixelShuffle is a sub-pixel convolution operation that leverages channel-to-spatial rearrangement to generate high-resolution outputs from low-resolution feature maps. Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$ and $W$, respectively, represent the number of channels, height and width of the input feature map, a convolution operation is applied to produce $X' \in \mathbb{R}^{C \cdot r^2 \times H \times W}$, where $r$ represents the up-sampling factor. Subsequently, the data from the channel dimension are rearranged into the spatial dimensions, resulting in an output $Y \in \mathbb{R}^{C \times H \cdot r \times W \cdot r}$. For the EUB module, $r$ is fixed at 2. PixelShuffle achieves up-sampling through the structured rearrangement of channel data, transforming the channel dimension into spatial dimensions. This direct approach preserves semantic information extracted from the original feature map, mitigating the checkerboard artifacts commonly associated with traditional transposed convolutions. Compared to conventional interpolation methods, PixelShuffle maintains high-frequency information within the feature maps, avoiding potential smoothing effects. However, PixelShuffle's channel compression mechanism inevitably forces the network to discard some channel features, leading to a weakening of small-target detail information. To balance high-frequency detail preservation with feature smoothness, the EUB module introduces a dual-path weighted fusion enhancement strategy to refine small-target regions. Specifically, PixelShuffle is used to learn the complex features of the image, generating high-resolution feature maps with richer details. Simultaneously, bilinear interpolation's inherent smoothing characteristics are exploited to maintain background continuity.
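The shape bookkeeping of this rearrangement is easy to verify directly in PyTorch; the following minimal demonstration (illustrative only) uses $r = 2$:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 32, 32)                     # low-resolution feature map
expand = torch.nn.Conv2d(64, 64 * 2 ** 2, 3, padding=1)  # C -> C * r^2, r = 2
y = F.pixel_shuffle(expand(x), upscale_factor=2)   # channels moved into space
print(y.shape)                                     # torch.Size([1, 64, 64, 64])
```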
To mitigate the adverse effects caused by the channel information loss inherent in the PixelShuffle operation, the upsampled features are further refined by incorporating contextual information through parallel dilated convolutions. This approach progressively expands the receptive field using dilated convolutions with varying dilation rates, enabling the network to focus on both local details of small targets and the global contextual information. Finally, the multi-scale features obtained from the parallel dilated convolutions are fused to enhance the target distinguishability. The process can be mathematically expressed as Formula (10):
$X_{context} = BN(X_{d1} + X_{d2} + X_{d3})$, (10)
where $X_{d1}$, $X_{d2}$ and $X_{d3}$ denote the features obtained from convolutions with different dilation rates, and $BN$ denotes batch normalization. The outputs from these convolutions are fused through element-wise addition. The convolution with a dilation rate of 1 enhances target edge responses, enabling the preservation of high-frequency details. Conversely, the convolutions with dilation rates of 2 and 3 suppress background interference, effectively reducing low-frequency background components. Finally, the fused features are refined using depthwise separable convolution and the SiLU nonlinear activation function, which further extract features while significantly reducing the computational complexity.
In summary, the PixelShuffle operation enhances the resolution of the feature map by reorganizing the channel information of the input features for up-sampling, while bilinear interpolation restores the feature map resolution through an interpolation algorithm. The EUB module leverages the advantages of both methods to more effectively recover the fine details of small targets, resulting in clearer and more distinguishable features, and enabling the more accurate segmentation of target position and shape. The multi-scale context-aware mechanism captures features at various scales and fuses them to further strengthen the multi-scale perception capability for small targets. The EUB module thus provides an efficient feature enhancement and processing scheme for the detection and segmentation of small infrared objects, improving overall model performance and facilitating efficient decoding. Finally, the residual block further fuses and refines the features output by the DDCB, EAF, and EUB modules. The structure of the residual block is illustrated in Figure 4b.
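Putting the pieces together, a PyTorch sketch of the EUB is given below. It follows Figure 4a and Equation (10) under stated assumptions: the dual-path fusion weight is modeled as a single learnable scalar, and the channel projections are our own choices rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EUB(nn.Module):
    """Sketch of the efficient up-sampling block (Figure 4a, Eq. (10))."""
    def __init__(self, cin, cout, r=2):
        super().__init__()
        self.expand = nn.Conv2d(cin, cout * r * r, 3, padding=1)  # PixelShuffle path
        self.proj = nn.Conv2d(cin, cout, 1)          # channel match, bilinear path
        self.alpha = nn.Parameter(torch.tensor(0.5)) # assumed learnable path weight
        self.d1 = nn.Conv2d(cout, cout, 3, padding=1, dilation=1)
        self.d2 = nn.Conv2d(cout, cout, 3, padding=2, dilation=2)
        self.d3 = nn.Conv2d(cout, cout, 3, padding=3, dilation=3)
        self.bn = nn.BatchNorm2d(cout)
        self.dw = nn.Sequential(                     # depthwise separable + SiLU
            nn.Conv2d(cout, cout, 3, padding=1, groups=cout),
            nn.Conv2d(cout, cout, 1), nn.BatchNorm2d(cout), nn.SiLU())

    def forward(self, x):
        hi = F.pixel_shuffle(self.expand(x), 2)      # detail-preserving path
        lo = F.interpolate(self.proj(x), scale_factor=2,
                           mode="bilinear", align_corners=False)  # smooth path
        up = self.alpha * hi + (1 - self.alpha) * lo # dual-path weighted fusion
        ctx = self.bn(self.d1(up) + self.d2(up) + self.d3(up))    # Eq. (10)
        return self.dw(ctx)
```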

4. Experiment

4.1. Experimental Settings

To validate the efficacy of the proposed network for infrared small-target detection, experiments were conducted on the NUAA-SIRST [22] and IRSTD-1k [38] datasets. The NUAA-SIRST dataset consists of 427 infrared images from various real-world scenes, with image resolutions ranging from 96 × 135 pixels to 400 × 592 pixels. In addition to short- and medium-wavelength infrared images, NUAA-SIRST also includes infrared images at a wavelength of 950 nm [22]. The IRSTD-1k dataset is composed of 1001 real infrared images with a resolution of 512 × 512 pixels. The backgrounds in this dataset are diverse, including seas, clouds, cities, rivers, fields and mountains. The datasets were split into training and testing sets in a 4:1 proportion. During the preprocessing stage, data augmentation techniques such as random flipping, cropping, and Gaussian blurring were applied, and all images were normalized to a resolution of 256 × 256 pixels. This study adopts a segmentation network as the baseline, using a UNet with a ResNet-10 backbone and setting the number of down-sampling layers to three. The network was trained using the SoftIoU loss function, with the cosine annealing learning rate scheduler and the Adagrad optimizer used to enhance training stability and performance. The number of training epochs and the learning rate were set to 300 and 0.03, respectively. All models were implemented using PyTorch framework 2.4 on a computer equipped with an NVIDIA GeForce RTX 4090 GPU (NVIDIA, Santa Clara, CA, USA).
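The stated training configuration can be sketched as follows. The SoftIoU formulation shown is a common one and is assumed here (the paper does not spell it out), the data loader is a dummy stand-in for the augmented 256 × 256 pipeline described above, and `MEFANetSketch` refers to the structural sketch in Section 3.1.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

def soft_iou_loss(prob, target, eps=1e-6):
    """Assumed SoftIoU form: 1 - soft intersection / soft union."""
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = (prob + target - prob * target).sum(dim=(1, 2, 3))
    return (1 - (inter + eps) / (union + eps)).mean()

model = MEFANetSketch().to(device)            # structural sketch, Section 3.1
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.03)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

# Dummy stand-in for the augmented 256 x 256 training pipeline.
train_loader = [(torch.rand(4, 1, 256, 256),
                 (torch.rand(4, 1, 256, 256) > 0.995).float())]

for epoch in range(300):                      # 300 epochs, lr 0.03 (Section 4.1)
    for images, masks in train_loader:
        prob = model(images.to(device))       # model outputs sigmoid probabilities
        loss = soft_iou_loss(prob, masks.to(device))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```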

4.2. Evaluation Metrics

The model performance is evaluated by the intersection over union (IoU), the probability of detection ($P_d$) and the false alarm rate ($F_a$). Meanwhile, the model complexity and speed are evaluated by floating point operations (FLOPs), parameters (Params) and frames per second (FPS).
As a pixel-based evaluation metric, the intersection over union (IoU) is predominantly utilized to measure a model's capability to accurately match the shape of the target. It quantifies the overlap by computing the ratio of the intersection to the union between the predicted area and the ground truth. The IoU is formally defined as follows [24]:
$IoU = \frac{A_{inter}}{A_{union}}$, (11)
where $A_{inter}$ represents the intersection area between the prediction result and the true label, and $A_{union}$ represents their union area.
The probability of detection ($P_d$) is employed to measure the model's efficacy in correctly detecting targets. It represents the proportion of successfully detected targets to the total number of actual targets present in the imagery, and is defined as follows [24]:
$P_d = \frac{T_{correct}}{T_{all}}$, (12)
where $T_{correct}$ denotes the number of successfully detected targets, and $T_{all}$ represents the total number of actual targets. In this paper, whether a target is detected correctly is determined by comparing the centroid deviation with a predefined deviation threshold, set here to 3 pixels. Specifically, if the centroid deviation is less than 3 pixels, the target is classified as correctly detected.
False Alarm Rate ($F_a$): This metric is a key indicator of the detection algorithm's precision. It measures the proportion of falsely classified pixels relative to the total pixel count within the image. The false alarm rate is defined as follows [24]:
$F_a = \frac{N_{false}}{N_{all}}$, (13)
where $N_{false}$ is the number of falsely classified pixels, and $N_{all}$ is the total pixel count in the image.
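For concreteness, the three detection metrics can be computed per image as in the sketch below. The centroid-matching rule for $P_d$ follows the 3-pixel threshold stated above; the greedy one-to-one matching and the definition of false-alarm pixels as predicted positives outside the ground truth are our assumptions.

```python
import numpy as np
from scipy import ndimage

def evaluate(pred, gt, dist_thresh=3):
    """Per-image IoU, Pd and Fa for binary masks (sketch, assumptions noted)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    iou = np.logical_and(pred, gt).sum() / max(np.logical_or(pred, gt).sum(), 1)

    pred_lbl, n_pred = ndimage.label(pred)      # connected components = targets
    gt_lbl, n_gt = ndimage.label(gt)
    pred_cent = (ndimage.center_of_mass(pred, pred_lbl, range(1, n_pred + 1))
                 if n_pred else [])
    gt_cent = (ndimage.center_of_mass(gt, gt_lbl, range(1, n_gt + 1))
               if n_gt else [])
    matched, correct = set(), 0
    for g in gt_cent:                           # greedy one-to-one matching (assumed)
        for k, p in enumerate(pred_cent):
            if k not in matched and np.hypot(g[0] - p[0], g[1] - p[1]) < dist_thresh:
                matched.add(k)
                correct += 1
                break
    pd = correct / max(n_gt, 1)

    fa = np.logical_and(pred, ~gt).sum() / pred.size  # false pixels / all pixels
    return iou, pd, fa
```

Dataset-level values are then obtained by accumulating these counts over all test images before taking the ratios.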
Floating Point Operations (FLOPs) denote the total count of floating point operations executed during the inference or training process. FLOPs are an important metric for evaluating algorithm performance; higher FLOPs indicate increased network complexity. For reference, 1 GFLOP equals one billion floating point operations.
Parameters (Params) refer to the learnable weights and biases within the network. A larger number of parameters indicates a higher model complexity. In this context, 1 million (1M) parameters denote one million trainable variables.

4.3. Comparison with SOTA Methods

In recent years, data-driven infrared small-target detection algorithms have demonstrated significant advantages over model-driven approaches. Consequently, this paper concentrates on comparing data-driven methods. To comprehensively evaluate algorithm performance, this section presents a detailed comparative analysis using state-of-the-art (SOTA) infrared small-target detection algorithms on two public datasets: NUAA-SIRST and IRSTD-1k. The algorithms for this comparison include the traditional methods of FKRW [39], PSTNN [40] and the deep learning methods of UNet [30], ACM [22], ALCNet [23], ISTDU-Net [41], UIUNet [25], DNANet [24], and DMFNet [28].

4.3.1. Quantitative Results

The experimental results are presented in Table 2. Quantitative comparisons between our proposed MEFA-Net and the SOTA methods are provided for the NUAA-SIRST and IRSTD-1k datasets.
Specifically, on the NUAA-SIRST dataset, MEFA-Net achieved IoU improvements of 58.88%, 16.85%, 4.99%, 9.01%, 8.9%, 6.7%, 2.19%, 1.03% and 0.92% compared to FKRW, PSTNN, UNet, ACM, ALCNet, ISTDU-Net, UIUNet, DNANet, and DMFNet, respectively. On the IRSTD-1k dataset, MEFA-Net also demonstrated robust performance. This notable performance gain can be attributed to the designs of the DDCB, EAF, and EUB modules. MEFA-Net utilizes the DDCB module to selectively extract and retain infrared small-target features. The EAF module adaptively enhances cross-layer features, while the EUB module facilitates efficient decoding, simultaneously reducing the parameter count and computational complexity. These strategies promote the retention of the intrinsic features of infrared small targets at deeper network layers, enhancing the ability of the network to capture small-target salient features. The coordinated interactions among these three modules improve the segmentation performance of MEFA-Net. Compared to traditional methods such as FKRW and PSTNN, MEFA-Net achieves better detection performance with higher IoU and $P_d$ values and lower $F_a$ values. Compared to DNANet and DMFNet, MEFA-Net achieves more favorable metrics in terms of the IoU and false alarm rate, while maintaining similar detection rates. Moreover, MEFA-Net achieves an improved trade-off between detection performance and computational efficiency. Specifically, MEFA-Net requires fewer parameters (Params) and less computation (FLOPs), and delivers a faster inference speed (FPS), compared to the second-best performers, DMFNet and DNANet. These experimental results comprehensively demonstrate MEFA-Net's superiority, resulting in a substantial improvement in detection capabilities for infrared small targets.

4.3.2. Qualitative Comparison

The detection results of eight different data-driven methods on the NUAA-SIRST and IRSTD-1k datasets are illustrated in Figure 5 and Figure 6, respectively. Each infrared small target has been enlarged and displayed in the lower-left corner of the images to facilitate a clearer observation of target details. The red, blue, and yellow circles indicate correctly detected targets, missed detections, and false detections, respectively.
For the NUAA-SIRST dataset, image (a) features a complex background that includes sky, trees, and ground. Despite this complexity, most methods accurately detect the small bright targets, except for UNet and ALCNet, which produce false alarms. Images (b) and (c) contain predominantly bright backgrounds. In image (b), the targets are very small and submerged within the bright background; only our proposed method, ISTDU-Net, DNANet and DMFNet successfully detect the targets. In image (c), the targets are larger and brighter; all methods correctly identify the targets, but ALCNet, UIUNet, DNANet and DMFNet each generate one false alarm. Image (d) presents a relatively simple background with larger, brighter targets; however, ACM and DMFNet still produce a false alarm, and ISTDU-Net fails to detect the target. Although DNANet detects the target in this image, its shape is distorted, indicating some detection ambiguity.
For the IRSTD-1k dataset, the background of image (a) is relatively simple. Image (a) contains two bright targets and one faint target. All methods successfully detect the bright targets, and all detect the faint target except for ISTDU-Net and DMFNet, which fail to do so. The background of image (b) is more complex, involving water surfaces, trees, buildings, ground and other interference. Apart from our proposed method, UIUNet and DMFNet, the other methods tend to generate varying degrees of false alarms while detecting small targets. Similarly, the backgrounds of images (c) and (d) are quite complex. Apart from ISTDU-Net, which recognizes the small targets in image (c), only our proposed method and DMFNet can effectively detect the targets in images (c) and (d).
Based on the visual results from both the NUAA-SIRST and IRSTD-1k datasets, although DMFNet achieves a detection effect comparable to our proposed method, its parameter count and computational cost are considerably larger. In contrast, MEFA-Net demonstrates a stronger ability to adapt to scene variations and better detection performance with fewer parameters and lower computational costs.
Overall, our method demonstrates a robust detection performance across various scenarios. Additionally, it outperforms other models in reconstructing target shape features. This is due to the design of the DDCB module, a specialized encoding module tailored to infrared small targets, which extracts and preserves as much target information as possible during the down-sampling stage. The EAF module effectively suppresses interference from complex backgrounds, while the EUB module further reduces noise and enables the more precise recovery of small-target locations.

4.4. Ablation Study

To validate the effectiveness of the proposed method, we conducted ablation experiments on the more challenging IRSTD-1k dataset with complex backgrounds. We first assessed the effects of the DDCB, EAF, and EUB modules on the detection performance of the model. To do this, we removed the modifications within each module individually, thereby further verifying the contribution of each internal component to the overall performance. Specifically, for the DDCB module, we compared models employing different convolutional branches; for the EUB module, we evaluated models using a hybrid up-sampling mechanism alone, and then in combination with a multi-scale contextual perception mechanism. Throughout the ablation study, all other parts of the model were kept unchanged.

4.4.1. The Ablation Study on the MEFA-Net

Through ablation experiments, we verified the significant contributions of the DDCB, EAF, and EUB modules. The corresponding results are detailed in Table 3. An analysis of the metrics reveals that the DDCB module enhances the network’s feature extraction capabilities. Moreover, the EAF module and the inclusion of enhanced skip connections play a crucial role in preserving vital contextual information. Finally, the EUB module provides a notable improvement in the recovery of both the location and shape of small targets.
In Table 3, we observe that strategy (e) achieves a detection rate of 91.53%, which slightly outperforms MEFA-Net. A detailed analysis of the roles and interrelationships of each module reveals that the DDCB module’s pinwheel-shaped convolution enhances target edge responses, and the multi-scale features extracted strengthen target representation. The EAF module amplifies weak signals through feature enhancement, and the cross-layer fusion operation improves semantic consistency. The collaborative operation of these two modules results in an increased detection rate; however, the absence of the EUB module leads to coarse up-sampling, which can cause high-frequency background noise to be mistaken as targets, resulting in false alarms.
The synergy among the DDCB, EAF and EUB modules is essential for MEFA-Net to achieve efficient detection. The lightweight design of the EUB module reduces the model’s parameters and computational costs, partially alleviating the overfitting tendency of the DDCB module. The attention weights in the EAF module tend to be more balanced under multi-module cooperation, preventing excessive dependence on a single feature. The experimental results indicate that the synergistic effects among modules are not simply additive but involve complex performance trade-offs. The DDCB module extracts rich feature representations through multi-branch convolution, while the EAF module guides the output features of the DDCB module for adaptive fusion via attention mechanisms. The multi-scale features generated by the EAF module are transmitted to the EUB module via enhanced skip connections. The EUB module further optimizes the segmentation accuracy of small targets through a hybrid up-sampling strategy and multi-scale context-aware mechanisms. These three modules cooperate with each other to form an efficient feature extraction, fusion and decoding process, enabling MEFA-Net to achieve more accurate detection.
To further analyze the specific role of the EUB module in the overall network performance, we selected five representative infrared scenes from the IRSTD-1k dataset. Visualization analyses of the experimental results under strategies (e) and (f) are shown in Figure 7. In the figure, the target regions are magnified and positioned in the lower-left corner. Correct detections are marked with red dashed lines, false alarms with yellow dashed lines, and missed detections with blue dashed lines.
In image 1, the target is relatively weak; both strategies (e) and (f) are able to detect the target. Compared to the detection result of strategy (e), strategy (f) more accurately restores the true shape of the small target. In image 2, which contains two small targets, strategy (e) correctly detects the targets located on the edges of the image; however, the contours and edges of these small targets exhibit significant distortion compared to the ground truth. The detection results of images 1 and 2 demonstrate the EUB module’s capacity to perform the enhanced reconstruction of small-target shapes and contours.
In image 3, with a more complex background, strategy (e) detects a false target, showcasing the EUB module’s effectiveness in suppressing background noise in challenging environments. In image 4, the small target is situated against a dark, high-contrast background with interference; strategy (e) fails to recognize the small target under these conditions. In image 5, the small target appears brighter, and while strategy (e) successfully detects the target, it also mistakenly considers nearby bright tree branches as targets, leading to false alarms.
These visualization results highlight the ability of the EUB module to more accurately recover the shape and contours of small targets while effectively suppressing background noise in complex environments. They further confirm the superiority and necessity of the collaborative integration of the EUB module with the DDCB and EAF modules.
To further validate the efficiency of MEFA-Net, a convergence rate analysis based on the test loss curves is presented. As shown in Figure 8, MEFA-Net achieves significantly faster initial convergence than the baseline. In contrast, the baseline shows a slower decline, suggesting that it requires more epochs to achieve comparable progress. After epoch 150, MEFA-Net consistently achieves a lower test loss than the baseline. This analysis confirms that MEFA-Net not only converges faster but also ensures stable and accurate convergence. This is critical for practical applications, particularly in scenarios with limited computational resources or time constraints.

4.4.2. The Ablation Study on the DDCB Module

Compared to traditional convolution, the DDCB module integrates dilated convolution and pinwheel-shaped convolution to generate richer feature representations. To analyze the contribution of each convolution type within the DDCB module, ablation experiments were conducted; the results, including the IoU, $P_d$, and $F_a$ metrics, are shown in Table 4.
Specifically, strategy (a) does not incorporate the DDCB module, serving as the baseline. Strategy (b) removes the dilated convolution branch and pinwheel-shaped convolution branch from the DDCB module, while keeping other components unchanged. Compared to strategy (a), strategy (b) achieves improvements in both the IoU and detection rate, with a reduction in the false alarm rate, demonstrating the contribution of the attention mechanism within the DDCB module.
Strategies (c) and (d) progressively incorporate the dilated convolution branch and pinwheel-shaped convolution branch into the standard convolution pathway. The results indicate that both dilated and pinwheel-shaped convolutions enhance detection capabilities and further suppress false alarms compared to ordinary convolution. These findings clearly verify the effectiveness of the DDCB module.

4.4.3. The Ablation Study on the EUB Module

The EUB module combines a hybrid up-sampling strategy with a multi-scale context-aware mechanism. To isolate and evaluate the distinct contributions of these components, we performed an ablation study on the EUB module’s internal structure. The results of the ablation experiments are shown in Table 5. Strategy (a) represents the baseline network without the EUB module, utilizing standard bilinear interpolation for up-sampling in the decoder. Strategy (b) integrated the hybrid up-sampling strategy (a combination of PixelShuffle and bilinear interpolation) into the baseline, ablating the multi-scale context-aware mechanism of the EUB; all other module components were held constant. Relative to the baseline, strategy (b) achieved an improved detection performance and reduced false alarms, accompanied by a certain degree of reduction in the model parameters and computational complexity. Strategy (c) incorporated the complete EUB module, with both the hybrid up-sampling strategy and multi-scale context-aware mechanism, into the baseline network. In comparison to strategy (b), the multi-scale context-aware mechanism further refined small-target localization, providing additional false alarm suppression.

5. Discussion

Infrared small-target detection has long been plagued by inherent challenges, including the diminutive size of targets, low signal-to-noise ratios, and complex background interference. With the rapid advancements in deep learning, data-driven approaches have gradually superseded traditional model-driven methods. At present, most deep learning methods are based on the encoder–decoder architecture of U-Net. However, these methods still have certain limitations, including the lack of designs tailored to the intrinsic features of infrared small targets in the feature extraction stage, the neglect of the semantic mismatch between cross-layer features in the feature fusion stage, and the inability of traditional methods to achieve efficient and accurate decoding in the up-sampling stage. Compared with current infrared small-target detection algorithms, our proposed MEFA-Net effectively mitigates the sparse feature loss of small targets in deep networks through the design of the DDCB module. The EAF module, adapted from the field of medical image segmentation, enhances the correlation between semantic information across different layers. Furthermore, the EUB module, incorporating PixelShuffle, compensates for the limitations of traditional up-sampling methods. The three modules work in coordination, enhancing the reliability of the MEFA-Net network.
Although MEFA-Net demonstrates superior performance compared to alternative methodologies overall, it necessitates continued research and exploration. On the one hand, although our algorithm reduces the parameter count and computational demands compared to other state-of-the-art detection algorithms, there are still challenges in achieving real-time performance on embedded platforms with limited computing resources. On the other hand, our algorithm performs well on two public datasets, but its robustness in more diverse unknown scenarios, such as extreme weather conditions or different imaging systems, still needs further exploration.

6. Conclusions

This paper proposes a novel network architecture, MEFA-Net, for infrared small-target detection. During the feature extraction process, the DDCB module is designed to improve upon traditional single convolution operations, synergistically extracting the diverse features of infrared small targets to enhance the representation of sparse features. During the feature fusion process, by introducing the EAF module from the field of medical image segmentation, the semantic gap inherent in cross-level information fusion is alleviated through the adaptive fusion of deep and shallow features. Furthermore, the proposed EUB module enables the more precise localization of small targets and improves the overall segmentation performance of the model. The effectiveness of MEFA-Net is rigorously verified through comparisons with advanced methods and comprehensive ablation studies, and the contribution of each module is quantified. On the public datasets NUAA-SIRST and IRSTD-1k, MEFA-Net demonstrates superior performance, proving its efficiency and robustness.

Author Contributions

Conceptualization, J.M.; methodology, J.M.; validation, J.M.; formal analysis, J.M.; writing—original draft preparation, J.M.; writing—review and editing, J.M., N.P., D.Y., D.W. and J.Z.; visualization, J.M.; funding acquisition, N.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Youth Innovation Promotion Association of the Chinese Academy of Sciences (No. 2021376).

Data Availability Statement

The NUAA-SIRST dataset is available at https://github.com/YimianDai/sirst (accessed on 10 February 2025); the IRSTD-1k dataset is available at https://github.com/RuiZhang97/ISNet (accessed on 10 February 2025).

Conflicts of Interest

Authors Dengyu Yin and Di Wang were employed by the AVIC Chengdu Aircraft Design & Research Institute. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figure 1. Overall structure of MEFA-Net.
Figure 2. Illustration of the DDCB module.
Figure 3. Illustration of the EAF module.
Figure 4. (a) Illustration of the EUB module; (b) illustration of the residual block.
Figure 5. Visualization results on the NUAA-SIRST dataset. The enlarged view of each small target, placed in the lower-left corner of the image, uses red, blue, and yellow circles to indicate correctly detected targets, missed targets, and false detections, respectively.
Figure 6. Visualization results on the IRSTD-1k dataset. The enlarged view of each small target, placed in the lower-left corner of the image, uses red, blue, and yellow circles to indicate correctly detected targets, missed targets, and false detections, respectively.
Figure 7. Ablation study results of the EUB module on MEFA-Net. (a) Original image; (b) ground truth; (c) strategy (e)'s detection results; (d) strategy (f)'s detection results.
Figure 8. Convergence curves of MEFA-Net and the baseline.
Table 1. Details of MEFA-Net. H, W and C represent the vertical resolution, horizontal resolution and the number of channels of the feature map, respectively.

Type         | Output Size (H × W × C)
Input image  | 256 × 256 × 3
F_E1         | 256 × 256 × 16
F_E2         | 128 × 128 × 32
F_E3         | 64 × 64 × 64
F_E4         | 32 × 32 × 128
F_D3         | 64 × 64 × 64
F_D2         | 128 × 128 × 32
F_D1         | 256 × 256 × 16
Output image | 256 × 256 × 1
Table 2. Quantitative comparison with SOTA methods in FLOPs, Params, IoU, P_d, F_a and FPS. Each dataset block lists IoU (%) / P_d (%) / F_a (×10^-5).

Method    | FLOPs/G | Params/M | NUAA-SIRST: IoU / P_d / F_a | IRSTD-1k: IoU / P_d / F_a | FPS
FKRW      | -       | -        | 18.88 / 87.79 / 1.33        | 24.30 / 68.78 / 6.11      | -
PSTNN     | -       | -        | 60.91 / 88.86 / 10.75       | 30.60 / 81.34 / 93.03     | -
UNet      | 65.45   | 34.53    | 72.77 / 95.00 / 1.38        | 62.91 / 89.15 / 6.04      | 89.30
ACM       | 0.40    | 0.40     | 68.75 / 90.83 / 6.77        | 53.49 / 89.15 / 12.46     | 513.83
ALCNet    | 0.38    | 0.43     | 68.86 / 90.83 / 3.81        | 58.59 / 89.83 / 4.76      | 475.35
ISTDU-Net | 7.94    | 2.75     | 71.06 / 94.17 / 2.73        | 65.32 / 90.85 / 4.30      | 206.01
UIUNet    | 54.43   | 50.54    | 75.57 / 97.50 / 1.94        | 63.18 / 89.15 / 4.45      | 132.15
DNANet    | 14.26   | 4.50     | 76.73 / 97.50 / 1.49        | 65.84 / 91.19 / 4.05      | 162.24
DMFNet    | 40.59   | 10.91    | 76.84 / 98.33 / 0.53        | 66.82 / 89.15 / 1.07      | 157.60
MEFA-Net  | 10.47   | 2.49     | 77.76 / 98.33 / 0.23        | 67.58 / 91.19 / 1.06      | 226.17
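For reference, the IoU and F_a columns follow the pixel-level definitions standard in this literature; a minimal NumPy version is sketched below under that assumption. P_d additionally requires target-level matching (typically connected-component analysis with a centroid-distance threshold) and is omitted here.

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Pixel-level intersection over union between binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / union if union > 0 else 0.0

def false_alarm_rate(pred: np.ndarray, gt: np.ndarray) -> float:
    """F_a: falsely predicted target pixels over all image pixels
    (reported in multiples of 1e-5 in Table 2)."""
    false_pix = np.logical_and(pred, np.logical_not(gt)).sum()
    return float(false_pix) / pred.size
```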
Table 3. Ablation study on MEFA-Net. A check mark (✓) denotes that the corresponding module is enabled.

Strategy | DDCB | EAF | EUB | FLOPs/G | Params/M | IoU (%) | P_d (%) | F_a (×10^-5)
(a)      | -    | -   | -   | 3.82    | 0.53     | 65.33   | 88.14   | 2.15
(b)      | ✓    | -   | -   | 9.53    | 2.39     | 66.67   | 90.17   | 1.30
(c)      | -    | ✓   | -   | 5.18    | 0.68     | 66.66   | 89.15   | 1.45
(d)      | -    | -   | ✓   | 3.40    | 0.49     | 66.66   | 88.81   | 1.28
(e)      | ✓    | ✓   | -   | 10.89   | 2.53     | 66.88   | 91.53   | 1.23
(f)      | ✓    | ✓   | ✓   | 10.47   | 2.49     | 67.58   | 91.19   | 1.06
Table 4. Ablation study on the DDCB module. A check mark (✓) denotes that the corresponding branch is enabled.

Strategy | Conv | Dconv | Pconv | FLOPs/G | Params/M | IoU (%) | P_d (%) | F_a (×10^-5)
(a)      | -    | -     | -     | 3.82    | 0.53     | 65.33   | 88.14   | 2.15
(b)      | ✓    | -     | -     | 3.87    | 0.55     | 65.75   | 88.47   | 2.09
(c)      | ✓    | ✓     | -     | 6.02    | 1.26     | 65.80   | 89.15   | 1.48
(d)      | ✓    | ✓     | ✓     | 9.53    | 2.39     | 66.67   | 90.17   | 1.30
Table 5. Ablation study on the EUB module. A check mark (✓) denotes that the corresponding component is enabled.

Strategy | PixelShuffle + Upsample | Dilation | FLOPs/G | Params/M | IoU (%) | P_d (%) | F_a (×10^-5)
(a)      | -                       | -        | 3.82    | 0.53     | 65.33   | 88.14   | 2.15
(b)      | ✓                       | -        | 3.06    | 0.46     | 65.90   | 88.81   | 1.43
(c)      | ✓                       | ✓        | 3.40    | 0.49     | 66.66   | 88.81   | 1.28
