Article

DEMNet: Dual Encoder–Decoder Multi-Frame Infrared Small Target Detection Network with Motion Encoding

School of Electronic Information and Communication, Huazhong University of Science and Technology, Wuhan 430074, China
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Remote Sens. 2025, 17(17), 2963; https://doi.org/10.3390/rs17172963
Submission received: 19 June 2025 / Revised: 19 August 2025 / Accepted: 22 August 2025 / Published: 26 August 2025
(This article belongs to the Special Issue Recent Advances in Infrared Target Detection)

Abstract

Infrared dim and small target detection aims to accurately localize targets within complex backgrounds or clutter. However, under extremely low signal-to-noise ratio (SNR) conditions, single-frame detection methods often fail to effectively detect such targets. In contrast, multi-frame detection can exploit temporal cues to significantly improve the probability of detection (Pd) and reduce false alarms (Fa). Existing multi-frame approaches often employ 3D convolutions/RNNs to implicitly extract temporal features. However, they typically lack explicit modeling of target motion. To address this, we propose a Dual Encoder–Decoder Multi-Frame Infrared Small Target Detection Network with Motion Encoding (DEMNet) that explicitly incorporates motion information into the detection process. The first multi-level encoder–decoder module leverages spatial and channel attention mechanisms to fuse hierarchical features across multiple scales, enabling robust spatial feature extraction from each frame of the temporally aligned input sequence. The second encoder–decoder module encodes both inter-frame target motion and intra-frame target positional information, followed by 3D convolution to achieve effective motion information fusion. Extensive experiments demonstrate that DEMNet achieves state-of-the-art performance, outperforming recent advanced methods such as DTUM and SSTNet. For the DAUB dataset, compared to the second-best model, DEMNet improves Pd by 2.42 percentage points and reduces Fa by 4.13 × 10−6 (a 68.72% reduction). For the NUDT dataset, it improves Pd by 1.68 percentage points and reduces Fa by 0.67 × 10−6 (a 7.26% reduction) compared to the next-best model. Notably, DEMNet demonstrates even greater advantages on test sequences with SNR ≤ 3.

1. Introduction

Infrared thermal imaging systems passively receive infrared radiation from scenes, offering advantages such as good concealment, high angular resolution, and strong anti-interference capabilities. Unlike visible light imaging systems, infrared systems can operate in all weather conditions, making them widely used in fields such as maritime rescue, military reconnaissance, and missile guidance. Among its related technologies, infrared small target detection has attracted widespread attention due to its ability to accurately locate targets of interest within an image. However, infrared targets are typically small in size, lack well-defined texture, possess limited shape and contour information, and are often embedded in low signal-to-noise ratio (SNR) environments. These challenges make it difficult to distinguish such targets from background clutter and noise, significantly complicating the detection task.

1.1. Related Works

1.1.1. Single-Frame Methods

Early methods for infrared small target detection were based on the assumption that the background of the image remained static or exhibited only minor variations, with background pixels showing strong correlation and similarity. Under this assumption, filter-based techniques were widely adopted [1]. However, the presence of prominent edges in the background can disrupt this correlation. To address such issues, some researchers proposed transforming the image from the spatial domain to various transform domains [2], such as the Fourier transform and gradient vector fields.
With the rise of saliency detection techniques in computer vision, infrared small target detection methods inspired by the human visual attention mechanism have also emerged. In 2012, Shao et al. proposed an approach based on the contrast mechanism of the human visual system (HVS), where a Laplacian of Gaussian (LoG) filter was used to suppress noise and enhance target intensity, thereby improving detection performance [3]. Also in 2012, Qi et al. introduced a saliency-based region detection method incorporating attention mechanisms to detect small infrared targets in complex backgrounds [4]. In 2013, Gao et al. developed an adaptive infrared patch-image model constructed from local patches, which was used to segment targets and suppress various types of clutter interference [5].
In addition, in 2013, Chen et al. proposed an algorithm based on the contrast mechanism of the HVS and a derived kernel model, in which local contrast and adaptive thresholding were employed to segment targets [6]. In 2014, Han et al. introduced a thresholding and rapid traversal method based on the attention shift mechanism of the HVS for fast target acquisition [7]. In 2018, Zhang et al. calculated a local intensity and gradient (LIG) map from the original infrared image to enhance targets while suppressing clutter [8]. In the same year, Moradi et al. modeled point targets using multi-scale average absolute gray difference (AAGD) and a Laplacian of point spread function (LoPSF) to reduce false alarm rates [9]. Although these methods significantly improve detection performance, they still struggle to handle complex and dynamically changing background scenarios.
In recent years, deep learning methods have been widely applied to various visual tasks. Dai et al. proposed asymmetric context modulation (ACM), which integrates high-level semantic information and low-level positional details through a comprehensive top-down and bottom-up attention modulation pathway [10]. In their subsequent work, they introduced the attention local contrast network (ALCNet), which refines features based on the idea of local contrast, particularly targeting small targets [11]. Furthermore, the internal attention-aware network (IAANet) first adopts a region proposal network (RPN) to generate coarse target regions, and then pixel-level self-attention computation is applied to these proposed regions to obtain attention-aware features [12]. Wu et al. designed UIUNet, in which micro-UNet modules are embedded within a UNet architecture to learn multi-level and multi-scale features of infrared images [13]. The dense nested attention network (DNANet) extracts spatial features at multiple scales through a feature pyramid structure and effectively fuses them via densely connected skip pathways, achieving outstanding detection performance [14]. Liu et al. incorporated the Transformer architecture into infrared small target detection, yielding promising results [15]. He et al. employed discrete wavelet transform (DWT) and inverse DWT (IDWT) to extract and fuse frequency-domain and spatial-domain features, enhancing detection accuracy on public datasets [16]. Liu et al. also proposed MSHNet, a network capable of capturing multi-scale spatial location information. Combined with a novel scale and location sensitive loss function, their method achieved superior detection results [17]. IRMamba introduced an innovative infrared small target detection model that integrates pixel-difference attention and layer restoration techniques within a Mamba architecture, achieving state-of-the-art performance by dynamically capturing subtle thermal variations while preserving contextual coherence [18].
Incorporating human expert knowledge can significantly enhance data-driven methods. LCAE-Net proposed a novel method that strategically combines local contrast amplification with domain-specific prior knowledge to achieve higher detection accuracy in cluttered infrared scenes, particularly for dim and small targets [19]. TCI-Former proposed a thermal conduction-inspired Transformer architecture that mimics heat diffusion dynamics to enhance infrared small target detection through adaptive feature aggregation and noise suppression [20]. CSENet incorporates shape information into the model learning [21]. Liu C et al. introduced a prior-guided dense nested network for infrared small target detection, leveraging multi-scale feature fusion and prior knowledge constraints to enhance detection accuracy in cluttered backgrounds while maintaining real-time processing efficiency [22].
Recently, transfer learning methods and language models have been continuously evolving. Chib et al. utilized language models as prior knowledge to guide attention weights, enhancing the model’s detection capabilities and demonstrating broad development prospects [23]. SAIST leverages contrastive language–image pretraining (CLIP) to enhance detection accuracy for tiny thermal targets by fusing cross-modal semantic guidance and dynamic scale adaptation, achieving real-time inference with a lightweight architecture [24]. IRSAM adapts the Segment Anything Model (SAM) for infrared small target detection by introducing thermal-aware feature enhancement and domain-specific prompt engineering to improve precision in low-contrast scenarios [25].
To summarize, single-frame detection methods, whether model-driven traditional approaches or data-driven deep learning techniques, often fail to deliver satisfactory results in scenarios involving strong interfering targets or adverse imaging conditions characterized by low signal-to-noise ratios, heavy clutter, and significant noise. In such challenging environments, the detection of infrared small targets typically requires richer feature information for accurate identification. Therefore, multi-frame detection algorithms that are capable of capturing temporal contextual information are better suited for handling the aforementioned complex scenarios.

1.1.2. Multi-Frame Methods

In the early stages, traditional multi-frame infrared small target detection algorithms primarily relied on target motion characteristics [26] or differences between adjacent frames to facilitate detection [27]. Subsequently, some methods attempted to extend single-frame detection algorithms into the multi-frame domain, achieving notable improvements. For example, the spatiotemporal local contrast filter (STLCF) [28] and the spatiotemporal local difference measure (STLDM) [29] expanded 2D operators into 3D spatiotemporal operators to compute information across the current and historical frames. These outputs were then fused with temporal saliency features to extract target locations. In 2021, the multi-subspace learning and spatiotemporal patch tensor model (MSLSTIPT) was proposed [30], which similarly extends 2D spatial low-rank decomposition into the 3D spatiotemporal domain for small target detection in multi-frame infrared image sequences. In more recent studies, Wu et al. proposed a 4D tensor model that processes a sequence of infrared images and utilizes tensor train and its extension, tensor ring, to decompose them into 4D tensors [31]. Li et al. introduced a twist tensor model based on sparse regularization, which enhances the contrast between targets and background for more effective small target detection [32]. However, such approaches often rely heavily on scene-specific prior knowledge and make relatively narrow assumptions about the target types, leading to limited robustness. As a result, detection performance tends to degrade significantly when there are scene changes or variations in target characteristics.
In recent years, with the continuous advancement of deep learning algorithms and the release of infrared sequence image datasets [33,34], deep learning-based multi-frame infrared small target detection methods have gradually become mainstream. The earliest approaches simply extended single-frame detection methods to sequential images by applying single-frame detection to each frame independently [35,36]. Although this strategy enabled processing of consecutive frames, it failed to leverage temporal contextual information and could not fully exploit the advantages of multi-frame analysis. Subsequent research explored the use of motion cues from preceding and succeeding frames to perform super-resolution enhancement on the current frame, thereby improving the detail and visibility of dim infrared targets [37,38]. While this technique can enhance detection performance, it typically results in significant computational overhead due to the resolution upscaling process, which poses challenges for real-time applications. Later, Wang et al. proposed a network incorporating a spatiotemporal multi-scale feature extractor module, which captures multi-scale spatial information across frames in the temporal dimension. This approach significantly improved detection accuracy and fully exploited the advantages of multi-frame detection [39]. As a result, most subsequent studies have adopted similar strategies.
In 2023, Li et al. proposed an effective direction-coded temporal U-shape module in multi-frame detection [33]. In 2024, Chen et al. extended the use of ConvLSTM-based cross-spatiotemporal slice node feature processing to the field of infrared small target detection [40]. Duan et al. introduced frequency information by applying Fourier and inverse Fourier transforms to infrared images, achieving excellent detection performance [41]. Ying et al. enhanced target features by emphasizing specific frequency-domain components and incorporated a feature recurrence framework, yielding promising results on a self-constructed satellite dataset [42]. In 2025, Liu et al. proposed a long-term optical flow-based motion pattern extractor module that improves upon traditional optical flow methods by leveraging motion information for target detection [43]. Peng et al. adapted a dual-branch parallel feature extraction approach from the video domain to infrared small target detection, where one branch extracts global features while the other focuses on key frame features; the fusion of these features achieved competitive performance with low computational cost [44]. Zhu et al. incorporated a Transformer-based attention mechanism to encode and fuse spatiotemporal and channel-wise features for target detection [45]. Zhang et al. utilized the consistent motion direction and strong inter-frame correlation of infrared small targets by designing a spatial saliency feature generation module, which was fused across the temporal dimension for final detection [46]. Similarly, Zhu et al. adopted a parallel dual-branch feature extraction strategy, where spatial features assist in enhancing temporal motion cues, and proposed a complementary symmetric weighting module to fuse spatiotemporal features, resulting in outstanding detection performance [47]. MOCID designed a dual-branch neural network that jointly learns motion context and displacement features for robust moving infrared small target detection, effectively addressing challenges in dynamic clutter suppression and trajectory continuity [48].

1.2. Motivation

When detecting dim infrared targets in complex environments, the low contrast between the target and the background often renders single-frame infrared small target detection algorithms ineffective, either missing the target or generating a large number of false alarms. In such scenarios, multi-frame infrared small target detection algorithms can leverage temporal contextual information by perceiving differences across consecutive frames, thereby achieving better detection performance. However, despite the improvements over single-frame methods, existing multi-frame detection approaches still suffer from several limitations:
  • Although these methods adopt various spatiotemporal feature fusion strategies, most rely on implicit extraction via 3D convolutions or attention mechanisms, and they often overlook the motion consistency of the target. Li et al. [33] proposed an explicit encoding method that maps the target’s position within each frame to model its motion features. However, the motion encoding strategy is insufficiently developed and fails to capture the relative positional relationships of the target across frames, leaving substantial room for refinement in motion representation.
  • Current methods usually integrate spatial and temporal feature extraction into a single encoder–decoder architecture, in which the tight coupling of the two processes may impede performance.
  • The commonly used false alarm rate (Fa) metric has limitations. Fa is defined as the ratio of non-target pixels incorrectly predicted as targets to the total number of pixels in the image, which fails to intuitively reflect the model’s ability to suppress target-level false positives. In practical applications, detection results are typically processed on a per-target basis rather than per-pixel.
This paper proposes a novel Dual Encoder-Decoder Multi-Frame Infrared Small Target Detection Network with motion encoding, termed DEMNet. The proposed method first employs a spatial feature extractor module, which uses an encoder–decoder structure to extract multi-scale spatial features from input images. Then, a motion information encoder–decoder module is used to map and reconstruct the motion characteristics of the target, thereby capturing rich temporal contextual information. A multi-stage decoder and prediction head are used to generate the target prediction map for the last frame. This process effectively enhances the representation of dim targets embedded in complex backgrounds.
The main contributions of this paper are summarized as follows:
(1)
A dual encoder–decoder multi-frame infrared small target detection network, DEMNet, was proposed. The network integrates spatial and temporal contextual features and employs end-to-end learning to enhance the representation of dim and small targets under complex backgrounds.
(2)
Based on the motion consistency of infrared targets, a motion encoding strategy was introduced. It consists of inter-frame motion encoding and intra-frame location encoding to explicitly capture spatiotemporal motion characteristics and improve temporal feature utilization.
(3)
A target-level false alarm evaluation metric, FaT, was proposed to address the limitations of pixel-level metrics. FaT evaluates false alarms at the object level, providing a more intuitive and accurate assessment of the model’s false alarm suppression ability in practical scenarios.
(4)
Experimental results on the DAUB [34] and NUDT-MIRSDT [33] datasets demonstrate that the proposed DEMNet significantly outperforms existing state-of-the-art methods, particularly in detecting dim targets with low signal-to-noise ratios.

2. Methods

This section presents the implementation details of DEMNet. The main components of the proposed model are introduced in detail, including the design of the spatial feature extractor module (SFEM) and the motion information encoder–decoder module (MIEM). Furthermore, the specific implementation of the proposed algorithm on input image sequences is described.

2.1. Overall Architecture

The overall architecture of DEMNet is illustrated in Figure 1. It consists of three major components: spatial feature extractor, motion information encoder–decoder, and a prediction head composed of 3D convolution, batch normalization, and activation functions. All the operation blocks in Figure 1 are detailed in the legend.
The input to the network is a sequence of consecutive infrared frames denoted as $I \in \mathbb{R}^{1 \times T \times H \times W}$, where T denotes the length of the input sequence (5 frames are used in this work). Firstly, all previous frames in the input sequence are aligned with the last frame individually. The image alignment operation consists of four steps: detecting feature points, matching feature points, computing the homography matrix, and performing image transformation. Each aligned frame is fed sequentially into the spatial feature extractor module. The resulting feature maps are then concatenated along the temporal dimension to produce a feature map $G \in \mathbb{R}^{C \times T \times H \times W}$, where C = 32 denotes the number of channels.
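For illustration, a minimal sketch of the four-step alignment procedure described above is given below, using OpenCV; the choice of ORB features, brute-force matching, and RANSAC-based homography estimation is an assumption for demonstration, as the paper does not specify the detector or estimator.

```python
import cv2
import numpy as np

def align_to_last(prev_frame: np.ndarray, last_frame: np.ndarray) -> np.ndarray:
    """Warp prev_frame onto last_frame: detect, match, estimate homography, transform.
    Frames are assumed to be 8-bit single-channel images."""
    orb = cv2.ORB_create(500)                                   # 1. detect feature points
    kp1, des1 = orb.detectAndCompute(prev_frame, None)
    kp2, des2 = orb.detectAndCompute(last_frame, None)
    if des1 is None or des2 is None:
        return prev_frame                                       # nothing to match, keep frame as-is
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)   # 2. match feature points
    if len(matches) < 4:
        return prev_frame
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)        # 3. compute homography matrix
    if H is None:
        return prev_frame
    h, w = last_frame.shape[:2]
    return cv2.warpPerspective(prev_frame, H, (w, h))           # 4. image transformation
```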
Subsequently, $G \in \mathbb{R}^{C \times T \times H \times W}$ is passed into the multi-stage motion information encoder (MI Encoder) to extract motion features across multiple frames. These features are then progressively decoded through a multi-stage decoder, where they are fused with shallow features at each stage.
Finally, the output feature map from the highest-level decoder is concatenated with the last frame of $G \in \mathbb{R}^{C \times T \times H \times W}$, and this fused representation is passed through the prediction head to generate the final detection map of infrared small targets for the last frame.

2.2. Spatial Feature Extractor Module

First, each frame of the input image sequence $I \in \mathbb{R}^{1 \times T \times H \times W}$ is processed through a double convolution to expand the channel number, yielding $R_t \in \mathbb{R}^{32 \times 1 \times H \times W}$ ($t = 1, 2, \ldots, T$). $R_t$ is then fed into multi-level encoders to extract feature maps that contain multi-scale spatial information. These features are subsequently passed through a multi-level decoder for feature fusion and decoding. Finally, the decoded features from all T frames are concatenated to form the output feature map $G \in \mathbb{R}^{C \times T \times H \times W}$. The details of the encoder and decoder are given as follows.

2.2.1. Encoder of SFEM

The encoder uses 2 × 2 average pooling to obtain the global low-frequency information of the input feature map, then the features go through a scale layer and a double convolution layer. Its structure is shown in Figure 2.
At the encoder of level m, the output feature map of the average pooling layer has a size of $2^{m-1}C \times 1 \times \frac{H}{2^m} \times \frac{W}{2^m}$. It is subsequently multiplied by the scalar 2 and then processed by two convolutional layers, each followed by batch normalization and activation, resulting in a refined feature map P of size $2^{m}C \times 1 \times \frac{H}{2^m} \times \frac{W}{2^m}$. This output is forwarded to the encoder at the next level and also to the decoder at level (m + 1) for further processing. The final level of the encoder is slightly different: the feature map P is only transmitted laterally, without being passed down to the next encoder level.
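A PyTorch sketch of one encoder level under the description above (2 × 2 average pooling, a scale layer that multiplies by 2, then a double convolution with BN and activation); the 3 × 3 kernel size and exact channel widths are assumptions.

```python
import torch
import torch.nn as nn

class SFEMEncoderLevel(nn.Module):
    """One SFEM encoder level: 2x2 average pooling -> scale by 2 -> double convolution."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)    # global low-frequency information
        self.double_conv = nn.Sequential(                    # each conv followed by BN and activation
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.double_conv(self.pool(x) * 2.0)          # scale layer: multiply by 2

# level m doubles the channel count while halving the spatial resolution, e.g.:
# feats = SFEMEncoderLevel(in_ch=32, out_ch=64)(torch.randn(1, 32, 256, 256))
```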

2.2.2. Decoder of SFEM

The decoder integrates both channel attention and pixel attention mechanisms to highlight critical features. The output features from different decoding levels are resized and undergo complex top-down and bottom-up information propagation and fusion. Ultimately, a single-frame feature map enriched with multiscale spatial information is obtained, as illustrated in Figure 3.
The pixel attention and channel attention of the received feature maps P and F are calculated and cross-multiplied to get the high-level semantic feature map output containing rich context information. The whole process is shown in Equations (1)–(8) as follows:
$$F_{up} = \mathrm{Upsample}(F) \qquad (1)$$
$$F_{mid1} = \mathrm{AvgPool}(\mathrm{Conv}(F_{up})) \qquad (2)$$
$$F_{mid2} = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(F_{mid1}))) \qquad (3)$$
$$F_{CA} = \mathrm{Sigmoid}(\mathrm{BN}(\mathrm{Conv}(F_{mid2}))) \qquad (4)$$
$$P_{mid} = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(P))) \qquad (5)$$
$$P_{PA} = \mathrm{Sigmoid}(\mathrm{BN}(\mathrm{Conv}(P_{mid}))) \qquad (6)$$
$$G_{mix} = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(F_{conv} \otimes P_{PA} + P \otimes F_{CA}))) \qquad (7)$$
$$\mathrm{Output} = \mathrm{DoubleConv}(G_{mix}) \qquad (8)$$
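A PyTorch sketch of Equations (1)–(8) is given below; it assumes the upsampled deep feature F and the lateral feature P have already been brought to the same channel width, and the kernel sizes are assumptions (BN is omitted in the 1 × 1 channel-attention branch of this sketch).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SFEMDecoderLevel(nn.Module):
    """Sketch of Eqs. (1)-(8): channel attention derived from the upsampled deep feature,
    pixel attention derived from the lateral feature P, cross-multiplied and fused."""
    def __init__(self, ch: int):
        super().__init__()
        self.conv_f = nn.Conv2d(ch, ch, 3, padding=1)
        # channel attention branch, Eqs. (2)-(4) (BN layers omitted for the 1x1 maps)
        self.ca = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(ch, ch, 1), nn.Sigmoid())
        # pixel attention branch, Eqs. (5)-(6)
        self.pa = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
                                nn.ReLU(inplace=True),
                                nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
                                nn.Sigmoid())
        self.fuse = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
                                  nn.ReLU(inplace=True))        # Eq. (7)
        self.double_conv = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
                                         nn.ReLU(inplace=True),
                                         nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
                                         nn.ReLU(inplace=True))  # Eq. (8)

    def forward(self, P: torch.Tensor, F_deep: torch.Tensor) -> torch.Tensor:
        F_up = F.interpolate(F_deep, size=P.shape[-2:], mode='bilinear',
                             align_corners=False)               # Eq. (1)
        F_conv = self.conv_f(F_up)
        F_ca = self.ca(F.adaptive_avg_pool2d(F_conv, 1))        # channel attention weights
        P_pa = self.pa(P)                                       # pixel attention weights
        G_mix = self.fuse(F_conv * P_pa + P * F_ca)             # cross-multiplication, Eq. (7)
        return self.double_conv(G_mix)                          # Eq. (8)
```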

2.3. Motion Information Encoder Module

Based on the assumption of motion consistency of targets, an inter-frame motion encoding module and an intra-frame positional encoding module are proposed, from which a motion information encoder (MI Encoder) is constructed. The feature maps obtained from the spatial feature extractor are concatenated along the temporal dimension and fed into the MI Encoder. This encoder comprises the inter-frame motion encoding module, the intra-frame positional encoding module, a 3D max pooling layer, 3D convolutional layers, a batch normalization (BN) layer, and a ReLU activation layer. It effectively encodes the motion information of the target and fuses it with multiscale spatial features, yielding feature maps that contain both motion and spatial information of the target, as illustrated in Figure 4. Note that the indices (the positions of the maxima) of all max pooling regions are sent to the intra-frame and inter-frame encoding modules, as sketched below.
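The pooling indices that drive the two encoding modules can be obtained directly from PyTorch's 3D max pooling, as in the short sketch below (pooling over the spatial dimensions only; tensor sizes follow the paper's configuration).

```python
import torch
import torch.nn as nn

# Pool only over the spatial dimensions of a (B, C, T, H, W) feature map and keep the
# flat indices of the maxima; these indices drive both encoding modules.
pool = nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2), return_indices=True)

feat = torch.randn(1, 32, 5, 256, 256)        # B x C x T x H x W, T = 5 frames
pooled, indices = pool(feat)                  # pooled: 1 x 32 x 5 x 128 x 128

# The flat indices refer to positions within the T*H*W volume of each channel;
# column (x) and row (y) coordinates in the original H x W grid are recovered by:
H, W = feat.shape[-2:]
x = indices % W
y = (indices // W) % H
```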

2.3.1. Inter-Frame Motion Encoding Module

The inter-frame motion encoding module encodes the directional relationship of the maximum value in each 2 × 2 pooling region of past frames relative to the current frame, capturing the target’s motion characteristics through variations in directional relationships. As illustrated in Figure 5, in a sequence of consecutive frames, it is assumed that both targets and clutter appear as the brightest regions within their local neighborhoods. However, the motion patterns of targets and clutter differ significantly. Typically, targets move along specific, regular trajectories, while clutter tends to move randomly.
Let $V_p^{i,t}$ denote the directional mapping between the maximum pixel in each pooling region of the past frame $F_{t-i}$ and the corresponding position in the current frame $F_t$. The average of the mapping values across all past frames is defined as P, as expressed in Equation (9):
$$P = \frac{\sum_{i=1}^{t-1} V_p^{i,t}}{t} \qquad (9)$$
where $V_p^{i,t}$ is determined by the horizontal positional mapping $x_i^t$ and the vertical positional mapping $y_i^t$. $x_i^t$ and $y_i^t$ are derived based on the relative directional relationship between the location of the maximum pixel in the pooling region of the past frame and its corresponding position in the current frame. If the maximum pixel in the past frame is located to the left of or above its position in the current frame, $x_i^t$ and $y_i^t$ are set to 1, respectively. If the horizontal or vertical position remains unchanged, the values are set to 0. If the direction is to the right or below, $x_i^t$ and $y_i^t$ are set to −1, respectively. The definition of $V_p^{i,t}$ is then given in Equation (10):
$$V_p^{i,t} = \begin{cases} \sqrt{x_i^t + \alpha y_i^t}, & x_i^t + \alpha y_i^t \geq 0 \\ -\sqrt{-\left(x_i^t + \alpha y_i^t\right)}, & x_i^t + \alpha y_i^t < 0 \end{cases} \qquad (10)$$
$V_p^{i,t}$ has the following properties:
(1)
When the position of the maximum pixel within the pooling region remains in the same direction across frames, the sign remains unchanged; when the direction changes, the sign is reversed.
(2)
Setting α = 0.8 ensures that the mapping values are unique under all nine possible directional relationships between the positions of the maximum pixel in the past and current frames.
(3)
A square root operation is applied in $V_p^{i,t}$ to avoid conflicts in which the summed mapping values are the same across different previous frames even though the motion trajectories differ. For example, in one pooling region the directional encodings for two previous frames may both be 1, while in another region the encodings for the same two frames may be 0 and 2, respectively; without the square root operation, the two summations would be identical and could not differentiate the motion trajectories.
The specific computation process of V p i t is shown in the pseudo code of Algorithm 1.
Algorithm 1: Inter-Frame Motion Encoding
Input: Feature map F
Output: Mapping value vector V
1: (F_pool, index) = 3DMaxPooling(F)
2: index_x(i) = index(i) % W   // index_x ∈ [0, W)
3: index_y(i) = index(i) // W  // index_y ∈ [0, H)
4: for i in range(T − 1):
  if index_x(i) < index_x(t): x_i^t = −1
  if index_x(i) = index_x(t): x_i^t = 0
  if index_x(i) > index_x(t): x_i^t = 1   // horizontal direction encoding
5: for i in range(T − 1):
  if index_y(i) < index_y(t): y_i^t = −1
  if index_y(i) = index_y(t): y_i^t = 0
  if index_y(i) > index_y(t): y_i^t = 1   // vertical direction encoding
6: for i in range(T − 1):
  v(i) = sqrt(x_i^t + α·y_i^t) if (x_i^t + α·y_i^t) ≥ 0, else −sqrt(−(x_i^t + α·y_i^t))   // α = 0.8
  V += v(i)
7: V = V / (t − 1)
8: return V
It is assumed that the target pixel corresponds to the maximum value within each MaxPooling region, and that the target moves in an arbitrary pattern across multiple frames. For the background, the location of the maximum value in each MaxPooling region is randomly distributed. After image alignment of the consecutive multi-frame inputs, the background regions remain unchanged across frames. As shown in Figure 6a, from frame t − 4 to frame t, the 2 × 2 MaxPooling regions traversed by the target’s motion are denoted as A, B, C, D, and E, respectively.
Inter-frame encoding is performed independently within each pooling region, so the target’s motion trajectory across frames can follow an arbitrary pattern, without any assumption of linearity or nonlinearity. Figure 6b illustrates the positions of the MaxPooling maxima in region A over five consecutive frames. The symbol ◁ denotes the target’s location within region A in frame t − 4, whereas the symbol ◇ denotes the position of the maximum value in region A when the target is absent. According to Equation (10), for frame t − 4 the value is $(x_i^t = 1, y_i^t = 1)$, and for the other frames it is $(x_i^t = 0, y_i^t = 0)$. Based on Equation (9), $P = 0.2$. The calculations for positions B–E follow the same procedure as that for position A. From the computation process for position A, it can be inferred that if the target passes through this region within five consecutive frames, the inter-frame encoding has a 3/4 probability of producing a nonzero value, and a 1/4 probability of producing a zero value when the target position coincides with the background’s maximum value location. Therefore, under multi-frame conditions with target motion, inter-frame encoding is highly likely to provide markers indicating that the target has passed through a given position within five frames, thereby establishing a reliable basis for subsequent target feature extraction and false-alarm suppression.
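A vectorized PyTorch sketch of Algorithm 1 is given below; it operates on the flat indices returned by 3D max pooling, follows the sign convention and the averaging step of the pseudocode, and replaces the per-region loops with broadcast operations (an implementation choice, not necessarily the authors' code).

```python
import torch

ALPHA = 0.8  # weighting of the vertical direction, as in Algorithm 1

def inter_frame_motion_encoding(indices: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Sketch of Algorithm 1 on flat MaxPool3d indices of shape (B, C, T, H/2, W/2).

    H and W are the spatial dimensions of the feature map before pooling. Returns a
    (B, C, 1, H/2, W/2) map of averaged directional codes for each pooling region.
    """
    x = indices % W                 # column of each regional maximum
    y = (indices // W) % H          # row of each regional maximum
    x_cur, y_cur = x[:, :, -1:], y[:, :, -1:]      # current frame t
    x_past, y_past = x[:, :, :-1], y[:, :, :-1]    # past frames 1 .. t-1

    # -1 / 0 / +1 directional codes relative to the current frame
    # (sign convention follows the pseudocode above)
    dx = torch.sign(x_past - x_cur).float()
    dy = torch.sign(y_past - y_cur).float()
    s = dx + ALPHA * dy
    v = torch.sign(s) * torch.sqrt(s.abs())        # signed square root, Eq. (10)
    return v.sum(dim=2, keepdim=True) / x_past.shape[2]   # average over past frames (step 7)
```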

2.3.2. Intra-Frame Positional Encoding Module

The intra-frame positional encoding module encodes the position of the maximum value within each pooling region of a single frame, capturing the motion characteristics of the target based on the positional patterns of maximum values across consecutive frames. As shown in Figure 7, for the four pooling regions within the 4 × 4 area enclosed by the orange dashed box in the feature map, when a target passes through this box over several consecutive frames, both the maximum value and its location within each pooling region change in each frame. The variations in the maximum value and its index position caused by target motion exhibit more regular patterns, whereas those induced by clutter signals or background regions tend to be more random.
Therefore, by encoding the index positions of the maximum values during pooling in each frame, the model can perceive the intra-frame variations caused by target motion, distinguish them from the irregular patterns of clutter or background, and thus improve detection performance. Specifically, the encoding process of the intra-frame location information encoding module is presented in Algorithm 2.
Algorithm 2: Intra-Frame Positional Encoding Module
Input: Feature map F
Output: Mapping value vector D
1: (F_pool, index) = 3DMaxPooling(F)   // kernel = (1, 2, 2), stride = (1, 2, 2)
2: index_x(i) = index(i) % 2           // 0 or 1: left or right half of the pooling region
3: index_y(i) = (index(i) // W) % 2    // 0 or 1: upper or lower half of the pooling region
4: for i in range(T):
  D(i) = 1.25 + (index_x(i) + 2·index_y(i)) / 4   // the codes of the four positions are 1.25, 1.5, 1.75, 2.0
5: return D
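A corresponding PyTorch sketch of Algorithm 2, again operating on the flat max pooling indices; the even feature-map width implied by step 2 is assumed.

```python
import torch

def intra_frame_positional_encoding(indices: torch.Tensor, W: int) -> torch.Tensor:
    """Sketch of Algorithm 2 on flat MaxPool3d indices of shape (B, C, T, H/2, W/2).

    Each 2x2 pooling region is encoded by the position of its maximum:
    1.25 (upper-left), 1.5 (upper-right), 1.75 (lower-left), 2.0 (lower-right).
    """
    ix = indices % 2               # 0 = left column, 1 = right column (step 2; assumes even W)
    iy = (indices // W) % 2        # 0 = upper row,  1 = lower row of the region (step 3)
    return 1.25 + (ix + 2 * iy).float() / 4
```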

2.4. Motion Information Decoder Module

The motion information decoder performs hierarchical decoding on the received features, as shown in Figure 8.
In the decoder module, the decoder at level n receives deep features passed upward from the decoder at the lower level (the lowest-level decoder receives features from the corresponding encoder at the same level). After upsampling, these features are concatenated along the channel dimension with the shallow features from the encoder at level n − 1, followed by a 3D convolution, BN, and a ReLU activation layer. The output is then passed to the upper-level decoder. The top-level decoder differs slightly in its 3D convolution module: the temporal padding is set to 0, and the output feature map has the shape $B \times C \times 1 \times H \times W$.
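A PyTorch sketch of one decoder level as described above (upsample the deep features, concatenate with the shallow features along the channel dimension, then 3D convolution, BN, and ReLU); the channel widths and the 3 × 3 × 3 kernel are assumptions, and the zero temporal padding of the top level is exposed as a flag.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MIDecoderLevel(nn.Module):
    """One level of the motion information decoder: upsample deep features,
    concatenate with shallow encoder features, then 3D Conv + BN + ReLU."""
    def __init__(self, deep_ch: int, shallow_ch: int, out_ch: int, top_level: bool = False):
        super().__init__()
        # the top-level decoder uses zero temporal padding, so repeated application
        # shrinks the temporal extent (down to a single frame in the paper's configuration)
        t_pad = 0 if top_level else 1
        self.block = nn.Sequential(
            nn.Conv3d(deep_ch + shallow_ch, out_ch, kernel_size=3, padding=(t_pad, 1, 1)),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, deep: torch.Tensor, shallow: torch.Tensor) -> torch.Tensor:
        deep = F.interpolate(deep, size=shallow.shape[-3:], mode='trilinear',
                             align_corners=False)
        return self.block(torch.cat([deep, shallow], dim=1))   # concat along the channel dimension
```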

3. Experiments and Results

3.1. Dataset

The DAUB dataset [34] was created by B. Hui et al. from the National University of Defense Technology. It consists of 22 video sequences with a total of 16,177 frames, each with a resolution of 256 × 256 pixels and corresponding annotations. All images were captured by a camera operating in the 3–5 μm waveband. The dataset focuses on detecting fixed-wing UAV targets and includes various backgrounds such as sky and ground across diverse scenes. As a real-world dataset, its annotations include the frame index, target index, and the target’s center point. Following the work of Yuan et al. [49], binary mask labels were generated from the center point annotations. The training set and test set split follows the protocol provided at https://github.com/UESTC-nnLab/SSTNet, accessed on 30 May 2025.
However, in the test set, sequence 21 contains severe blind pixels, nearly no inter-frame motion of the target, and extremely dim targets, as illustrated in Figure 9, where the red box indicates the target and the blue boxes mark the blind pixel regions. Therefore, this noisy sequence was excluded from evaluation.
The NUDT-MIRSDT dataset was developed by Li et al. from the National University of Defense Technology [33]. It was generated by embedding targets into real-world captured infrared images, followed by jitter and noise augmentation. The dataset consists of 100 sequences with a total of 10,000 frames, covering diverse scenarios including sky, ocean, and land, and spans short-wave infrared, medium-wave infrared, and a 950 nm waveband. Each frame is resized to 512 × 512 pixels. The images are categorized into two groups: SNR ≤ 3 and 3 < SNR < 10. In the training set, the first 10 frames of each sequence are composed of images with SNR ≤ 3, while the remaining 90 frames contain images with 3 < SNR < 10. The test set includes 8 sequences entirely consisting of images with SNR ≤ 3 and 12 sequences entirely composed of images with 3 < SNR < 10. The training and testing split follows the protocol available at https://github.com/TinaLRJ/Multi-frame-infrared-small-target-detection-DTUM, accessed on 30 May 2025.

3.2. Performance Evaluation Indices

Pd (probability of detection), used to evaluate the detection performance of the model, is calculated as shown in Equation (11):
$$P_d = \frac{T_{TP}}{T_{All}} \times 100\% \qquad (11)$$
where $T_{TP}$ is the number of correctly predicted targets and $T_{All}$ is the number of all targets in the labels. A predicted target is deemed correctly detected if the Euclidean distance between its center and the nearest ground-truth target center is less than or equal to three pixels.
Fa (false alarm rate) is used to evaluate the model’s anti-false-alarm performance, and its calculation method is presented as follows:
$$F_a = \frac{P_{FP}}{\sum_{i=1}^{N} H_i \times W_i} \qquad (12)$$
where $P_{FP}$ represents the number of all pixels falsely predicted as targets, $H_i \times W_i$ is the size of the i-th input image, and N is the number of test images. If the distance between the center of a predicted target and the nearest annotated target center exceeds three pixels, or no matching target exists, the pixels contained in the predicted target are considered false alarms.
In infrared small target detection tasks, the proportion of target pixels in the entire image is extremely small, typically on the order of $10^{-7}$ to $10^{-5}$. This proportion varies significantly across different datasets and therefore cannot intuitively reflect the performance of the model. Specifically, in some cases it becomes difficult to distinguish between models based on the commonly used metrics. As shown in Figure 10, black pixels represent the ground-truth targets, while red and orange pixels denote false alarms detected by two different models. For this particular image prediction, both models yield the same Fa score. However, the red model predicts only one false target, whereas the orange model generates three. These two scenarios have distinctly different negative impacts: in general, the model represented by red causes less interference and thus has a relatively smaller impact. Nevertheless, this advantage is not reflected in the Fa metric, highlighting its limitation.
In such scenarios, a new metric, false alarms of target (FaT), is proposed to address this limitation. The definition of FaT is given as follows:
$$F_{aT} = \frac{T_{FP}}{T_{All}} \times 100\% \qquad (13)$$
where $T_{FP}$ represents the number of falsely predicted targets and $T_{All}$ represents the number of all targets in the labels. A predicted target is considered a false alarm if the distance between its center and the nearest annotated target center exceeds three pixels, or if no corresponding ground-truth target exists. Note that each ground-truth target can be exclusively matched with at most one predicted target for true-positive attribution. This metric counts the number of false alarm targets, providing closer alignment with real-world usage scenarios, and its value more intuitively reflects the model’s false alarm performance.
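The three metrics can be computed from connected components of a binary prediction map, as in the sketch below; centroid-based matching with the three-pixel criterion is used, and scipy's connected-component labeling is an implementation assumption.

```python
import numpy as np
from scipy import ndimage

def evaluate(pred_mask: np.ndarray, gt_centers: list, match_dist: float = 3.0):
    """Compute Pd, Fa and FaT for one binary prediction mask (H x W).

    pred_mask:  binary array, 1 = predicted target pixel.
    gt_centers: list of (row, col) ground-truth target centers.
    """
    labels, n_pred = ndimage.label(pred_mask)                  # group predicted pixels into targets
    centroids = ndimage.center_of_mass(pred_mask, labels, range(1, n_pred + 1))

    matched_gt, false_pixels, false_targets = set(), 0, 0
    for k, c in enumerate(centroids, start=1):
        dists = [np.hypot(c[0] - g[0], c[1] - g[1]) for g in gt_centers]
        j = int(np.argmin(dists)) if dists else -1
        if j >= 0 and dists[j] <= match_dist and j not in matched_gt:
            matched_gt.add(j)                                  # each GT matches at most one prediction
        else:
            false_targets += 1                                 # target-level false alarm (FaT)
            false_pixels += int((labels == k).sum())           # pixel-level false alarm (Fa)

    pd  = len(matched_gt) / max(len(gt_centers), 1)            # probability of detection, Eq. (11)
    fa  = false_pixels / pred_mask.size                        # pixel-level false alarm rate, Eq. (12)
    fat = false_targets / max(len(gt_centers), 1)              # target-level false alarms, Eq. (13)
    return pd, fa, fat
```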
Params: Refers to the number of model parameters, which is used to measure the size and complexity of the model. It represents the total sum of all parameters in the model.
FLOPs: Stands for floating-point operations, a metric used to assess the computational complexity of the model. It indicates the number of floating-point operations required for one forward pass through the model. In this study, tests were conducted using images with a resolution of 256 × 256.
FPS: Represents the model’s running speed, defined as the number of frames processed per second.
ROC curve (receiver operating characteristic curve): It is used to evaluate the performance of a model across varying detection thresholds. The Pd and Fa are metrics used to assess model performance at fixed detection thresholds, whereas the ROC curve provides an overview of model performance across a range of sliding thresholds. In the ROC curve, Fa is plotted on the x-axis and Pd on the y-axis. The closer the ROC curve approaches the top-left corner, the better the model’s performance.

3.3. Network Training

Training was conducted on a workstation equipped with an Intel(R) Core(TM) i9-10920X @ 3.50GHz CPU (Intel Corporation, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 3090 with 24GB of VRAM (NVIDIA Corporation, Santa Clara, CA, USA). The primary software versions used were Python 3.8 and PyTorch 1.8. The Adam optimizer [50] was used with an initial learning rate of 0.001. Evaluation was performed after 20 training epochs. The learning strategy adopted was Cosine Annealing LR [51], where the learning rate gradually decreases to its minimum value over the course of 20 epochs. Unless otherwise specified, the loss function used in the following experiments is soft-IoU loss, as shown in Equation (14):
$$\text{Soft-IoU Loss} = 1 - \frac{\sum_{i,j} p_{i,j} \times t_{i,j} + a}{\sum_{i,j} \left( p_{i,j} + t_{i,j} - p_{i,j} \times t_{i,j} \right) + a} \qquad (14)$$
where i and j represent the row and column coordinates of the image, respectively; background pixels are labeled as 0, and target pixels are labeled as 1. $t_{i,j}$ denotes the ground truth pixel value at the corresponding coordinate, and $p_{i,j}$ represents the predicted pixel value. a is a small constant added to prevent division by zero.
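A PyTorch sketch of the soft-IoU loss of Equation (14) is given below; the default smoothing constant a = 1 is an assumption.

```python
import torch

def soft_iou_loss(pred: torch.Tensor, target: torch.Tensor, a: float = 1.0) -> torch.Tensor:
    """Soft-IoU loss of Eq. (14); pred holds predicted probabilities in [0, 1],
    target holds binary labels, and a is the small smoothing constant."""
    inter = (pred * target).sum(dim=(-2, -1))
    union = (pred + target - pred * target).sum(dim=(-2, -1))
    return (1 - (inter + a) / (union + a)).mean()
```

The optimizer and learning-rate schedule described above can be configured as in the following sketch, with a trivial placeholder module standing in for DEMNet:

```python
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Conv3d(1, 1, kernel_size=3, padding=1)        # placeholder for the DEMNet module
optimizer = Adam(model.parameters(), lr=1e-3)             # initial learning rate 0.001
scheduler = CosineAnnealingLR(optimizer, T_max=20)        # anneal to the minimum over 20 epochs

for epoch in range(20):
    # ... one pass over the training sequences, optimizing the soft-IoU loss of Eq. (14) ...
    scheduler.step()                                      # cosine annealing of the learning rate
```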

3.4. Ablation Study

3.4.1. Effectiveness of the MI Encoder

The study constructed four different models to evaluate the effectiveness of the motion information encoding modules:
Model A: An encoder that uses neither the inter-frame nor intra-frame encoding modules.
Model B: An encoder that uses only the intra-frame positional encoding module.
Model C: An encoder that uses only the inter-frame motion encoding module.
Model D: The complete encoder with both inter-frame and intra-frame motion encoding modules.
These models were trained and evaluated on both the DAUB and NUDT-MIRSDT datasets, and the results are shown in Table 1. The best results are shown in boldface, and the second-best results are underlined.
For the DAUB dataset, the baseline model A, which uses a conventional encoder without any motion encoding module, achieved a Pd of 94.71%. Using only the intra-frame encoding module (model B) improved Pd by 1.66 percentage points, while using only the inter-frame encoding module (model C) improved Pd by 0.85 percentage points. The complete encoder module (model D) achieved a detection rate of 98.28%, an increase of 3.57 percentage points, outperforming both models B and C.
In terms of false alarm suppression, compared to baseline model A, the Fa metric of models B, C, and D decreased by 3.14 × 10−6, 5.05 × 10−6, and 6.25 × 10−6, corresponding to improvements of 38.54%, 60.10%, and 76.88%, respectively. In terms of the FaT metric, compared to model A, the values dropped by 8.99, 13.99, and 14.19 percentage points, respectively.
For the NUDT dataset, model A achieved a Pd of 96.01% as the baseline. Compared to this, models B, C, and D improved Pd by 1.66, 0.85, and 2.2 percentage points, respectively.
In terms of false alarm suppression performance, the Fa metric of models B, C, and D decreased by 9.01 × 10−6, 16.83 × 10−6, and 18.61 × 10−6 compared to baseline model A, corresponding to improvements of 33.16%, 61.94%, and 68.49%, respectively. For the FaT metric, compared to model A, the values decreased by 4.61, 6.27, and 7.43 percentage points, respectively.
These results clearly demonstrate that the two motion encoding modules indeed enhance the model’s target detection capability, resulting in higher detection rates and lower false alarms, thereby improving the overall performance of the model.

3.4.2. Effectiveness of the Spatial Feature Extractor

This section constructs and evaluates three models:
Model A: An encoder module using spatial downsampling and a ResBlock-like [52] residual connection structure for encoding, combined with a decoder module that performs decoding only through concatenation and upsampling.
Model B: An encoder–decoder module without integrating a global attention mechanism during decoding.
Model C: The complete spatial feature extractor module.
These models were trained and tested on the DAUB and NUDT datasets. The results are presented in Table 2.
On the DAUB dataset, Model A achieved a Pd of 94.62% as the baseline. Model B improved Pd by 2.44 percentage points, while Model C, which consists of the complete spatial feature extractor module, improved Pd by 3.66 percentage points. Regarding false alarm performance, the Fa metric for Models B and C decreased by 1.08 × 10−6 and 7.40 × 10−6 compared to baseline Model A, representing improvements of 11.64% and 79.74%, respectively. In terms of FaT, Models B and C reduced the false alarm rate by 0.41 and 1.47 percentage points compared to Model A.
On the NUDT dataset, Model A had a Pd of 94.39% as a baseline. Model B improved Pd by 1.65 percentage points, and Model C improved by 3.82 percentage points. For false alarm performance, the Fa metric of Models B and C decreased by 1.09 × 10−6 and 0.72 × 10−6 compared to Model A, representing improvements of 11.7% and 7.76%, respectively. In FaT, Models B and C reduced false alarms by 0.46 and 0.28 percentage points compared to Model A.
Overall, the encoder–decoder spatial feature extractor extracts the spatial features of the image more effectively, giving the model a higher detection rate and a lower false alarm rate and improving its overall performance.

3.4.3. Optimal Number of Layers of MI Encoders

This study constructs networks with varying numbers of MI encoder layers and evaluates them on the DAUB and NUDT datasets. As shown in Table 3, on the DAUB dataset, a four-layer configuration achieves optimal performance, with a Pd of 98.28%, Fa of 1.88 × 10−6, and FaT of 5.01%. On the NUDT dataset, the four-layer network attains optimal FaT, while Pd and Fa are marginally inferior to the optimal values. Overall, the four-layer motion information encoder structure demonstrates the most balanced performance.

3.4.4. Optimal Number of Frames

This study trains the network using different total frame counts and tests it on the DAUB and NUDT datasets. As shown in Table 4, on the DAUB dataset, a total of five frames yields a Pd of 98.28% and a Fa of 1.88 × 10−6, both performing sub-optimally, while the FaT of 5.01% is optimal. On the NUDT dataset, five frames result in optimal Pd and FaT values, whereas the Fa is only marginally worse. Additionally, increasing the frame count significantly elevates the model’s computational load. Based on the results, five frames are more appropriate overall.

3.5. Comparative Experiments

This paper conducted comparative experiments by evaluating DEMNet against classic single-frame algorithms such as ResUNet [52], DNANet [14], UIUNet [13], MSHNet [17], as well as multi-frame models including RFR [42], DTUM [33], and SST [40] on the DAUB and NUDT datasets. Table 5 and Table 6 present the comparative results on the DAUB and NUDT datasets, respectively.
Note that for the DTUM model trained on the NUDT-MIRSDT dataset, the hybrid training scheme combining the authors’ proposed HPM loss and soft-IoU loss achieved better performance, and these results are reported here. However, on the DAUB dataset, the hybrid loss scheme underperformed compared to using only soft-IoU loss, so the results shown are based solely on the soft-IoU loss training.
The SST model’s test results were obtained using the original authors’ best model weights and evaluated with the same metrics. However, since the SST model’s predictions are bounding boxes, the pixel-level Fa could not be calculated, and thus, only other metrics besides Fa are presented.
On the DAUB dataset, DEMNet achieved Pd of 98.28%, Fa of 1.88 × 10−6, and FaT of 5.01%, outperforming all other compared models and reaching the best overall performance. Among these, Pd improved by 2.42 percentage points over the second-best model, Fa decreased by 4.13 × 10−6 (a reduction of 68.72%), and FaT improved by 5.3 percentage points compared to the second-best.
On the NUDT dataset, DEMNet also demonstrated excellent performance. Across the entire test set, Pd was 98.21%, Fa was 8.56 × 10−6, and FaT was 6.19%, all better than the other compared models and achieving the top performance. Compared to the second-best model, Pd increased by 1.68 percentage points, Fa decreased by 0.67 × 10−6 (a reduction of 7.26%), and FaT improved by 0.46 percentage points. Notably, on the test sequences with SNR ≤ 3, DEMNet’s advantages were even more pronounced, with a Pd of 96.41%, Fa of 6.77 × 10−6, and FaT of 11.72%. Here, Pd outperformed the second-best by 5.67 percentage points, Fa decreased by 2.45 × 10−6 (a reduction of 26.57%), and FaT led by 2.46 percentage points.
Overall, DEMNet demonstrates excellent performance on both datasets, with a particularly significant advantage in detection rate, while also leading in false alarm suppression.
To more clearly showcase the detection capability of DEMNet and the comparison models for dim and small targets under complex environments, two sets of images were randomly selected from the NUDT test sequences with SNR ≤ 3 and from the DAUB test sequences. The visualized prediction results are shown in Figure 11. Images 1 and 2 are from NUDT, while images 3 and 4 are from DAUB.
For images 1 and 2, the prediction maps from different models show that single-frame methods either fail to detect the target, resulting in blank prediction maps, or detect a large number of false alarm pixels and false targets. Only a few single-frame models manage to detect the target, but their predicted shapes deviate significantly from the real target. In contrast, most multi-frame models can locate the target, although the predicted shapes may vary somewhat, and occasionally some false alarms appear. Among them, DEMNet not only detects the target but also predicts the target’s contour most closely matching the real one.
For image 3, the target is relatively small in the frame, with weak infrared features and interference from other bright spots, making detection challenging. Most single-frame methods fail to identify the true target and instead detect interference targets. While multi-frame methods (except DEMNet) detect the true target, they still produce many false alarms.
For image 4, the target has a good signal-to-noise ratio and high contrast against the background, so all models successfully detect the correct target. However, some models’ prediction results contain false alarms. Among the models without false alarms, DEMNet and DTUM provide the most accurate target contour predictions, demonstrating their performance advantages consistent with the comparative experimental results.
Figure 12 shows the ROC curves of DEMNet and the comparison models on the DAUB dataset. In the figure, the ROC curve of DEMNet envelops those of the other models, meaning that at various Fa levels, DEMNet consistently achieves higher Pd values than the comparison models. This demonstrates that DEMNet’s detection performance under varying threshold settings is also superior to all the other models.

4. Conclusions

To address the challenge of detecting dim and weak targets in complex environments, this paper presents DEMNet—a multi-frame infrared small target detection algorithm incorporating a dual encoder–decoder architecture and carefully designed motion encoding modules. During the feature extraction phase, DEMNet fuses features through average pooling and attention mechanisms to obtain multi-scale spatial features. Additionally, guided by the motion consistency principle, the model incorporates a frame-to-frame motion information encoder and an intra-frame positional encoder. These modules constitute the core of the motion encoder, which captures temporal contextual information—particularly inter-frame target motion—to fully utilize temporal cues.
Extensive experiments demonstrate the superior performance of DEMNet in both detection accuracy and false alarm suppression under complex low-SNR conditions. Specifically, on the DAUB dataset, DEMNet improves Pd by 2.42 percentage points, reduces Fa by 4.13 × 10−6 (a 68.72% reduction), and decreases FaT by 5.3 percentage points compared to the second-best model. On the NUDT dataset, DEMNet achieves an improvement of 1.68 percentage points in Pd, reduces Fa by 0.67 × 10−6 (a 7.26% decrease), and lowers FaT by 0.46 percentage points. Notably, on low-SNR (SNR ≤ 3) test sequences, DEMNet achieves a Pd of 96.41%, Fa of 6.77 × 10−6, and FaT of 11.72%, outperforming the next-best method by 5.67 percentage points in Pd, 2.45 × 10−6 in Fa (a 26.57% reduction), and 2.46 percentage points in FaT.
Both the inter-frame motion encoding and intra-frame positional encoding in DEMNet rely on the position indices of maximum values obtained through 2 × 2 Max Pooling operations. This design is based on the assumption that the target corresponds to the maximum value within each pooling region. However, this assumption may not always hold, particularly when the target exhibits a weak signature or when localized bright spots exist in the background, potentially affecting the reliability of motion encoding.
A cross-dataset evaluation, in which models were trained on the DAUB dataset and tested on the NUDT dataset, and vice versa, revealed a substantial degradation in performance across all baseline methods as well as DEMNet. This sharp decline underscores the vulnerability of existing models to distribution shift and highlights the heightened risk of overfitting. Such findings emphasize the importance of addressing generalization in real-world applications, where data characteristics are often non-stationary and may differ significantly from training conditions.
To narrow the robustness gap induced by distribution shift, several promising directions merit further exploration. First, leveraging models pretrained on large-scale image–text datasets may enable the extraction of more robust and transferable representations. Second, incorporating natural language priors during training or inference could provide additional semantic cues to improve adaptability under varying data conditions. Finally, systematic evaluation on larger and higher-quality real infrared small-target datasets will be essential to obtain a more faithful assessment of model performance, thereby reducing the risks associated with synthetic data artifacts.

Author Contributions

Conceptualization, Q.Z. and F.H.; methodology, Q.Z. and F.H.; software, Q.Z.; validation, Q.Z., F.H. and Y.L.; formal analysis, Q.Z. and F.H.; data curation, Q.Z. and T. W.; writing—original draft preparation, Q.Z. and F.H.; writing—review and editing, F.H.; visualization, T.W.; supervision, F.H.; funding acquisition, F.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original data presented in this study are openly available in [33,34]. The source code is available at https://github.com/hfhust/DEMNet, accessed on 30 May 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wu, L.; Ma, Y.; Fan, F.; Wu, M.; Huang, J. A Double-Neighborhood Gradient Method for Infrared Small Target Detection. IEEE Geosci. Remote Sens. Lett. 2021, 18, 1476–1480. [Google Scholar] [CrossRef]
  2. Tong, X.; Sun, B.; Wei, J.; Zuo, Z.; Su, S. EAAU-Net: Enhanced Asymmetric Attention U-Net for Infrared Small Target Detection. Remote Sens. 2021, 13, 3200. [Google Scholar] [CrossRef]
  3. Shao, X.; Fan, H.; Lu, G.; Xu, J. An Improved Infrared Dim and Small Target Detection Algorithm Based on the Contrast Mechanism of Human Visual System. Infrared Phys. Technol. 2012, 55, 403–408. [Google Scholar] [CrossRef]
  4. Qi, S.; Ma, J.; Tao, C.; Yang, C.; Tian, J. A Robust Directional Saliency-Based Method for Infrared Small-Target Detection Under Various Complex Backgrounds. IEEE Geosci. Remote Sens. Lett. 2013, 10, 495–499. [Google Scholar]
  5. Gao, C.; Meng, D.; Yang, Y.; Wang, Y.; Zhou, X.; Hauptmann, A.G. Infrared Patch-Image Model for Small Target Detection in a Single Image. IEEE Trans. Image Process. 2013, 22, 4996–5009. [Google Scholar] [CrossRef]
  6. Chen, C.L.P.; Li, H.; Wei, Y.; Xia, T.; Tang, Y.Y. A Local Contrast Method for Small Infrared Target Detection. IEEE Trans. Geosci. Remote Sens. 2014, 52, 574–581. [Google Scholar] [CrossRef]
  7. Han, J.; Ma, Y.; Zhou, B.; Fan, F.; Liang, K.; Fang, Y. A Robust Infrared Small Target Detection Algorithm Based on Human Visual System. IEEE Geosci. Remote Sens. Lett. 2014, 11, 2168–2172. [Google Scholar]
  8. Zhang, H.; Zhang, L.; Yuan, D.; Chen, H. Infrared Small Target Detection Based on Local Intensity and Gradient Properties. Infrared Phys. Technol. 2018, 89, 88–96. [Google Scholar] [CrossRef]
  9. Moradi, S.; Moallem, P.; Sabahi, M.F. A False-Alarm Aware Methodology to Develop Robust and Efficient Multi-Scale Infrared Small Target Detection Algorithm. Infrared Phys. Technol. 2018, 89, 387–397. [Google Scholar] [CrossRef]
  10. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Asymmetric Contextual Modulation for Infrared Small Target Detection. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 949–958. [Google Scholar]
  11. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Attentional Local Contrast Networks for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9813–9824. [Google Scholar] [CrossRef]
  12. Wang, K.; Du, S.; Liu, C.; Cao, Z. Interior Attention-Aware Network for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5002013. [Google Scholar] [CrossRef]
  13. Wu, X.; Hong, D.; Chanussot, J. UIU-Net: U-Net in U-Net for Infrared Small Object Detection. IEEE Trans. Image Process. 2023, 32, 364–376. [Google Scholar] [CrossRef]
  14. Li, B.; Xiao, C.; Wang, L.; Wang, Y.; Lin, Z.; Li, M.; An, W.; Guo, Y. Dense Nested Attention Network for Infrared Small Target Detection. IEEE Trans. Image Process. 2023, 32, 1745–1758. [Google Scholar] [CrossRef]
  15. Liu, F.; Gao, C.; Chen, F.; Meng, D.; Zuo, W.; Gao, X. Infrared Small and Dim Target Detection with Transformer Under Complex Backgrounds. IEEE Trans. Image Process. 2023, 32, 5921–5932. [Google Scholar] [CrossRef]
  16. He, H.; Wan, M.; Xu, Y.; Kong, X.; Liu, Z.; Chen, Q.; Gu, G. WTAPNet: Wavelet Transform-Based Augmented Perception Network for Infrared Small-Target Detection. IEEE Trans. Instrum. Meas. 2024, 73, 5037217. [Google Scholar] [CrossRef]
  17. Liu, Q.; Liu, R.; Zheng, B.; Wang, H.; Fu, Y. Infrared Small Target Detection with Scale and Location Sensitivity. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; IEEE: Seattle, WA, USA, 2024; pp. 17490–17499. [Google Scholar]
  18. Zhang, M.; Li, X.; Gao, F.; Guo, J.-R. IRMamba: Pixel Difference Mamba with Layer Restoration for Infrared Small Target Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025. [Google Scholar]
  19. Wang, P.; Wang, J.; Chen, Y.; Zhang, R.; Li, Y.; Miao, Z. Paying More Attention to Local Contrast: Improving Infrared Small Target Detection Performance via Prior Knowledge. arXiv 2024, arXiv:2411.13260. [Google Scholar] [CrossRef]
  20. Chen, T.; Tan, Z.; Chu, Q.; Wu, Y.; Liu, B.; Yu, N. TCI-Former: Thermal Conduction-Inspired Transformer for Infrared Small Target Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024. [Google Scholar]
  21. Lin, F.; Bao, K.; Li, Y.; Zeng, D.; Ge, S. Learning Contrast-Enhanced Shape-Biased Representations for Infrared Small Target Detection. IEEE Trans. Image Process. 2024, 33, 3047–3058. [Google Scholar] [CrossRef]
  22. Liu, C.; Song, X.; Yu, D.; Qiu, L.; Xie, F.; Zi, Y.; Shi, Z. Infrared Small Target Detection Based on Prior Guided Dense Nested Network. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5002015. [Google Scholar] [CrossRef]
  23. Chib, P.; Singh, P. Leveraging Language Prior for Infrared Small Target Detection. arXiv 2025, arXiv:2507.13113. [Google Scholar] [CrossRef]
  24. Zhang, M.; Li, X.; Gao, F.; Guo, J.-R.; Gao, X.; Zhang, J. SAIST: Segment Any Infrared Small Target Model Guided by Contrastive Language-Image Pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 10–17 June 2025. [Google Scholar]
  25. Zhang, M.; Wang, Y.; Guo, J.; Li, Y.; Gao, X.; Zhang, J. IRSAM: Advancing Segment Anything Model for Infrared Small Target Detection. arXiv 2024, arXiv:2407.07520. [Google Scholar] [CrossRef]
  26. Zhang, F.; Li, C.; Shi, L. Detecting and Tracking Dim Moving Point Target in IR Image Sequence. Infrared Phys. Technol. 2005, 46, 323–328. [Google Scholar] [CrossRef]
  27. Kim, S.; Sun, S.-G.; Kim, K.-T. Highly Efficient Supersonic Small Infrared Target Detection Using Temporal Contrast Filter. Electron. Lett. 2014, 50, 81–83. [Google Scholar] [CrossRef]
  28. Deng, L.; Zhu, H.; Tao, C.; Wei, Y. Infrared Moving Point Target Detection Based on Spatial–Temporal Local Contrast Filter. Infrared Phys. Technol. 2016, 76, 168–173. [Google Scholar] [CrossRef]
  29. Zhu, H.; Guan, Y.; Deng, L.; Li, Y.; Li, Y. Infrared Moving Point Target Detection Based on an Anisotropic Spatial-Temporal Fourth-Order Diffusion Filter. Comput. Electr. Eng. 2018, 68, 550–556. [Google Scholar] [CrossRef]
  30. Sun, Y.; Yang, J.; An, W. Infrared Dim and Small Target Detection via Multiple Subspace Learning and Spatial-Temporal Patch-Tensor Model. IEEE Trans. Geosci. Remote Sens. 2021, 59, 3737–3752. [Google Scholar] [CrossRef]
  31. Wu, F.; Yu, H.; Liu, A.; Luo, J.; Peng, Z. Infrared Small Target Detection Using Spatiotemporal 4-D Tensor Train and Ring Unfolding. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5002922. [Google Scholar] [CrossRef]
  32. Li, J.; Zhang, P.; Zhang, L.; Zhang, Z. Sparse Regularization-Based Spatial–Temporal Twist Tensor Model for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5000417. [Google Scholar] [CrossRef]
  33. Li, R.; An, W.; Xiao, C.; Li, B.; Wang, Y.; Li, M.; Guo, Y. Direction-Coded Temporal U-Shape Module for Multiframe Infrared Small Target Detection. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 555–568. [Google Scholar] [CrossRef]
  34. Hui, B.; Song, Z.; Fan, H.; Zhong, P.; Hu, W.; Zhang, X.; Ling, J.; Su, H.; Jin, W.; Zhang, Y.; et al. A Dataset for Infrared Detection and Tracking of Dim-Small Aircraft Targets under Ground/Air Background. China Sci. Data 2020, 5, 291–302. [Google Scholar] [CrossRef]
  35. Yao, S.; Zhu, Q.; Zhang, T.; Cui, W.; Yan, P. Infrared Image Small-Target Detection Based on Improved FCOS and Spatio-Temporal Features. Electronics 2022, 11, 933. [Google Scholar] [CrossRef]
  36. Kwan, C.; Gribben, D. Practical Approaches to Target Detection in Long Range and Low Quality Infrared Videos. Signal Image Process. Int. J. 2021, 12, 1–16. [Google Scholar] [CrossRef]
  37. Kwan, C.; Gribben, D.; Budavari, B. Target Detection and Classification Performance Enhancement Using Super-Resolution Infrared Videos. Signal Image Process. Int. J. 2021, 12, 33–45. [Google Scholar] [CrossRef]
  38. Ying, X.; Wang, Y.; Wang, L.; Sheng, W.; Liu, L.; Lin, Z.; Zhou, S. Local Motion and Contrast Priors Driven Deep Network for Infrared Small Target Super-Resolution. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 5480–5495. [Google Scholar] [CrossRef]
  39. Yan, P.; Hou, R.; Duan, X.; Yue, C.; Wang, X.; Cao, X. STDMANet: Spatio-Temporal Differential Multiscale Attention Network for Small Moving Infrared Target Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5602516. [Google Scholar] [CrossRef]
  40. Chen, S.; Ji, L.; Zhu, J.; Ye, M.; Yao, X. SSTNet: Sliced Spatio-Temporal Network with Cross-Slice ConvLSTM for Moving Infrared Dim-Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5000912. [Google Scholar] [CrossRef]
  41. Duan, W.; Ji, L.; Chen, S.; Zhu, S.; Ye, M. Triple-Domain Feature Learning with Frequency-Aware Memory Enhancement for Moving Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5006014. [Google Scholar] [CrossRef]
  42. Ying, X.; Liu, L.; Lin, Z.; Shi, Y.; Wang, Y.; Li, R.; Cao, X.; Li, B.; Zhou, S.; An, W. Infrared Small Target Detection in Satellite Videos: A New Dataset and A Novel Recurrent Feature Refinement Framework. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5002818. [Google Scholar] [CrossRef]
  43. Liu, X.; Zhu, W.; Yan, P.; Tan, Y. IR-MPE: A Long-Term Optical Flow-Based Motion Pattern Extractor for Infrared Small Dim Targets. IEEE Trans. Instrum. Meas. 2025, 74, 5005415. [Google Scholar] [CrossRef]
  44. Peng, S.; Ji, L.; Chen, S.; Duan, W.; Zhu, S. Moving Infrared Dim and Small Target Detection by Mixed Spatio-Temporal Encoding. Eng. Appl. Artif. Intell. 2025, 144, 110100. [Google Scholar] [CrossRef]
  45. Zhu, S.; Ji, L.; Chen, S.; Duan, W. Spatial–Temporal-Channel Collaborative Feature Learning with Transformers for Infrared Small Target Detection. Image Vis. Comput. 2025, 154, 105435. [Google Scholar] [CrossRef]
  46. Zhang, L.; Zhou, Z.; Xi, Y.; Tan, F.; Hou, Q. STIDNet: Spatiotemporally Integrated Detection Network for Infrared Dim and Small Targets. Remote Sens. 2025, 17, 250. [Google Scholar] [CrossRef]
  47. Zhu, S.; Ji, L.; Zhu, J.; Chen, S.; Duan, W. TMP: Temporal Motion Perception with Spatial Auxiliary Enhancement for Moving Infrared Dim-Small Target Detection. Expert Syst. Appl. 2024, 255, 124731. [Google Scholar] [CrossRef]
  48. Zhang, M.; Ouyang, Y.; Gao, F.; Guo, J.; Zhang, Q.; Zhang, J. MOCID: Motion Context and Displacement Information Learning for Moving Infrared Small Target Detection. Proc. AAAI Conf. Artif. Intell. 2025, 39, 10022–10030. [Google Scholar] [CrossRef]
  49. Yuan, S.; Qin, H.; Kou, R.; Yan, X.; Li, Z.; Peng, C.; Wu, D.; Zhou, H. Beyond Full Labels: Energy-Double-Guided Single-Point Prompt for Infrared Small Target Label Generation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 8125–8137. [Google Scholar] [CrossRef]
  50. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  51. Xu, G.; Cao, H.; Dong, Y.; Yue, C.; Zou, Y. Stochastic Gradient Descent with Step Cosine Warm Restarts for Pathological Lymph Node Image Classification via PET/CT Images. In Proceedings of the 2020 IEEE 5th International Conference on Signal and Image Processing (ICSIP), Nanjing, China, 23–25 October 2020; pp. 490–493. [Google Scholar]
  52. Diakogiannis, F.I.; Waldner, F.; Caccetta, P.; Wu, C. ResUNet-a: A Deep Learning Framework for Semantic Segmentation of Remotely Sensed Data. ISPRS J. Photogramm. Remote Sens. 2020, 162, 94–114. [Google Scholar] [CrossRef]
Figure 1. Network architecture of DEMNet.
Figure 2. Encoder of SFEM.
Figure 3. Decoder of SFEM.
Figure 4. MI Encoder module.
Figure 5. Illustration of the motion characteristics of clutter and targets: (a) clutter, whose movement is random; (b) a target, which follows a continuous trajectory.
Figure 6. Example of the inter-frame encoding process. (a) The target’s motion trajectory across consecutive frames. (b) Positions of the multi-frame MaxPooling maxima at location A.
Figure 7. The process of the target crossing the four 2 × 2 pooling regions inside the orange dashed box. As the target passes through this box over consecutive frames, both the maximum value and its location within each pooling region change from frame to frame.
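The position-coding idea sketched in Figures 6 and 7 can be illustrated with a few lines of PyTorch. The snippet below is a minimal sketch rather than the released implementation: the 8 × 8 frames, the 2 × 2 pooling window, and the offset coding are illustrative assumptions. It uses max_pool2d with return_indices=True to recover where the maximum sits inside each pooling region; for a moving target this offset shifts between frames, whereas it stays fixed for static background.

```python
import torch
import torch.nn.functional as F

# Minimal sketch (not the paper's implementation): track how the location of the
# maximum inside each 2x2 pooling region changes from frame to frame.
frames = torch.rand(5, 1, 8, 8)  # (T, C, H, W): five consecutive frames, illustrative only

codes = []
for t in range(frames.shape[0]):
    # max_pool2d with return_indices=True returns the pooled maxima and the flat
    # index of each maximum within the H*W spatial map.
    vals, idx = F.max_pool2d(frames[t:t + 1], kernel_size=2, return_indices=True)
    # Convert the flat index to an offset inside its own 2x2 region (0..3); a moving
    # target changes this offset across frames, static background does not.
    row, col = idx // 8, idx % 8
    offset = (row % 2) * 2 + (col % 2)
    codes.append(offset[0])

motion_code = torch.stack(codes, dim=0)  # (T, C, H/2, W/2) per-region position code
print(motion_code.shape)
```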
Figure 8. MI Decoder module.
Figure 9. An example image from sequence 21 of the DAUB dataset. The red box indicates the target, and the blue boxes mark the blind-pixel regions.
Figure 10. Illustration of the limitations of the Fa metric. The red and orange pixels are all false alarms, and the two groups contain the same number of pixels; however, the orange pixels are counted as three false targets, whereas the red pixels are counted as a single false target.
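The distinction drawn in Figure 10 can be made concrete with a short sketch. The example below is illustrative only (the toy prediction maps and the use of scipy.ndimage.label are assumptions, not the paper’s evaluation code): the pixel-level Fa treats both groups identically, while connected-component counting reports one false target for the connected (red) group and three for the scattered (orange) group.

```python
import numpy as np
from scipy import ndimage

# Toy false-alarm maps with no real targets: both contain three false pixels.
connected = np.zeros((16, 16), dtype=np.uint8)
connected[2, 2:5] = 1                                        # three adjacent pixels (red group)

scattered = np.zeros((16, 16), dtype=np.uint8)
scattered[10, 1] = scattered[10, 5] = scattered[10, 9] = 1   # three isolated pixels (orange group)

for name, mask in (("connected", connected), ("scattered", scattered)):
    fa_pixel = mask.sum() / mask.size        # pixel-level false alarm rate
    _, n_targets = ndimage.label(mask)       # target-level count via connected components
    print(f"{name}: Fa = {fa_pixel:.4f}, false targets = {n_targets}")
# Both maps yield the same pixel-level Fa (3/256), but 1 vs. 3 false targets.
```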
Figure 11. Visualized results of different methods. Images 1 and 2 are from NUDT; images 3 and 4 are from DAUB. Targets are highlighted and enlarged in red boxes, while false alarms are highlighted and enlarged in green boxes.
Figure 12. Comparison of ROC curves for different networks on the DAUB dataset.
Table 1. Ablation study of the MI Encoder. The best results are shown in boldface and the second-best results are underlined.
Model | 3D Max-Pooling | Intra-Frame Encoding | Inter-Frame Encoding | DAUB Pd/% | DAUB Fa/10−6 | DAUB FaT/% | NUDT (All) Pd/% | NUDT (All) Fa/10−6 | NUDT (All) FaT/%
A |  |  |  | 94.71 | 8.13 | 19.20 | 96.01 | 27.17 | 13.62
B |  |  |  | 96.37 | 4.99 | 10.22 | 97.39 | 18.16 | 9.01
C |  |  |  | 95.56 | 3.08 | 5.21 | 96.82 | 10.34 | 7.35
D |  |  |  | 98.28 | 1.88 | 5.01 | 98.21 | 8.56 | 6.19
Table 2. Ablation study of the spatial feature extractor. The best results are shown in boldface and the second-best results are underlined.
Model | Module | DAUB Pd/% | DAUB Fa/10−6 | DAUB FaT/% | NUDT (All) Pd/% | NUDT (All) Fa/10−6 | NUDT (All) FaT/%
A | ResBlock and Upsample | 94.62 | 9.28 | 6.48 | 94.39 | 9.28 | 6.48
B | w/o Attention Fusion | 97.06 | 8.20 | 6.07 | 96.04 | 8.19 | 6.02
C | Complete module | 98.28 | 1.88 | 5.01 | 98.21 | 8.56 | 6.19
Table 3. Ablation study of the number of MI Encoder layers. The best results are shown in boldface and the second-best results are underlined.
Layers | DAUB Pd/% | DAUB Fa/10−6 | DAUB FaT/% | NUDT Pd/% | NUDT Fa/10−6 | NUDT FaT/% | Params/M | FLOPs/G
3 | 93.71 | 4.94 | 11.07 | 97.69 | 8.57 | 9.31 | 3.883 | 62.448
4 | 98.28 | 1.88 | 5.01 | 98.21 | 8.56 | 6.19 | 4.458 | 64.131
5 | 96.99 | 3.82 | 5.44 | 98.50 | 7.63 | 6.48 | 6.754 | 65.812
Table 4. Ablation study of the number of input frames. The best results are shown in boldface and the second-best results are underlined.
Frames | DAUB Pd/% | DAUB Fa/10−6 | DAUB FaT/% | NUDT Pd/% | NUDT Fa/10−6 | NUDT FaT/% | Params/M | FLOPs/G
3 | 92.71 | 8.13 | 12.21 | 91.02 | 21.73 | 19.61 | 4.153 | 37.317
5 | 98.28 | 1.88 | 5.01 | 98.21 | 8.56 | 6.19 | 4.458 | 64.131
7 | 98.84 | 1.81 | 5.38 | 98.07 | 8.03 | 6.47 | 4.763 | 92.848
Table 5. Experimental results and resource overhead of different models on the DAUB dataset. The best results are shown in boldface and the second-best results are underlined.
Category | Model | Pd/% | Fa/10−6 | FaT/% | Params/M | FLOPs/G | FPS
Single Frame | ResUNet [52] | 85.70 | 25.48 | 20.45 | 0.914 | 2.589 | 79.9
Single Frame | DNANet [14] | 92.15 | 14.35 | 21.63 | 1.134 | 7.795 | 37.6
Single Frame | UIUNet [13] | 86.54 | 7.85 | 16.18 | 50.54 | 154.501 | 32.7
Single Frame | MSHNet [17] | 87.29 | 24.77 | 32.74 | 4.065 | 6.065 | 45.1
Multi Frame | DTUM [33] | 95.86 | 6.01 | 10.31 | 0.298 | 15.351 | 24.4
Multi Frame | SST [40] | 89.76 | \ | 5.05 | 11.418 | 43.242 | 21.8
Multi Frame | RFR [42] | 92.16 | 8.79 | 19.43 | 1.206 | 14.719 | 40.2
Multi Frame | DEMNet | 98.28 | 1.88 | 5.01 | 4.458 | 64.131 * | 14.8
(* Note: When spatial feature extraction is performed on every frame of the input window, the computational cost is 64.131 G FLOPs. When the background remains nearly unchanged, however, the features of the latest four frames can be reused, reducing the cost to 18.299 G FLOPs.)
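The reuse strategy mentioned in the note can be sketched with a small sliding cache. The snippet below is only an illustration of the idea under assumed names (spatial_net stands in for the per-frame spatial feature extractor; the real modules are in the released repository): only the newest frame is encoded, and the features of the previous four frames are taken from the cache before the multi-frame stage runs.

```python
from collections import deque
import torch
import torch.nn as nn

# Hypothetical stand-in for the per-frame spatial feature extractor; this is a
# sketch of the caching idea, not the released DEMNet code.
spatial_net = nn.Conv2d(1, 8, kernel_size=3, padding=1)

WINDOW = 5
feature_cache = deque(maxlen=WINDOW)  # features of the most recent WINDOW frames

def detect(new_frame: torch.Tensor):
    """Process one incoming frame, reusing cached features of the previous frames."""
    with torch.no_grad():
        # Only the newest frame passes through the spatial extractor.
        feature_cache.append(spatial_net(new_frame.unsqueeze(0)))
    # Once the window is full, the multi-frame stage sees a (1, C, T, H, W) volume.
    if len(feature_cache) == WINDOW:
        return torch.stack(list(feature_cache), dim=2)
    return None

for _ in range(6):                       # feed a few dummy frames
    out = detect(torch.rand(1, 256, 256))
print(out.shape)                         # torch.Size([1, 8, 5, 256, 256])
```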
Table 6. Experimental results of different models on the NUDT dataset. The best results are shown in boldface and the second-best results are underlined.
Category | Model | SNR ≤ 3 Pd/% | SNR ≤ 3 Fa/10−6 | SNR ≤ 3 FaT/% | 3 < SNR < 10 Pd/% | 3 < SNR < 10 Fa/10−6 | 3 < SNR < 10 FaT/% | All Pd/% | All Fa/10−6 | All FaT/%
Single Frame | ResUNet [52] | 17.58 | 506.95 | 246.13 | 81.33 | 472.12 | 116.67 | 61.48 | 485.97 | 155.87
Single Frame | DNANet [14] | 19.28 | 441.42 | 227.60 | 89.83 | 123.56 | 35.83 | 68.25 | 249.91 | 94.51
Single Frame | UIUNet [13] | 28.36 | 195.82 | 106.62 | 82.67 | 62.81 | 28.47 | 66.05 | 115.69 | 52.17
Single Frame | MSHNet [17] | 4.537 | 441.44 | 175.61 | 86.00 | 66.63 | 29.75 | 61.08 | 215.65 | 74.38
Multi Frame | DTUM [33] | 90.74 | 9.22 | 14.18 | 99.08 | 9.23 | 3.33 | 96.53 | 9.23 | 6.65
Multi Frame | SST [40] | 51.04 | \ | 32.90 | 80.75 | \ | 26.33 | 71.66 | \ | 28.34
Multi Frame | RFR [42] | 39.41 | 60.907 | 39.779 | 90.42 | 106.28 | 41.50 | 74.527 | 88.240 | 40.96
Multi Frame | DEMNet | 96.41 | 6.77 | 11.72 | 99.00 | 9.74 | 3.75 | 98.21 | 8.56 | 6.19
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
