Article

Memory-Based Temporal Transformer U-Net for Multi-Frame Infrared Small Target Detection

1 College of Aerospace Science and Engineering, National University of Defense Technology, Changsha 410073, China
2 School of Communication and Information Engineering, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Remote Sens. 2025, 17(23), 3801; https://doi.org/10.3390/rs17233801
Submission received: 15 October 2025 / Revised: 11 November 2025 / Accepted: 19 November 2025 / Published: 23 November 2025
(This article belongs to the Section AI Remote Sensing)

Highlights

What are the main findings?
  • We propose a memory-based temporal Transformer U-Net (MTTU-Net) for multi-frame infrared small target detection, which employs a memory mechanism to adaptively extract spatio-temporal features from long sequences, thereby achieving better detection performance.
  • MTTU-Net adopts a Transformer-based spatio-temporal feature interactive fusion approach, which can deal with targets and backgrounds with various motion states effectively.
What are the implications of the main findings?
  • Our proposed MTTU-Net overcomes the limitations imposed by the time window paradigm, which restricts spatio-temporal feature extraction in existing algorithms.
  • It also relieves the dependency on frame alignment that is typically required for target enhancement and background suppression, thereby adapting to more complex motion scenarios.

Abstract

In the field of infrared small target detection (ISTD), single-frame ISTD (SISTD), using only spatial features, cannot deal well with dim targets in cluttered backgrounds. In contrast, multi-frame ISTD (MISTD), utilizing spatio-temporal information from videos, can significantly enhance moving target features and effectively suppress background interference. However, current MISTD algorithms are limited by fixed-size time windows, resulting in an inability to adaptively adjust the input amount of spatio-temporal information for different detection scenarios. Moreover, utilizing spatio-temporal features remains a significant challenge in MISTD, particularly in scenarios involving slow-moving targets and fast-moving backgrounds. To address the above problems, we propose a memory-based temporal Transformer U-Net (MTTU-Net), which integrates a memory-based temporal Transformer module (MTTM) into U-Net. Specifically, MTTM utilizes the proposed D-ConvLSTM to sequentially transmit the temporal information in the form of memory, breaking through the limitation of the time window paradigm. And we propose a Transformer-based interactive fusion approach, which is dominated by spatial features of the to-be-detected frame and supplemented by temporal features in the memory, thereby effectively dealing with targets and backgrounds with various motion states. In addition, MTTM is divided into a temporal channel-cross Transformer module (TCTM) and a temporal space-cross Transformer module (TSTM), which achieve target feature enhancement and global background perception through feature interactive fusion in the channel and space dimensions, respectively. Extensive experiments on IRDST and IDSMT datasets demonstrate that our MTTU-Net outperforms existing MISTD algorithms, and they verify the effectiveness of the proposed modules.

1. Introduction

Infrared small target detection (ISTD) aims to accurately locate small targets in infrared images and videos [1]. Benefiting from the remarkable performance of infrared imaging under all-weather and low-light conditions, ISTD is widely used in intrusion warning, target guidance, and maritime rescue [2]. ISTD has been developing rapidly in recent years, and research focuses on two main challenges. Firstly, infrared targets are tiny (less than 9 × 9 pixels) and lack obvious appearance features, such as textures and shapes. Secondly, under interference from complex background clutter, the targets typically exhibit low contrast and a low signal-to-noise ratio (SNR).
For traditional schemes, model-driven detection algorithms have achieved impressive results, such as local-contrast-measure-based algorithms [3,4,5] and low-rank-based algorithms [6,7]. They analyze contrast and texture differences in image features to distinguish small targets from the background. However, they heavily depend on hand-crafted features and prior knowledge, lacking adaptability for diverse scenarios. In contrast, data-driven detection algorithms [8,9,10], mainly based on deep learning, have become the mainstream schemes due to their powerful learning ability for feature representations.
According to the form of input data, ISTD can be classified into single-frame ISTD (SISTD) and multi-frame ISTD (MISTD). SISTD performs static target detection in a single image using only spatial features [4,8,11,12], often with the advantage of low computational complexity. Nevertheless, it suffers from miss detection due to dim target features and false alarms caused by severe background interference. By contrast, MISTD performs moving target detection in the video sequence [13,14,15,16], which can utilize the inter-frame spatio-temporal information to enhance target features and suppress background clutter. Therefore, the research focus of MISTD centers on fully extracting spatio-temporal features from video sequences and effectively utilizing them to improve detection performance.
In terms of extracting spatio-temporal features, almost all existing MISTD algorithms adopt the time window paradigm [5,13,17]. Specifically, these algorithms set a sliding time window with size n (from t − n + 1 to t), where spatio-temporal features are extracted by simultaneously feeding n frames into backbones or U-net’s encoders, as shown in Figure 1a,b, to detect targets in the current frame (to-be-detected frame, Frame t). This paradigm causes computational redundancy in the overlapping part of time windows, reducing the running efficiency. Moreover, as a hyper-parameter, n can only be adjusted manually. If n is too small, algorithms will lack sufficient spatio-temporal information, consequently failing to detect dim targets in challenging scenarios. If n is too large, it will seriously increase the computational complexity and lead to lower efficiency. To balance performance and efficiency, MISTD algorithms with time windows typically extract spatio-temporal features from short sequences, such as five frames [13], though this represents a suboptimal compromise. Fundamentally, the time window paradigm restricts algorithms from adaptively determining the appropriate spatio-temporal information amount for different detection scenarios.
In terms of utilizing spatio-temporal features, current MISTD algorithms typically rely on frame alignment as a prerequisite to achieve target feature enhancement and background interference suppression. For example, Du et al. propose the inter-frame energy accumulation enhancement (IFEA) to enhance the target features [18]; STDMANet utilizes inter-frame differential features from aligned sequences to enhance motion target perception [19]; DTUM encodes inter-frame motion directions to obtain target motion features [17]. However, on the one hand, these algorithms suffer from performance degradation when detecting stationary or slow-moving targets. On the other hand, they are susceptible to false alarms caused by frame alignment errors, especially when the background moves dramatically.
In the video processing research field, the memory mechanism can perform efficient spatio-temporal modeling, which is widely used for various tasks, such as spatio-temporal prediction [20] and video object segmentation [21]. It has the significant advantage of adaptively extracting temporal information from long sequences. Meanwhile, some Transformer-based algorithms (e.g., TimeSformer [22] and VSR Transformer [23]) leverage global attention to adapt to minor inter-frame displacements and efficiently query spatio-temporal information for video-related tasks. Inspired by this, we propose to integrate a memory mechanism with a Transformer to overcome the limitations of the time window paradigm and address the drawbacks of the frame alignment scheme.
According to the above analysis, we propose MTTU-Net, a memory-based temporal Transformer U-Net. As shown in Figure 1c, our MTTU-Net adds a memory-based temporal Transformer module (MTTM) between the encoder and decoder of U-Net for achieving segmentation-based MISTD. Firstly, to break through the limitations of the time window paradigm, MTTM adopts the memory mechanism to store and transmit the spatio-temporal information. Secondly, to utilize spatio-temporal information without frame alignment as the premise, it uses a Transformer to implement the interactive fusion between the spatial features of the current frame and the temporal features in memory. Thirdly, to enhance the perception of targets and backgrounds, MTTM comprehensively implements channel attention and space attention.
In detail, MTTM consists of a temporal channel-cross Transformer module (TCTM) and a temporal space-cross Transformer module (TSTM): TCTM implements feature interactive fusion in the channel dimension to enhance target features for reducing misdetection, while TSTM implements feature interactive fusion in the space dimension to achieve global background perception for reducing false alarms. In TCTM and TSTM, the spatial features of the current frame are set as a query (Q) to be dominant, while the memory is set as a key (K) and value (V) to play an auxiliary role. This scheme makes our MTTU-Net query the temporal information in the memory to obtain performance gain based on detecting targets using the spatial information of the current frame. It not only maintains the sensitivity to stationary or slow-moving targets but also obtains high robustness to the rapid background movement. Additionally, we propose a dual-output convolutional long short-term memory network (D-ConvLSTM) to update the memory about K and V through the gating mechanism. Our codes are available at https://github.com/ZCFengF/MTTU-Net (accessed on 1 October 2025).
The main contributions of our work are summarized as follows:
(1)
We propose MTTU-Net, a memory-based temporal Transformer U-Net for MISTD, which utilizes the proposed D-ConvLSTM to save and update temporal information in memory. It overcomes the limitations of the time window, making it possible to adaptively extract adequate spatio-temporal features from long sequences (more than 10 frames) to improve detection performance.
(2)
We propose MTTM, which adopts a Transformer-based spatio-temporal feature interactive fusion approach. It is dominated by the spatial features of the current frame and supplemented by the temporal features in memory, which can deal with targets and backgrounds with various motion states effectively.
(3)
In MTTM, we present TCTM and TSTM to achieve target feature enhancement and global background perception through feature cross fusion in the channel and space dimensions, which reduce misdetection and false alarms, respectively.

2. Related Work

2.1. Single-Frame Infrared Small Target Detection

Single-frame infrared small target detection takes a single image as input and extracts its spatial features to detect targets; it is usually categorized into model-driven and data-driven algorithms.
Model-driven algorithms include spatial-filter-based, local-contrast-measure-based and low-rank-based algorithms. Spatial-filter-based algorithms utilize the characteristics of infrared images to enhance target features and suppress background clutter, such as Top-Hat [24] and FKRW [25]. Local-contrast-measure-based algorithms extract salient target regions by measuring the maximum contrast between central pixels and their neighboring regions, such as RLCM [4] and MPCM [3]. Low-rank-based algorithms separate small targets based on the low-rank characteristic of backgrounds, such as IPI [6] and RIPT [26]. Model-driven algorithms are suitable for small targets with high SNR, and they may produce serious false alarms in complex scenarios.
Data-driven algorithms primarily exploit deep neural networks to learn target feature representations by training on numerous labeled samples. These algorithms demonstrate outstanding performance in complex scenarios, making them a current research hotspot. For example, ISTDU-Net [27] adds merge connections into U-Net to enhance the difference between small targets and backgrounds. DNANet [11] designs a dense nested structure to enforce the connection between the encoder and decoder in U-Net. MTU-Net [9] combines Vision Transformer (ViT) [28] and a CNN to fuse the multi-level spatial features in U-Net. UIUNet [29] embeds small U-Nets into a large U-Net to learn multi-level feature representations. MSHNet [30] proposes the scale and location sensitive loss to optimize target localization and segmentation. SCTransNet [31] proposes a spatial-channel cross Transformer to realize the interactive fusion of multi-level semantic features in U-Net. In summary, most data-driven SISTD algorithms employ a U-shape network to segment targets from backgrounds, which has the advantage of locating targets and estimating shapes at the pixel level. Moreover, they focus on improving the connection between the encoder and decoder of U-Net to achieve higher performance. In particular, SCTransNet and MTU-Net verify the effectiveness of Transformer-based schemes in the channel and space dimensions, respectively.

2.2. Multi-Frame Infrared Small Target Detection

Multi-frame infrared small target detection takes multiple consecutive frames as input. Compared with SISTD, its advantage lies in extracting inter-frame spatio-temporal features to enhance moving target features and suppress background clutter. MISTD algorithms can also be classified as model-driven and data-driven.
Model-driven algorithms typically process spatio-temporal tensors constructed from multiple frames, including spatio-temporal local contrast measure and spatio-temporal low-rank tensor analysis. Spatio-temporal local contrast measure algorithms, such as STRL-LBCM [32] and STLCTD [5], detect targets by utilizing the contrast difference between targets and backgrounds among multiple frames. Spatio-temporal low-rank tensor analysis algorithms model the infrared image as the sum of target, background, and noise, and they attempt to separate targets from the background by applying low-rank and sparse decomposition (LRSD) to spatio-temporal tensors, such as NFTDGSTV [7] and FST-FLNN [33]. LRSD requires the alternating direction method of multipliers (ADMM), which is computationally inefficient.
Data-driven algorithms can more comprehensively mine the spatio-temporal information from multiple frames by deep neural networks. Researchers have proposed various algorithms to exploit the temporal information. For example, Liu et al. [34] utilize ConvLSTM [20] and 3D convolution to extract the spatio-temporal features from multiple frames, and they use fully connected layers to obtain the location and scale of targets. Du et al. [18] propose IFEA to enhance the target features in the current frame, then they use Faster-RCNN [35] to detect targets. TAD [36] detects small moving targets by capturing the inconsistency of pixel displacements between adjacent frames. ST-Trans [10] takes YOLO [37] as the detection framework and uses the video swin-Transformer [38] to fuse multi-frame spatio-temporal features. SSTNet [13] proposes a cross-slice ConvLSTM to guide CSPDarknet [37] to extract the spatio-temporal features from multiple frames, and it uses a motion-coupling neck to further fuse these features. Tridos [39] detects small targets by combining spatial, temporal, and frequency features of multiple frames. The above algorithms employ target-based frameworks, which perform target localization on downsampled feature maps (typical factor is 8×) and output bounding boxes as detection results, as shown in Figure 1a. These low-resolution feature maps lead to the dilution of small target features and difficulty in precise target localization. In contrast, some algorithms adopt segmentation-based frameworks to obtain target positions and shapes more accurately, as shown in Figure 1b. For example, STDMANet [19] acquires frame difference maps and then feeds them into the improved DNANet to detect moving targets. DTUM [17] replaces 2D convolution in U-net with 3D convolution to extract spatio-temporal features and encodes motion directions to obtain target motion information. LMAFormer [40] uses a local motion-aware Transformer for moving target segmentation. TSINF [16] introduces a spatial–temporal feature fusion Transformer at the deepest level of U-net to segment targets.
Furthermore, all the above algorithms adopt the time window paradigm, which cannot adaptively adjust the input amount of temporal information for various scenarios. Although some algorithms employ memory modules like ConvLSTM, they merely utilize these modules for extracting temporal information within the time window rather than constructing memory-based MISTD frameworks. Additionally, many of them rely on frame alignment to capture moving targets and mitigate dynamic background interference [17,18,19,40], but they often face two critical risks: difficulty in detecting stationary or slow-moving targets and susceptibility to frame alignment errors.

2.3. Memory Mechanism

The memory mechanism was first proposed for temporal modeling tasks, such as speech processing and temporal prediction, aiming to address the long-term dependency challenge. Its evolution diverged into two distinct pathways: (1) the implicit memory mechanism compresses the temporal information into the hidden tensor by gating units, which is efficient but capacity-limited, e.g., LSTM and GRU; (2) the explicit memory mechanism retrieves temporal information through external read–write storage, which is storage-scalable but structurally complex, e.g., MemNN [41] and NTM [42]. In recent years, similar memory mechanisms have been used in advanced vision tasks, e.g., visual tracking [43], spatio-temporal prediction [44], and video object segmentation [45]. Among these algorithms, some methods [44,45] group memory information into key and value features, similar to the Transformer structure, where the key is used for addressing and the essential value features are adaptively retrieved. To leverage temporal information efficiently, our proposed MTTU-Net combines a Transformer with the implicit memory mechanism.

3. Materials and Methods

We propose MTTU-Net, a memory-based temporal Transformer U-Net, whose main motivations include:
(1)
To break through the limitations of the time window paradigm, D-ConvLSTM is proposed to adaptively adjust the input amount of temporal information for different scenarios, as described in Section 3.2.3.
(2)
To handle challenging scenarios like slow-moving targets and fast-moving backgrounds, our proposed MTTM adopts a Transformer-based feature interactive fusion method, which is dominated by the spatial features of the current frame and supplemented by the temporal features in the memory, as described in Section 3.2.
(3)
To reduce misdetection and false alarms, MTTM integrates two components: TCTM fuses features in the channel dimension for enhancing target features, and TSTM fuses features in the space dimension for global background perception, as described in Section 3.2.1 and Section 3.2.2, respectively.
The overall pipeline of MTTU-Net is described in Section 3.1, which employs a segmentation-based framework to accurately localize targets and estimate their shapes.

3.1. Overall Pipeline

As shown in Figure 2, at the current time t, we only input the current frame (to-be-detected frame, Frame t) into MTTU-Net, and we use the memory $M^{t-1} = \{M_C^{t-1}, M_S^{t-1}\}$ to provide the spatio-temporal information to the model. Our MTTU-Net does not require frame alignment processing, and it differs from other MISTD algorithms that need to simultaneously input n frames within the time window (from Frame $t-n+1$ to Frame t) to extract spatio-temporal features. Concretely, in the encoder of the U-shape framework, MTTU-Net employs four DownBlocks, each consisting of a residual block (ResBlock) and Maxpooling, to extract multi-level spatial features $x_i^t \in \mathbb{R}^{C_i \times H/2^{i-1} \times W/2^{i-1}}$ ($i = 1, 2, 3, 4$). H and W are the height and width of the frame, and $C_i$ are the channel dimensions, which are set to 32, 64, 128, and 256, respectively. Next, we perform normalized patch embedding (NPE) on $x_i^t$ using convolutions with kernel size and stride of 16, 8, 4, and 2 to obtain embedded features $e_i^t \in \mathbb{R}^{c \times h \times w}$ with the same size, in which $c = 128$, $h = H/16$, $w = W/16$. Then, $e_i^t$ and $M^{t-1}$ are fed into the proposed MTTM to achieve multi-level spatio-temporal feature fusion, obtaining outputs $o_i^t \in \mathbb{R}^{c \times h \times w}$ and the updated memory $M^t = \{M_C^t, M_S^t\}$. Details of MTTM are provided in the next section. Further, the outputs are recovered to the sizes used in the encoder by a reconstruction operation (RO), which consists of bilinear interpolation and convolution, obtaining $r_i^t \in \mathbb{R}^{C_i \times H/2^{i-1} \times W/2^{i-1}}$. Meanwhile, we employ a residual connection to merge the features between the encoder and decoder. The process described above can be expressed mathematically as follows:
$r_i^t,\; M^t = x_i^t + \mathrm{RO}\left(\mathrm{MTTM}\left(e_i^t, M^{t-1}\right)\right) \quad (1)$
In the decoder of the U-shape framework, we use four UpBlocks, each consisting of upsampling, channel cross-attention (CCA) [46], and CBR (Conv + BN + ReLU), to fuse and decode the low-level and high-level features of $r_i^t$. Finally, the saliency map $\mathrm{Output}^t$ is obtained through 1 × 1 convolution and a sigmoid function for reducing dimension and mapping, respectively.
During model training, to enhance the gradient propagation efficiency and feature representation, we perform a temporal average on the multi-level deeply supervised fusion strategy of SCTransNet to optimize our MTTU-Net. Specifically, to make the model learn how to store and update the memory, we provide m consecutive frames in a batch of training samples. MTTU-Net processes these frames sequentially and calculates the result loss for each frame. The final $Loss$ is the average of the m losses, as follows:
$Loss = \frac{1}{m}\sum_{t=1}^{m}\left( l_{\Sigma}^{t}\left(O_{\Sigma}^{t}, Y^{t}\right) + \sum_{i=1}^{5} l_{i}^{t}\left(O_{i}^{t}, Y^{t}\right) \right) \quad (2)$
where $Y^t \in \mathbb{R}^{1 \times H \times W}$ is the ground truth of frame t, $l_i^t$ is the loss of the upsampled output $O_i^t \in \mathbb{R}^{1 \times H \times W}$ at level i of frame t, and $l_{\Sigma}^t$ is the loss of the all-level fusion output $O_{\Sigma}^t \in \mathbb{R}^{1 \times H \times W}$ of frame t. These losses are calculated using the binary cross entropy (BCE); please refer to [31] for more details.
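As a concrete illustration of the memory-carrying training procedure and the averaged loss in Eq. (2), the following is a minimal PyTorch sketch. The model interface (one frame plus the previous memory in, per-level outputs plus the updated memory out) is an assumption for illustration, not the released implementation.

```python
import torch
import torch.nn.functional as F

def train_step(model, frames, masks, optimizer):
    """One optimization step over m consecutive frames of a sequence (Eq. (2)).

    frames: (m, 1, H, W) consecutive infrared frames of one training sample
    masks:  (m, 1, H, W) ground-truth masks Y^t
    model:  assumed interface -> outputs, memory = model(frame, memory)
    """
    memory = None                 # the model is assumed to zero-initialize its memory
    total_loss = 0.0
    m = frames.shape[0]
    for t in range(m):            # process frames sequentially, carrying the memory forward
        outputs, memory = model(frames[t:t + 1], memory)
        # outputs: the five upsampled level outputs O_i^t plus the fused output O_Sigma^t
        frame_loss = sum(F.binary_cross_entropy(o, masks[t:t + 1]) for o in outputs)
        total_loss = total_loss + frame_loss
    loss = total_loss / m         # average of the m per-frame losses
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```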

3.2. Memory-Based Temporal Transformer Module

MTTM takes multi-level embedded features $e_i^t$ as input and outputs spatio-temporal fused features $o_i^t$ with the same sizes. Specifically, MTTM is composed of TCTM and TSTM, which are connected in a cascaded manner. In both TCTM and TSTM, D-ConvLSTM is used to store and update the memory, serving as the core component of extracting and transmitting temporal information. In addition, the core of utilizing temporal information lies in the Transformer-based spatio-temporal interactive fusion method, where the multi-level spatial features of the current frame are set as the dominant query (Q), and the temporal features in the memory are set as the supplementary key (K) and value (V).

3.2.1. Temporal Channel-Cross Transformer Module

As shown in the left of Figure 3, TCTM establishes channel-wise dependencies between the current frame and the memory, aiming to make the model focus on more discriminative target features to avoid miss detection, as shown in Figure 4a. Given the four levels of encoded features $e_i^t$, TCTM concatenates them in the channel dimension and performs layer normalization (LN) to obtain the input tokens $I_C^t \in \mathbb{R}^{1 \times 4c \times h \times w}$ (the subscript C denotes TCTM). Next, $I_C^t$ is processed along two paths (the Q Path and the K & V Path) to obtain $Q_C^t$, $K_C^t$, and $V_C^t$.
Firstly, in the Q Path, as shown by the red line, $I_C^t$ is processed to obtain $Q_C^t \in \mathbb{R}^{1 \times 4c \times h \times w}$ by utilizing a 1 × 1 convolution to consolidate pixel-wise cross-channel context and then applying a 3 × 3 depth-wise convolution to capture local spatial context. Mathematically,
$Q_C^t = W_C^d\left(W_C^p\left(I_C^t\right)\right) \quad (3)$
where $W_C^p$ is the 1 × 1 point-wise convolution and $W_C^d$ is the 3 × 3 depth-wise convolution. $Q_C^t$ contains the multi-level spatial features of the current frame.
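For reference, the Q Path of Eq. (3) maps onto two standard convolution layers; a minimal sketch, assuming the channel count stays at 4c throughout:

```python
import torch.nn as nn

class QPath(nn.Module):
    """Q Path of TCTM (Eq. (3)): 1x1 point-wise conv followed by 3x3 depth-wise conv."""
    def __init__(self, channels):
        super().__init__()
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)            # W_C^p
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels)                    # W_C^d

    def forward(self, tokens):
        # tokens: layer-normalized input I_C^t of shape (1, 4c, h, w)
        return self.depthwise(self.pointwise(tokens))                             # Q_C^t
```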
Secondly, in the K & V Path, as shown by the blue line, $I_C^t$ is processed to obtain $K_C^t \in \mathbb{R}^{1 \times 4c \times h \times w}$ and $V_C^t \in \mathbb{R}^{1 \times 4c \times h \times w}$ by using the proposed D-ConvLSTM with the additional inputs $K_C^{t-1}$ and $V_C^{t-1}$. Meanwhile, the previous memory $M_C^{t-1}$ is updated as $M_C^t \in \mathbb{R}^{1 \times 4c \times h \times w}$. The above process is as follows:
$M_C^t, K_C^t, V_C^t = \text{D-ConvLSTM}\left(I_C^t, M_C^{t-1}, K_C^{t-1}, V_C^{t-1}\right) \quad (4)$
where $M_C^t$ is only used as an intermediate variable in D-ConvLSTM and does not participate in any other processing. $K_C^t$ and $V_C^t$ contain the spatio-temporal features of the current frame and the previous frames.
Finally, the cross-attention is applied in the channel dimension as
$\text{CrossAtt}\left(Q_C^t, K_C^t, V_C^t\right) = A_C^t V_C^t = \text{Softmax}\left(\frac{Q_C^t \left(K_C^t\right)^T}{d}\right) V_C^t \quad (5)$
$CA_C^t = \text{CFN}\left(\text{RO}\left(\text{CrossAtt}\left(Q_C^t, K_C^t, V_C^t\right)\right)\right) \quad (6)$
where $d = 4c$ is an optional scaling factor, $A_C^t \in \mathbb{R}^{4c \times 4c}$ is the channel covariance-based attention map, and $CA_C^t \in \mathbb{R}^{1 \times 4c \times h \times w}$ is the result of the cross attention, which is split along the channel dimension to obtain the final TCTM output $c_i^t \in \mathbb{R}^{c \times h \times w}$. The complementary feed-forward network (CFN) [31] is designed for detecting infrared small targets, and it can efficiently combine global and local information.
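To make the channel-dimension attention of Eq. (5) concrete, here is a minimal single-head sketch in which each channel acts as a token and spatial positions are flattened; the CFN and RO steps of Eq. (6) are omitted.

```python
import torch

def channel_cross_attention(Q, K, V):
    """Channel-wise cross-attention (Eq. (5)).

    Q, K, V: (1, 4c, h, w); Q from the current frame, K and V from the memory path.
    The attention map A_C^t has shape (4c, 4c), i.e., channels attend to channels.
    """
    _, c4, h, w = Q.shape
    q, k, v = Q.flatten(2), K.flatten(2), V.flatten(2)         # each: (1, 4c, h*w)
    attn = torch.softmax(q @ k.transpose(1, 2) / c4, dim=-1)   # scaled by d = 4c, (1, 4c, 4c)
    out = attn @ v                                             # (1, 4c, h*w)
    return out.reshape(1, c4, h, w)
```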
TCTM achieves two kinds of feature interactive fusion simultaneously. One is the interactive fusion of multi-level spatial features in the current frame, which is conducive to the semantic interaction between different levels. The other is the interactive fusion between the spatial features of the current frame and the temporal features of the memory. Note that the multi-level spatial features of the current frame, which are contained in $Q_C^t$, participate in both kinds of fusion simultaneously and thus play a dominant role in the detection process. In contrast, the spatio-temporal features in $K_C^t$ and $V_C^t$ only play an auxiliary role because they are involved only in the second kind of fusion. Based on using the spatial information of the current frame for detecting targets, this strategy enables the model to query the effective temporal information to obtain a performance gain. It offers the distinct advantages of maintaining sensitivity to slow-moving targets and avoiding the interference caused by rapid background motion.

3.2.2. Temporal Space-Cross Transformer Module

In MISTD, the spatio-temporal features of backgrounds are crucial to suppressing clutter and eliminating false alarms. TSTM establishes space-wise dependencies between the current frame and the memory, aiming to enhance the global perception of backgrounds. As shown in the right of Figure 3, the structure of TSTM is similar to that of TCTM, but there are three differences: (1) The inputs $c_i^t$ are concatenated along the batch dimension instead of the channel dimension to maintain the channel independence of the multi-level features in subsequent processing. (2) To reduce the computational complexity, pyramid pooling [48] is used to compress the space dimensions of $K_S^t$ and $V_S^t$ (the subscript S denotes TSTM). (3) The cross attention is implemented in the space dimension.
In detail, the concatenated $c_i^t$ is normalized using LN to obtain the input tokens $I_S^t \in \mathbb{R}^{4 \times c \times h \times w}$. Then, according to (7) and (8), we obtain $Q_S^t \in \mathbb{R}^{4 \times c \times h \times w}$, $K_S^t \in \mathbb{R}^{4 \times c \times h \times w}$, and $V_S^t \in \mathbb{R}^{4 \times c \times h \times w}$ through the Q Path and the K & V Path, respectively.
$Q_S^t = W_S^d\left(W_S^p\left(I_S^t\right)\right) \quad (7)$
$M_S^t, K_S^t, V_S^t = \text{D-ConvLSTM}\left(I_S^t, M_S^{t-1}, K_S^{t-1}, V_S^{t-1}\right) \quad (8)$
where $W_S^p$ is the 1 × 1 point-wise convolution and $W_S^d$ is the 3 × 3 depth-wise convolution. The previous memory $M_S^{t-1}$ is updated as the current memory $M_S^t \in \mathbb{R}^{4 \times c \times h \times w}$.
Next, $K_S^t$ and $V_S^t$ are compressed using pyramid pooling (PP) to obtain $\tilde{K}_S^t \in \mathbb{R}^{4 \times c \times l}$ and $\tilde{V}_S^t \in \mathbb{R}^{4 \times c \times l}$, respectively, where $l = 110$, as shown in Figure 5. The essence of PP is multi-scale sparse sampling, which reduces the computational complexity of the subsequent cross attention from $O\left(2c(4hw)^2\right)$ to $O\left(2c(4l)(4hw)\right)$.
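A sketch of this compression, assuming adaptive average pooling at the bin sizes [1, 3, 6, 8] reported in Section 4.3; flattening and concatenating the pooled grids yields l = 1² + 3² + 6² + 8² = 110 tokens per channel.

```python
import torch
import torch.nn.functional as F

def pyramid_pool(x, bins=(1, 3, 6, 8)):
    """Compress the spatial dimensions of K_S^t / V_S^t by multi-scale pooling.

    x: (4, c, h, w)  ->  (4, c, l) with l = sum(b * b for b in bins) = 110
    """
    pooled = [F.adaptive_avg_pool2d(x, b).flatten(2) for b in bins]   # each: (4, c, b*b)
    return torch.cat(pooled, dim=2)                                    # (4, c, 110)
```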
Finally, the cross-attention is applied in the space dimension as
$\text{CrossAtt}\left(Q_S^t, \tilde{K}_S^t, \tilde{V}_S^t\right) = A_S^t \tilde{V}_S^t = \text{Softmax}\left(\frac{Q_S^t \left(\tilde{K}_S^t\right)^T}{d}\right) \tilde{V}_S^t \quad (9)$
$CA_S^t = \text{CFN}\left(\text{RO}\left(\text{CrossAtt}\left(Q_S^t, \tilde{K}_S^t, \tilde{V}_S^t\right)\right)\right) \quad (10)$
where $d = c$ is an optional scaling factor, $A_S^t \in \mathbb{R}^{4hw \times 4l}$ is the space covariance-based attention map, and $CA_S^t \in \mathbb{R}^{4 \times c \times h \times w}$ is the result of the cross attention, which is split along the batch dimension to obtain the final TSTM output $o_i^t \in \mathbb{R}^{c \times h \times w}$.
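For comparison with TCTM, a minimal sketch of the space-dimension attention in Eq. (9): every pixel of the current frame is a query token, while the keys and values are the 4l pyramid-pooled memory tokens, which is where the complexity reduction comes from. The token layout is an illustrative assumption.

```python
import torch

def spatial_cross_attention(Q, K_tilde, V_tilde):
    """Space-wise cross-attention (Eq. (9)) with pyramid-pooled keys/values.

    Q:        (4, c, h, w)  queries from the current frame
    K_tilde:  (4, c, l)     compressed memory keys   (l = 110)
    V_tilde:  (4, c, l)     compressed memory values (l = 110)
    """
    b, c, h, w = Q.shape
    q = Q.flatten(2).transpose(1, 2)                          # (4, h*w, c): pixels as tokens
    k = K_tilde.transpose(1, 2)                               # (4, l, c)
    v = V_tilde.transpose(1, 2)                               # (4, l, c)
    attn = torch.softmax(q @ k.transpose(1, 2) / c, dim=-1)   # scaled by d = c, (4, h*w, l)
    out = attn @ v                                            # (4, h*w, c)
    return out.transpose(1, 2).reshape(b, c, h, w)
```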
Similar to TCTM, the above processing is dominated by the features of the current frame, and it performs two kinds of feature interactive fusion simultaneously. One is the interactive fusion among the spatial features at different levels in the current frame, achieving multi-scale global background perception. The other is the feature interactive fusion between the current frame and the memory to utilize the temporal information of dynamic backgrounds. Overall, as shown in Figure 4b, TSTM enables our model to perceive more complete background regions, which is beneficial to reduce false alarms caused by background interference.

3.2.3. Dual-Output Convolutional LSTM

As a two-dimensional variant of the original LSTM, ConvLSTM can process two-dimensional image data by replacing the fully connected layers with convolutional layers. It is efficient and assigns more attention to recent temporal information, making it well-suited for the MISTD task, where recent temporal context is intuitively more crucial for detecting targets in the current frame. However, the limited receptive field of ConvLSTM makes it difficult to extract the temporal features of fast-moving scenarios. The common solution is to stack multiple memory units to expand the receptive field, such as PredRNN++ [49], yet this not only significantly increases the computational complexity but also causes the vanishing gradient problem. In our detection framework, the dual-output ConvLSTM (D-ConvLSTM) can naturally extend its receptive field by incorporating the external U-Net encoders, which have multi-scale receptive fields. Thus, D-ConvLSTM only needs to take the concatenated multi-level features $I^t \in \{I_C^t, I_S^t\}$ (the subscripts C and S denote TCTM and TSTM, respectively) as input to alleviate this problem. Additionally, D-ConvLSTM sets two output branches to meet the requirement of obtaining $K^t \in \{K_C^t, K_S^t\}$ and $V^t \in \{V_C^t, V_S^t\}$ in TCTM and TSTM.
As shown in Figure 6, D-ConvLSTM takes $I^t$, $K^{t-1}$, and $V^{t-1}$ as input and outputs $K^t$ and $V^t$. $M^{t-1} \in \{M_C^{t-1}, M_S^{t-1}\}$ represents the previous memory that stores the temporal information before the current frame t, and it is updated every frame. Concretely, D-ConvLSTM sets a forget gate $f^t$ to selectively discard irrelevant information in the memory, an input gate $i^t$ to regulate the memory update, and two output gates $o_K^t$, $o_V^t$ to generate $K^t$ and $V^t$, which can integrate spatio-temporal features. Among them, $o_K^t$ and $o_V^t$ are mainly controlled by $K^{t-1}$ and $V^{t-1}$, respectively, which facilitates maintaining the functional consistency of $K^t$ and $V^t$ as key and value in the sequence processing. The above gating mechanism enables the model to adaptively extract a longer span of temporal information. The detailed process is as follows:
$\begin{aligned} f^t, i^t, \tilde{M}^t &= \phi\left(\mathrm{Ch}\left(W_d^M\left(W_p^M\left(\left[I^t, K^{t-1}, V^{t-1}\right]\right)\right)\right)\right) \\ o_K^t &= \mathrm{Sigmoid}\left(W_d^K\left(W_p^K\left(\left[I^t, K^{t-1}\right]\right)\right)\right) \\ o_V^t &= \mathrm{Sigmoid}\left(W_d^V\left(W_p^V\left(\left[I^t, V^{t-1}\right]\right)\right)\right) \\ M^t &= f^t \odot M^{t-1} + i^t \odot \tilde{M}^t \\ K^t &= o_K^t \odot \mathrm{Tanh}\left(M^t\right) \\ V^t &= o_V^t \odot \mathrm{Tanh}\left(M^t\right) \end{aligned} \quad (11)$
where $\phi$ denotes the activation function, $\odot$ represents the element-wise product, $[\cdot]$ indicates channel concatenation, Ch stands for chunking channels, $W_p^{(\cdot)}$ is the 1 × 1 point-wise convolution, $W_d^{(\cdot)}$ is the 3 × 3 depth-wise convolution, and $M^t$ is the updated memory. The size of all variables is $b \times \tilde{c} \times h \times w$, where b is the batch size and $\tilde{c}$ is the channel number. In TCTM, $b = 1$ and $\tilde{c} = 4c$; in TSTM, $b = 4$ and $\tilde{c} = c$.
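The gating equations in Eq. (11) translate almost directly into a recurrent cell; below is a minimal PyTorch sketch. The point-wise/depth-wise convolution pairs and the channel chunking follow the text, while the choice of activations (sigmoid for the forget/input gates, tanh for the candidate memory, standing in for the unspecified φ) is an assumption.

```python
import torch
import torch.nn as nn

class DConvLSTM(nn.Module):
    """Dual-output ConvLSTM cell sketch (Eq. (11))."""
    def __init__(self, channels):
        super().__init__()
        def pw_dw(in_ch, out_ch):
            # 1x1 point-wise conv (W_p) followed by 3x3 depth-wise conv (W_d)
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=1),
                nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, groups=out_ch))
        self.gates_m = pw_dw(3 * channels, 3 * channels)  # -> f^t, i^t, M~^t (chunked)
        self.gate_k = pw_dw(2 * channels, channels)       # -> o_K^t
        self.gate_v = pw_dw(2 * channels, channels)       # -> o_V^t

    def forward(self, I, M_prev, K_prev, V_prev):
        f, i, m_cand = torch.chunk(
            self.gates_m(torch.cat([I, K_prev, V_prev], dim=1)), 3, dim=1)
        f, i, m_cand = torch.sigmoid(f), torch.sigmoid(i), torch.tanh(m_cand)
        o_k = torch.sigmoid(self.gate_k(torch.cat([I, K_prev], dim=1)))   # key output gate
        o_v = torch.sigmoid(self.gate_v(torch.cat([I, V_prev], dim=1)))   # value output gate
        M = f * M_prev + i * m_cand          # memory update
        K = o_k * torch.tanh(M)              # dual outputs K^t and V^t
        V = o_v * torch.tanh(M)
        return M, K, V
```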
In terms of information transmission, $K^t$ and $V^t$ fuse the spatial features of the current frame with the temporal features of the memory. In terms of model design, D-ConvLSTM replaces the convolutional layers originally used to obtain K and V in the Transformer, which is the core of realizing the memory-based temporal Transformer in our work.

4. Results

4.1. Datasets

We conducted experiments using two MISTD datasets with mask annotations, IRDST [47] and IDSMT [50]. IRDST is a real dataset, which contains real infrared targets and backgrounds captured by handheld infrared cameras (long-wave, 7.5–13.5 μm and 7–13 μm). According to [13], its training set consists of 42 videos containing 20,398 frames, and the test set consists of 43 videos containing 20,258 frames. The resolution of each frame is uniformly adjusted to 480 × 720. IDSMT is a semi-synthetic dataset, which contains real infrared backgrounds captured by UAV-carried infrared cameras (long-wave, 8–14 μm) and simulated infrared targets generated using an adversarial generation network. Its training set and test set each contain 100 videos; each video contains 300 frames, and the resolution of each frame is 512 × 640. These two datasets cover diverse scenarios, e.g., clouds, buildings, and vegetation, where targets and backgrounds are in motion. All targets are smaller than 9 × 9 pixels, and the majority of them exhibit an SNR of less than 5 dB.

4.2. Evaluation Metrics

We compare the proposed MTTU-Net with several SOTA algorithms using two kinds of evaluation metrics: target-level and pixel-level.
Target-level metrics are used to measure the localization ability, including precision (P) (%), recall (R) (%), and F1 score (%). According to [11], a target is considered correctly predicted if its centroid deviation is less than 3 pixels.
$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \times P \times R}{P + R} \quad (12)$
where $TP$, $FP$, and $FN$ denote the numbers of correctly detected targets (true positives), false alarms (false positives), and missed targets (false negatives), respectively. According to [11,31], recall is also referred to as the probability of detection ($P_d$). As a comprehensive metric, the F1 score combines both precision and recall.
Pixel-level metrics are used to measure the shape description ability, including Intersection over Union ($IoU$) (%) and false alarm rate ($F_a$).
$IoU = \frac{A_{inter}}{A_{union}}, \quad F_a = \frac{P_{false}}{P_{all}} \quad (13)$
where $A_{inter}$ and $A_{union}$ represent the intersection and union areas, respectively, and $F_a$ is the ratio of falsely predicted pixels $P_{false}$ over all pixels $P_{all}$.
We use F1 and $IoU$ as the main evaluation metrics. In addition to the fixed-threshold evaluation, we also utilize Precision–Recall (P-R) curves and Receiver Operating Characteristic (ROC) curves to comprehensively evaluate algorithms. The ROC curve describes the changing trend of $P_d$ under varying $F_a$.
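A minimal sketch of how these metrics can be computed from binary masks and target-level counts; the target matching itself (centroid extraction and the 3-pixel deviation test) is left abstract here, and the helper names are illustrative.

```python
import numpy as np

def pixel_metrics(pred, gt):
    """IoU and false-alarm rate F_a (Eq. (13)) for binary masks of shape (H, W)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / union if union > 0 else 0.0
    fa = np.logical_and(pred, ~gt).sum() / pred.size   # falsely predicted pixels / all pixels
    return iou, fa

def target_metrics(tp, fp, fn):
    """Precision, recall, and F1 (Eq. (12)) from target-level counts."""
    p = tp / (tp + fp) if tp + fp > 0 else 0.0
    r = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1
```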

4.3. Implementation Details

We employ ResU-Net [51] as our detection backbone; the number of downsampling layers is 4, and the basic width is set to 32. The kernel sizes and stride sizes for NPE are 16, 8, 4, and 2, and the number of channels c is 128. The number of MTTMs is 2, so the total number of TCTMs and TSTMs is 4, and the memory is initialized with zero tensors. According to [48], the pyramid pooling sizes are set to [1, 3, 6, 8]. Our MTTU-Net does not use any pre-trained weights for training; every image undergoes normalization and random cropping into 256 × 256 patches. To avoid over-fitting, we augment the training data through random flipping and rotation. We initialize the weights and biases of our model using the Kaiming initialization method. The model is trained using the BCE loss function and optimized by the Adam optimizer with an initial learning rate of 0.001, and the learning rate is gradually decreased to $1 \times 10^{-5}$ using the Cosine Annealing strategy. The number of epochs is set to 100, the batch size is set to 8, and the number of consecutive frames m in each batch is set to 5. Following [11,31], the fixed threshold to segment the saliency map is set to 0.5. The proposed MTTU-Net is implemented on a single Nvidia GeForce 3090 GPU.
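The optimization setup described above maps onto standard PyTorch components; a minimal sketch with a stand-in module in place of the full network:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(1, 1, kernel_size=3, padding=1)   # stand-in; substitute the MTTU-Net model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)            # initial lr 0.001
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-5)                              # cosine decay to 1e-5

for epoch in range(100):                             # 100 epochs, batch size 8
    # ... train on batches of m = 5 consecutive 256 x 256 crops ...
    scheduler.step()
```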

4.4. Comparisons with Other Algorithms

To evaluate the performance of our algorithm, we compare MTTU-Net to 15 SOTA algorithms, including 9 SISTD algorithms and 6 MISTD algorithms. Specifically, SISTD algorithms include model-driven algorithms (Top-Hat [24], FKRW [25], MPCM [3], RIPT [26]) and data-driven algorithms (ISTDU-Net [27], DNANet [11], RDIAN [47], MSHNet [30], SCTransNet [31]). MISTD algorithms include model-driven algorithms (NFTDGSTV [7], STRL-LBCM [32]) and data-driven algorithms (TAD [36], SSTNet [13], Tridos [39], DTUM [17]). To guarantee an equitable comparison, we retrained all the learning-based algorithms using the same training datasets as our MTTU-Net.

4.4.1. Quantitative Comparison

The quantitative results are shown in Table 1. Overall, the model-driven algorithms, both SISTD algorithms and MISTD algorithms, perform worse on all metrics and have a large gap with the data-driven algorithms. A major reason is that model-driven algorithms heavily depend on hand-crafted features and prior knowledge, lacking adaptability for diverse scenarios.
For data-driven algorithms, existing MISTD algorithms do not demonstrate absolute performance advantages over SISTD algorithms, and some MISTD algorithms even perform worse. One possible reason is that segmentation-based detection frameworks have a remarkable advantage in the small target detection task because they preserve high-resolution feature representations for pixel-level target localization. Thus, they obtain excellent detection results using only the spatial features of a single frame. For example, MSHNet achieves a high F1 of 93.02% and a high $IoU$ of 60.64% on IRDST, and SCTransNet achieves a high F1 of 92.77% and a high $IoU$ of 73.59% on IDSMT. In contrast, target-based detection frameworks downsample feature maps through backbones, causing small target features to be diluted. This restricts their performance even though spatio-temporal information is utilized. For example, SSTNet and Tridos utilize CSPDarknet to extract features and use the anchor-based method to detect targets on low-resolution feature maps. Additionally, segmentation-based MISTD is still in its primary stage, and existing algorithms fail to effectively combine the U-shape segmentation framework and spatio-temporal feature utilization. For instance, DTUM simply replaces the 2D convolution in U-Net with 3D convolution to extract spatio-temporal features from aligned multiple frames. Its results have a high false alarm rate when dealing with complex motion, with an $F_a$ of $4.28 \times 10^{-6}$ on IRDST and an $F_a$ of $4.68 \times 10^{-6}$ on IDSMT.
Our MTTU-Net achieves the best performance on most evaluation metrics over IRDST and IDSMT, especially F1 and $IoU$. This proves that our algorithm performs well in both target localization and shape prediction. Moreover, MTTU-Net has the lowest $F_a$, indicating that our segmentation results have fewer false alarm pixels. Specifically, on IRDST, although MTTU-Net's 93.78% is slightly lower than SSTNet's 94.01% in terms of P, and its 94.82% is lower than RDIAN's 95.62% in terms of R, MTTU-Net still obtains the highest F1 of 94.30%, achieving a superior balance between P and R. In terms of pixel-level metrics, MTTU-Net obtains the highest $IoU$ of 60.71% and the lowest $F_a$ of $1.02 \times 10^{-6}$. On IDSMT, MTTU-Net demonstrates the best performance across all metrics, outperforming the second-place SCTransNet by a large margin, leading by 2.71% and 2.93% in terms of F1 and $IoU$, respectively. In addition, in terms of F1 and $IoU$, most algorithms show drastic performance fluctuations between the two datasets, e.g., MSHNet, SSTNet, and Tridos, while MTTU-Net and SCTransNet remain stable and show strong robustness to different scenarios. This may be attributed to the powerful attention mechanism of the Transformer, which improves the adaptability of models. Figure 7 shows the P-R curves and ROC curves of the data-driven algorithms. The larger the area under the curve, the better the algorithm's performance. By comparison, on IRDST, our P-R curve shows a slight advantage, while our $P_d$-$F_a$ curve shows an obvious improvement. On IDSMT, both of our curves distinctly exceed the curves of other algorithms. This verifies that our MTTU-Net has the best overall performance with the best balance between P and R, as well as between $P_d$ and $F_a$.

4.4.2. Qualitative Comparison

The qualitative results of nine representative algorithms on IRDST and IDSMT are given in Figure 8, where the target-based algorithms take bounding boxes as output and the segmentation-based algorithms take binary images as output. Model-driven algorithms, such as Top-Hat and STRL-LBCM, produce a large number of false alarms and missed detections in complex scenarios. Conversely, data-driven algorithms exhibit remarkable adaptability to complex scenarios by training on large-scale datasets, and they can accurately detect small targets even in challenging urban scenarios, as shown in Figure 8(4). Furthermore, target-based algorithms, e.g., TAD, SSTNet, and Tridos, usually have small positional biases in their output bounding boxes. This is because they employ generic detection frameworks that localize targets on low-resolution feature maps (typically obtained by 8× downsampling) and hence struggle to precisely locate small targets of merely several pixels. Segmentation-based algorithms avoid this problem by preserving high-resolution feature representations.
Overall, our MTTU-Net successfully detects targets in various scenarios. Figure 8(2) illustrates a difficult case of a target close to a tree branch, where all SISTD algorithms fail to detect it, while MISTD algorithms, e.g., SSTNet, Tridos, and our MTTU-Net, utilize the spatio-temporal information to avoid the interference and detect it successfully. Figure 8(3) and Figure 8(5) show cases of a slow-moving target and a fast-moving background, respectively, which do not affect SISTD algorithms, such as DNANet and SCTransNet, but increase the difficulty of effectively utilizing spatio-temporal information for MISTD algorithms. For example, Tridos and DTUM miss slow-moving targets; when backgrounds move fast, TAD, SSTNet, and Tridos produce missed detections, and DTUM, which requires frame alignment, generates severe false alarms. In contrast, our MTTU-Net is dominated by the spatial features of the current frame and queries valid temporal information from the memory to enhance performance. This approach not only maintains the sensitivity to slow-moving targets but also effectively avoids the interference caused by fast-moving backgrounds.

5. Discussion

5.1. Effects of Different Components

MTTU-Net adopts a segmentation-based detection framework for MISTD, including ResU-Net for building the main structure, CCA for enhancing decoding, and deep supervision (DS) for model training. In Table 2, by incrementally incorporating these components, we can observe that the algorithm performance improves consistently, which validates their effectiveness for ISTD. Then, by adding our MTTM into the above framework, the algorithm performance is greatly improved, which verifies the effectiveness of MTTM. Specifically, MTTM enhances F1 by 2.58% and $IoU$ by 4.17% on IRDST, and it enhances F1 by 3.63% and $IoU$ by 7.16% on IDSMT. In MTTM, TCTM improves the performance more significantly than TSTM, achieving a higher F1 by 0.43% and a higher $IoU$ by 0.52% on IRDST, and a higher F1 by 1.97% and a higher $IoU$ by 1.94% on IDSMT, while combining the two modules achieves the best performance. This indicates that although the target semantic feature enhancement achieved by channel attention is more important for detecting infrared small targets, the global background perception achieved by space attention is still indispensable.
When two MTTMs are used, TCTM and TSTM can be arranged in four ways, and Table 3 shows the effects of different arrangements on algorithm performance. Among them, the alternate arrangement works better, such as CSCS and SCSC, and there is no obvious difference between them. One possible reason is that the alternate mode is more conducive to the complementation of TCTM and TSTM.

5.2. Ablation Study in TCTM and TSTM

Keeping the optimal setting of TSTM, we analyze the impact of CFN and the proposed D-ConvLSTM in TCTM. Specifically, we compare CFN with FFN and D-ConvLSTM with the convolutional layers; both FFN and the convolutional layers are originally used in the Transformer [28]. Namely, we generate $K_C^t$ and $V_C^t$ in the same way as $Q_C^t$. As shown in Table 4, CFN improves algorithm performance at the cost of a slight increase in computation and parameters. D-ConvLSTM allows TCTM to utilize the temporal features to enhance the target features and achieves a significant performance improvement. However, D-ConvLSTM needs to store and update the memory, resulting in more parameters and computation.
Keeping the optimal setting of TCTM, we analyze the impact of batch concatenation (BC), CFN, D-ConvLSTM, and pyramid pooling (PP) in TSTM. Correspondingly, we use channel concatenation, FFN, and convolutional layers as comparison settings, respectively. By analyzing Table 5, three conclusions can be obtained as follows:
(1)
Similar to TCTM, both CFN and D-ConvLSTM in TSTM improve algorithm performance effectively. Note that, comparing Table 2 and Table 5, using TSTM without D-ConvLSTM (the second row of Table 5) results in lower evaluation metrics than not using TSTM at all (the fourth row of Table 2). This indicates that TSTM is beneficial to our MTTU-Net only when it uses spatio-temporal features; when it uses only spatial features, the model becomes less robust against background clutter.
(2)
Pyramid pooling does not increase model parameters. Moreover, it reduces the computation of cross-attention in the space dimension by about 26 G, which greatly improves the running speed from 7.57 fps to 15.07 fps without performance degradation.
(3)
Compared with channel concatenation, batch concatenation reduces the channel number of input tokens so that the parameters and computation amount of CFN and D-ConvLSTM in TSTM are significantly less than those in TCTM. In addition, batch concatenation can maintain the channel independence of multi-level features to avoid confusion in the cross-attention, which makes our MTTU-Net achieve higher performance.

5.3. Effects of Different Input Forms in D-ConvLSTM

As the outputs of D-ConvLSTM, $K^t$ and $V^t$ serve as critical interfaces for TCTM and TSTM to leverage temporal information. They are separately controlled by the output gates $o_K^t$ and $o_V^t$, which are influenced by different input forms, as shown in Table 6. The simplest input form is $K^{t-1}$ or $V^{t-1}$, but it yields the worst results due to the lack of interaction with the current-frame information $I^t$. The most comprehensive input form is the concatenated $[I^t, K^{t-1}, V^{t-1}]$, but it obscures the functional consistency of $K^t$ and $V^t$ as key and value during sequence processing, ultimately leading to performance degradation. In contrast, the optimal input form is the concatenated $[I^t, K^{t-1}]$ or $[I^t, V^{t-1}]$, which effectively avoids the above problems and obtains the best result.

5.4. How MTTM Works

To demonstrate the effectiveness of MTTM, we give two typical examples to explain how MTTM works, as shown in Figure 9. According to (1), in MTTU-Net, the decoder's input $r_i^t$ is the sum of the encoder's output $x_i^t$ and the output $o_i^t$ obtained by processing $x_i^t$ using MTTM. Therefore, the effect of MTTM on MTTU-Net can be observed by comparing $x_i^t$ and $r_i^t$. Analyzing Figure 9, we can obtain three conclusions as follows:
(1)
Comparing $x_i^t$ and $r_i^t$, MTTM is able to highlight the target regions for enhancing target features through the attention mechanism, regardless of whether D-ConvLSTM is used or not. This verifies that the Transformer-based structure in MTTU-Net is effective.
(2)
Comparing $r_i^t$ at different levels, the higher the level, the better the target feature enhancement. This proves that a larger receptive field is beneficial for extracting the spatial context information to recognize small targets, and it also verifies that deep semantic information is crucial for ISTD.
(3)
Comparing $r_i^t$ with and without D-ConvLSTM, we find that utilizing the spatio-temporal features can effectively suppress background interference to reduce false alarms, as shown in Figure 9(1). Additionally, it greatly enhances the target features to avoid misdetection; for instance, the lower left target shown in Figure 9(2) can be accurately located even when the target is almost immersed in the background due to thermal crossover.
To further verify the effectiveness of MTTM, for the two examples in Figure 9, we observe the changes in their 4th-level feature $r_4^t$ by constraining MTTU-Net to run in different time ranges, as shown in Figure 10. In a small time range, MTTU-Net cannot obtain enough spatio-temporal information to detect targets, resulting in false alarms and missed detections. As the time range increases, MTTU-Net can obtain more spatio-temporal information, which not only effectively suppresses background noise and reduces false alarms but also further enhances the features of real targets. Benefiting from the memory-based scheme, MTTU-Net can adaptively adjust the required amount of temporal information. As shown in Figure 10, the first scenario utilizes the temporal information of 11 frames to completely eliminate the false alarm, while the second scenario requires only 6 frames.
In contrast, existing MISTD algorithms usually adopt the time window paradigm, where a fixed number of consecutive frames, such as five frames, are simultaneously fed into the model. Thus, they can only extract spatio-temporal features from short sequences and cannot perform adaptive adjustment, which is not the optimal way to utilize spatio-temporal information. Figure 11 illustrates the performance changes of MTTU-Net in different time ranges. When the length of the time range exceeds five frames, the performance curves still maintain an increasing trend, which verifies the necessity of extracting spatio-temporal features from long sequences. When the length of the time range reaches 11 frames (from $t-10$ to t) on IRDST and 8 frames (from $t-7$ to t) on IDSMT, the growth of the performance curves nearly stagnates. This indicates that our MTTU-Net can utilize more than ten frames of temporal information through the implicit memory mechanism, while adaptively adjusting the amount of temporal information used for different scenarios.

5.5. Core Hyper-Parameter Analysis

As shown in Table 7, we evaluate the impact of hyper-parameters on the performance of MTTU-Net, including the number N of MTTMs and the number c of channels in MTTM. When N is set to 2, namely, the total number of TCTMs and TSTMs is 4, MTTU-Net has the best overall performance on both datasets. When $N = 3$, F1 on IDSMT is optimal, but it requires more parameters and computation. In addition, as c increases, MTTU-Net can handle more complex scenarios, so its performance gradually improves. When $c = 128$, the performance of MTTU-Net reaches its peak, but when $c = 192$, its performance degrades due to overfitting. Overall, the optimal values of N and c are 2 and 128, respectively.
During training, we provide m consecutive frames in a batch of samples. MTTU-Net processes these frames sequentially and calculates the output loss for every frame. The value of m affects the ability of MTTU-Net to exploit spatio-temporal information, and Table 8 shows the effect of the training hyper-parameter m on algorithm performance. As m increases, the performance of MTTU-Net gradually improves, which indicates that the model is better able to learn how to extract and utilize spatio-temporal features from long sequences. When m is set to 5, the algorithm performance reaches a relatively satisfactory level. Considering that larger m values require more memory, we set $m = 5$. Note that training the model using five consecutive frames does not mean that it can only exploit the spatio-temporal information from five frames during inference.

6. Conclusions

This paper proposed a memory-based temporal Transformer U-Net (MTTU-Net) for MISTD. It addresses two problems: breaking through the limitation of the time window paradigm and dealing with targets and backgrounds in various motion states. In detail, we propose a memory-based temporal Transformer module (MTTM), which implements interactive fusion between the multi-level spatial features of the to-be-detected frame and the spatio-temporal features in the memory in the channel and space dimensions, respectively. Comparison experiments on IRDST and IDSMT demonstrate the superiority of our MTTU-Net over existing MISTD algorithms. Moreover, ablation studies further verify the effectiveness and merits of all elaborately designed components, including TCTM, TSTM, and D-ConvLSTM. In future work, we will combine implicit and explicit memory mechanisms to explore the impact of temporal information from longer sequences on the MISTD task.

Author Contributions

Methodology, Z.F. and W.Z.; software, Z.F. and D.L.; validation, A.S. and D.L.; investigation, W.Z., X.T., and Y.Y.; writing, Z.F. and W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

The research was supported by the National Natural Science Foundation of China (12202485).

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Correction Statement

This article has been republished with a minor correction to the Funding statement. This change does not affect the scientific content of the article.

References

  1. Zhao, M.; Li, W.; Li, L.; Hu, J.; Ma, P.; Tao, R. Single-frame infrared small-target detection: A survey. IEEE Geosci. Remote Sens. Mag. 2022, 10, 87–119. [Google Scholar] [CrossRef]
  2. Kou, R.; Wang, C.; Peng, Z.; Zhao, Z.; Chen, Y.; Han, J.; Huang, F.; Yu, Y.; Fu, Q. Infrared small target segmentation networks: A survey. Pattern Recognit. 2023, 143, 109788. [Google Scholar] [CrossRef]
  3. Wei, Y.; You, X.; Li, H. Multiscale patch-based contrast measure for small infrared target detection. Pattern Recognit. 2016, 58, 216–226. [Google Scholar] [CrossRef]
  4. Han, J.; Liang, K.; Zhou, B.; Zhu, X.; Zhao, J.; Zhao, L. Infrared Small Target Detection Utilizing the Multiscale Relative Local Contrast Measure. IEEE Geosci. Remote Sens. Lett. 2018, 15, 612–616. [Google Scholar] [CrossRef]
  5. Zhao, B.; Xiao, S.; Lu, H.; Wu, D. Spatial-temporal local contrast for moving point target detection in space-based infrared imaging system. Infrared Phys. Technol. 2018, 95, 53–60. [Google Scholar] [CrossRef]
  6. Gao, C.; Meng, D.; Yang, Y.; Wang, Y.; Zhou, X.; Hauptmann, A.G. Infrared patch-image model for small target detection in a single image. IEEE Trans. Image Process. 2013, 22, 4996–5009. [Google Scholar] [CrossRef]
  7. Liu, T.; Yang, J.; Li, B.; Wang, Y.; An, W. Infrared small target detection via nonconvex tensor Tucker decomposition with factor prior. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5617317. [Google Scholar] [CrossRef]
  8. Yan, S.; Chen, R.; Sang, H.; Zhou, Y.; Long, J.; Cai, N.; Xu, S.; Chen, J. Multihop anchor-free network with tolerance-adjustable measure for infrared tiny target detection. IEEE Trans. Instrum. Meas. 2025, 74, 5502011. [Google Scholar] [CrossRef]
  9. Wu, T.; Li, B.; Luo, Y.; Wang, Y.; Xiao, C.; Liu, T.; Yang, J.; An, W.; Guo, Y. MTU-Net: Multilevel TransUNet for space-based infrared tiny ship detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5601015. [Google Scholar] [CrossRef]
  10. Tong, X.; Zuo, Z.; Su, S.; Wei, J.; Sun, X.; Wu, P.; Zhao, Z. ST-Trans: Spatial-temporal Transformer for infrared small target detection in sequential images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5001819. [Google Scholar] [CrossRef]
  11. Li, B.; Xiao, C.; Wang, L.; Wang, Y.; Lin, Z.; Li, M.; An, W.; Guo, Y. Dense nested attention network for infrared small target detection. IEEE Trans. Image Process. 2023, 32, 1745–1758. [Google Scholar] [CrossRef]
  12. Xiao, X.; Lian, S.; Luo, Z.; Li, S. Infrared Small Target Detection Using Directional Derivative Correlation Filtering and a Relative Intensity Contrast Measure. Remote Sens. 2023, 61, 1921. [Google Scholar] [CrossRef]
  13. Chen, S.; Ji, L.; Zhu, J.; Ye, M.; Yao, X. SSTNet: Sliced spatio-temporal network with cross-slice ConvLSTM for moving infrared dim-small target detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5000912. [Google Scholar] [CrossRef]
  14. Li, N.; Yang, X.; Zhao, H. DBMSTN: A dual branch multiscale spatio-temporal network for dim-small target detection in infrared image. Pattern Recognit. 2025, 162, 111372. [Google Scholar] [CrossRef]
  15. Zhou, F.; Fu, M.; Qian, Y.; Yang, J.; Da, Y. Sparse prior is not all you need: When differential directionality meets saliency coherence for infrared small target detection. IEEE Trans. Instrum. Meas. 2024, 73, 5039818. [Google Scholar] [CrossRef]
  16. Ma, T.; Wang, H.; Liang, J.; Wang, Y.; Peng, J.; Kai, Z.; Liu, X. Temporal-spatial information fusion network for multiframe infrared small target detection. IEEE Trans. Instrum. Meas. 2025, 74, 4505219. [Google Scholar] [CrossRef]
  17. Li, R.; An, W.; Xiao, C.; Li, B.; Wang, Y.; Li, M.; Guo, Y. Direction-coded temporal U-shape module for multiframe infrared small target detection. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 555–568. [Google Scholar] [CrossRef]
  18. Du, J.; Lu, H.; Zhang, L.; Hu, M.; Chen, S.; Deng, Y. A spatial-temporal feature-based detection framework for infrared dim small target. IEEE Trans. Geosci. Remote Sens. 2022, 60, 3000412. [Google Scholar] [CrossRef]
  19. Yan, P.; Hou, R.; Duan, X.; Yue, C.; Wang, X.; Cao, X. STDMANet: Spatio-temporal differential multiscale attention network for small moving infrared target detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5602516. [Google Scholar] [CrossRef]
  20. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.; Woo, W. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. In Proceedings of the Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015. [Google Scholar]
  21. Oh, S.; Lee, J.; Xu, N.; Kim, S. Video object segmentation using space-time memory networks. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9225–9234. [Google Scholar]
  22. Bertasius, G.; Wang, H.; Torresani, L. Is space-time attention all you need for video understanding? In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, Virtual, 18–24 July 2021. [Google Scholar]
  23. Shi, S.; Gu, J.; Xie, L.; Wang, X.; Yang, Y.; Dong, C. Rethinking alignment in video super-resolution transformers. In Proceedings of the 36th Conference on Neural Information Processing Systems, NeurIPS 2022, New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 36081–36093. [Google Scholar]
  24. Bai, X.; Zhou, F. Analysis of new top-hat transformation and the application for infrared dim small target detection. Pattern Recognit. 2010, 43, 2145–2156. [Google Scholar] [CrossRef]
  25. Qin, Y.; Bruzzone, L.; Gao, C.; Li, B. Infrared small target detection based on facet kernel and random walker. IEEE Trans. Geosci. Remote Sens. 2019, 57, 7104–7118. [Google Scholar] [CrossRef]
  26. Dai, Y.; Wu, Y. Reweighted infrared patch-tensor model with both nonlocal and local priors for single-frame small target detection. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2017, 10, 3752–3767. [Google Scholar] [CrossRef]
  27. Hou, Q.; Zhang, L.; Tan, F.; Xi, Y.; Zheng, H.; Li, N. ISTDU-Net: Infrared small-target detection U-Net. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  28. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  29. Wu, X.; Hong, D.; Chanussot, J. UIU-Net: U-net in u-net for infrared small object detection. IEEE Trans. Image Process. 2022, 32, 364–376. [Google Scholar] [CrossRef]
  30. Liu, Q.; Liu, R.; Zheng, B.; Wang, H.; Fu, Y. Infrared small target detection with scale and location sensitivity. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–18 June 2024; pp. 17490–17499. [Google Scholar]
  31. Yuan, S.; Qin, H.; Yan, X.; Akhtar, N.; Mian, A. SCTransNet: Spatial-channel cross Transformer network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5002615. [Google Scholar] [CrossRef]
  32. Yuan, S.; Qin, H.; Yan, X.; Akhtar, N.; Mian, A. Spatial-temporal tensor representation learning with priors for infrared small target detection. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 9598–9620. [Google Scholar] [CrossRef]
  33. Luo, Y.; Li, X.; Chen, S. Feedback spatial-temporal infrared small target detection based on orthogonal subspace projection. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5001919. [Google Scholar] [CrossRef]
  34. Liu, X.; Li, X.; Li, L.; Su, X.; Chen, F. Dim and small target detection in multi-frame sequence using bi-Conv-LSTM and 3D-conv structure. IEEE Access 2021, 9, 135845–135855. [Google Scholar] [CrossRef]
  35. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  36. Cui, Y.; Song, T.; Wu, G.; Wang, L. A real-time and lightweight method for tiny airborne object detection. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  37. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
  38. Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video Swin Transformer. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 3192–3201. [Google Scholar]
  39. Duan, W.; Ji, L.; Chen, S.; Zhu, S.; Ye, M. Triple-domain feature learning with frequency-aware memory enhancement for moving infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5006014. [Google Scholar] [CrossRef]
  40. Huang, Y.; Zhi, X.; Hu, J.; Yu, L.; Han, Q.; Chen, W. LMAFormer: Local motion aware Transformer for small moving infrared target detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5008117. [Google Scholar] [CrossRef]
  41. Sukhbaatar, S.; Szlam, A.; Weston, J.; Fergus, R. End-To-End Memory Networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28. [Google Scholar]
  42. Graves, A.; Wayne, G.; Danihelka, I. Neural turing machines. arXiv 2014, arXiv:1410.5401. [Google Scholar] [CrossRef]
  43. Fu, Z.; Liu, Q.; Fu, Z.; Wang, Y. STMTrack: Template-free visual tracking with space-time memory networks. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 13774–13783. [Google Scholar]
  44. Tang, S.; Li, C.; Zhang, P.; Tang, R. SwinLSTM: Improving spatiotemporal prediction accuracy using Swin Transformer and LSTM. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 13470–13479. [Google Scholar]
  45. Xie, H.; Yao, H.; Zhou, S.; Zhou, S.; Sun, W. Efficient regional memory network for video object segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 1286–1295. [Google Scholar]
  46. Wang, H.; Cao, P.; Wang, J.; Zaiane, O.R. UCTransnet: Rethinking the skip connections in u-net from a channel-wise perspective with transformer. In Proceedings of the AAAI-22 Technical Tracks 3, Virtual, 22 February–1 March 2022; Volume 36, pp. 2441–2449. [Google Scholar]
  47. Sun, H.; Bai, J.; Yang, F.; Bai, X. Receptive-field and direction induced attention network for infrared dim small target detection with a large-scale dataset IRDST. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13. [Google Scholar] [CrossRef]
  48. Zhu, Z.; Xu, M.; Bai, S.; Huang, T.; Bai, X. Asymmetric non-local neural networks for semantic segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  49. Wang, Y.; Gao, Z.; Long, M. PredRNN++: Towards A Resolution of the Deep-in-Time Dilemma in Spatiotemporal Predictive Learning. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 10–15 July 2018; pp. 4904–4913. [Google Scholar]
  50. Feng, Z.; Zhang, W.; Sun, X.; Guo, L.; Liu, D. A Semi-Synthetic Dataset of Infrared Dim Small Moving Targets for Detection and Segmentation. 2025. Available online: https://www.scidb.cn/en/detail?dataSetId=36901b64578d4384a9144f57194c866e (accessed on 1 October 2025).
  51. Xiao, X.; Lian, S.; Luo, Z.; Li, S. Weighted res-unet for high-quality retina vessel segmentation. In Proceedings of the 2018 9th International Conference on Information Technology in Medicine and Education (ITME), Hangzhou, China, 19–21 October 2018; pp. 327–331. [Google Scholar]
Figure 1. Comparison of existing MISTD frameworks and our proposed memory-based MISTD framework. (a) Target-based MISTD framework with time window. (b) Segmentation-based MISTD framework with time window. (c) Our segmentation-based MISTD framework with the memory (M).
Figure 2. Overview of the proposed MTTU-Net for multi-frame infrared small target detection. Our MTTU-Net adopts a U-shape structure and adds a memory-based temporal Transformer module (MTTM), which consists of a temporal channel-cross Transformer module (TCTM) and a temporal space-cross Transformer module (TSTM).
Figure 3. The structures of the temporal channel-cross Transformer module (TCTM) and temporal space-cross Transformer module (TSTM). The two modules are arranged in an alternate mode when multiple MTTMs are used.
Figure 4. Visualization of feature maps r_t with or without TCTM and with or without TSTM. (a,b) from IRDST [47].
Figure 5. Pyramid pooling for compressing the space dimension in TSTM.
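As an illustration of the pyramid-pooling idea named in Figure 5, the short PyTorch sketch below compresses a feature map into a small set of multi-scale tokens before attention. It is an indicative example only; the pooling output sizes (1, 3, 6, 8) and the function name are our assumptions, not the settings used in TSTM.

```python
# Sketch: shrink the key/value token count via multi-scale adaptive pooling.
import torch
import torch.nn.functional as F

def pyramid_pool_tokens(feat: torch.Tensor, sizes=(1, 3, 6, 8)) -> torch.Tensor:
    # feat: (B, C, H, W) feature map; returns (B, N, C) with N = sum(s*s) tokens
    pooled = [F.adaptive_avg_pool2d(feat, s).flatten(2) for s in sizes]  # each (B, C, s*s)
    return torch.cat(pooled, dim=2).transpose(1, 2)  # (B, N, C) compressed tokens
```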
Figure 6. Dual-output convolutional LSTM (D-ConvLSTM) for storing and updating the memory as well as outputting updated K and V.
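A rough PyTorch sketch of a dual-output ConvLSTM cell in the spirit of Figure 6 follows, with the two output gates driven by [I_t, K_{t-1}] and [I_t, V_{t-1}] as listed in Table 6. How the shared cell state is gated internally is an assumption on our part, and the layer names are hypothetical rather than the authors' exact formulation.

```python
# Sketch of a dual-output ConvLSTM cell: one shared cell state, two output gates
# producing the key memory K_t and the value memory V_t.
import torch
import torch.nn as nn

class DConvLSTMCell(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        p = kernel_size // 2
        # input, forget, and candidate gates from the current input and both memories (assumed)
        self.gates = nn.Conv2d(3 * channels, 3 * channels, kernel_size, padding=p)
        # two separate output gates, following the input forms in Table 6
        self.out_k = nn.Conv2d(2 * channels, channels, kernel_size, padding=p)
        self.out_v = nn.Conv2d(2 * channels, channels, kernel_size, padding=p)

    def forward(self, i_t, state):
        k_prev, v_prev, c_prev = state            # previous K, V, and cell state
        z = self.gates(torch.cat([i_t, k_prev, v_prev], dim=1))
        i_g, f_g, g = torch.chunk(z, 3, dim=1)
        c_t = torch.sigmoid(f_g) * c_prev + torch.sigmoid(i_g) * torch.tanh(g)
        o_k = torch.sigmoid(self.out_k(torch.cat([i_t, k_prev], dim=1)))
        o_v = torch.sigmoid(self.out_v(torch.cat([i_t, v_prev], dim=1)))
        k_t = o_k * torch.tanh(c_t)               # updated key memory
        v_t = o_v * torch.tanh(c_t)               # updated value memory
        return k_t, v_t, (k_t, v_t, c_t)
```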
Figure 7. P-R curves and ROC curves of different algorithms on IRDST and IDSMT.
Figure 8. Visual results obtained by 9 representative algorithms on IRDST and IDSMT. Circles in blue, yellow, and red represent correctly detected targets, misdetections, and false alarms, respectively.
Figure 9. Visualization maps of multi-level features r_t with or without D-ConvLSTM. x_t is the encoder's output feature. Circles in yellow and red represent misdetections and false alarms, respectively. (1) from IRDST and (2) from IDSMT.
Figure 10. Changes in the 4th-level feature r_4^t of the two test images when different time ranges are set. t denotes the current frame, and t−i→t denotes the time range from frame t−i to the current frame t. Circles in yellow and red represent misdetections and false alarms, respectively, while the circle in green indicates that the false alarm is completely eliminated.
Figure 11. The performance of MTTU-Net in different time ranges. t−i indicates that the detection result of frame t is obtained using the spatio-temporal information from frame t−i to frame t.
Table 1. Quantitative comparison results of different SOTA algorithms on IRDST and IDSMT. The best and second best results are highlighted in red and blue, respectively. # denotes the target-based MISTD algorithm.
Schemes | Tasks | Algorithms | IRDST: P (%) / R (%) / F1 (%) / IoU (%) / Fa (10^-6) | IDSMT: P (%) / R (%) / F1 (%) / IoU (%) / Fa (10^-6)
Model-driven | SISTD | Top-Hat [24] | 16.59 / 84.46 / 27.73 / 18.19 / 47.92 | 3.94 / 37.14 / 7.13 / 6.51 / 109.49
Model-driven | SISTD | FKRW [25] | 6.97 / 67.99 / 12.64 / 6.61 / 81.75 | 11.62 / 23.85 / 15.63 / 6.52 / 27.81
Model-driven | SISTD | MPCM [3] | 1.10 / 48.33 / 2.15 / 1.70 / 564.09 | 0.23 / 50.44 / 0.47 / 0.33 / 4830.45
Model-driven | SISTD | RIPT [26] | 1.88 / 74.99 / 3.68 / 1.22 / 1596.37 | 0.64 / 44.46 / 1.27 / 0.46 / 4023.38
Model-driven | MISTD | NFTDGSTV [7] | 0.22 / 5.98 / 0.43 / 0.08 / 1063.39 | 0.38 / 34.10 / 0.75 / 0.56 / 1409.45
Model-driven | MISTD | STRL-LBCM [32] | 33.56 / 7.32 / 12.02 / 1.92 / 2.24 | 11.32 / 19.67 / 14.37 / 4.83 / 28.15
Data-driven | SISTD | ISTDU-Net [27] | 86.24 / 95.21 / 90.50 / 57.00 / 3.54 | 96.59 / 86.36 / 91.19 / 67.74 / 0.60
Data-driven | SISTD | DNANet [11] | 90.94 / 94.47 / 92.67 / 60.49 / 1.91 | 93.91 / 83.75 / 88.54 / 69.03 / 1.35
Data-driven | SISTD | RDIAN [47] | 68.01 / 95.62 / 79.48 / 46.39 / 11.22 | 81.82 / 85.88 / 83.80 / 63.08 / 3.89
Data-driven | SISTD | MSHNet [30] | 92.00 / 94.06 / 93.02 / 60.64 / 1.28 | 94.22 / 81.01 / 87.11 / 64.18 / 1.39
Data-driven | SISTD | SCTransNet [31] | 87.67 / 94.95 / 91.16 / 58.65 / 3.09 | 95.16 / 90.50 / 92.77 / 73.59 / 0.86
Data-driven | MISTD | TAD # [36] | 83.20 / 85.48 / 84.32 / - / - | 89.85 / 64.28 / 74.94 / - / -
Data-driven | MISTD | SSTNet # [13] | 94.01 / 90.40 / 92.17 / - / - | 93.26 / 58.76 / 72.10 / - / -
Data-driven | MISTD | Tridos # [39] | 90.87 / 95.44 / 93.10 / - / - | 92.13 / 73.23 / 81.60 / - / -
Data-driven | MISTD | DTUM [17] | 84.86 / 88.17 / 86.48 / 51.12 / 4.28 | 71.63 / 90.33 / 79.90 / 69.92 / 4.68
Data-driven | MISTD | MTTU-Net (Ours) | 93.78 / 94.82 / 94.30 / 60.71 / 1.02 | 97.57 / 93.49 / 95.48 / 76.52 / 0.41
Table 2. Ablation study of different components in MTTU-Net on IRDST and IDSMT.
ResU-Net | CCA | DS | MTTM (TCTM / TSTM) | IRDST F1 (%)/IoU (%) | IDSMT F1 (%)/IoU (%)
90.68/55.27 | 89.26/66.48
91.43/56.08 | 90.05/67.67
91.72/56.54 | 91.85/69.36
92.46/58.13 | 95.10/76.03
92.03/57.61 | 93.13/74.09
94.30/60.71 | 95.48/76.52
Table 3. F1(%)/IoU(%) values achieved by arranging TCTM (C) and TSTM (S) in different ways.
Permutations | CCSS | SSCC | SCSC | CSCS
IRDST | 92.93/59.25 | 92.80/59.71 | 94.28/60.53 | 94.30/60.71
IDSMT | 94.06/75.01 | 94.50/75.46 | 95.91/76.31 | 95.48/76.52
Table 4. Ablation study of CFN and D-ConvLSTM in TCTM.
CFN | D-ConvLSTM | IRDST F1 (%)/IoU (%) | IDSMT F1 (%)/IoU (%) | Params (M) ↓ | Flops (G) ↓ | Speed (FPS) ↑
93.16/58.08 | 93.33/74.52 | 12.80 | 122.73 | 15.52
93.41/58.54 | 93.83/74.69 | 14.27 | 126.30 | 15.35
94.15/60.29 | 95.25/76.11 | 18.59 | 132.84 | 15.23
94.30/60.71 | 95.48/76.52 | 20.06 | 141.85 | 15.07
Table 5. Ablation study of BC, CFN, D-ConvLSTM, and PP in TSTM.
BC | CFN | D-ConvLSTM | PP | IRDST F1 (%)/IoU (%) | IDSMT F1 (%)/IoU (%) | Params (M) ↓ | Flops (G) ↓ | Speed (FPS) ↑
91.90/57.96 | 94.20/74.18 | 19.59 | 162.81 | 7.68
92.44/58.36 | 94.64/74.60 | 19.70 | 163.96 | 7.61
94.21/60.50 | 95.85/76.03 | 20.06 | 168.33 | 7.57
94.30/60.71 | 95.48/76.52 | 20.06 | 141.85 | 15.07
93.88/59.94 | 95.26/75.94 | 32.77 | 168.36 | 14.67
Table 6. F1/IoU values achieved by using different input forms for output gates in D-ConvLSTM.
Input form for o_K^t | Input form for o_V^t | IRDST F1 (%)/IoU (%) | IDSMT F1 (%)/IoU (%)
K_{t-1} | [V_{t-1}, V_{t-1}] | 92.55/58.41 | 94.13/74.09
[I_t, K_{t-1}] | [I_t, V_{t-1}] | 94.30/60.71 | 95.48/76.52
[I_t, K_{t-1}, V_{t-1}] | [I_t, K_{t-1}, V_{t-1}] | 93.84/59.86 | 94.96/75.66
Table 7. Hyper-parameter study of the number of MTTMs and the number of channels in MTTM.
Hyper-param | IRDST F1 (%) / IoU (%) | IDSMT F1 (%) / IoU (%) | Params (M) ↓ | Flops (G) ↓ | Speed (FPS) ↑
The number of MTTMs
N = 1 | 93.84 / 59.82 | 94.94 / 75.55 | 12.83 | 116.77 | 16.86
N = 2 | 94.30 / 60.71 | 95.48 / 76.52 | 20.06 | 141.85 | 15.07
N = 3 | 94.22 / 60.26 | 95.94 / 75.95 | 27.30 | 166.94 | 13.80
N = 4 | 93.91 / 59.94 | 95.32 / 76.01 | 34.53 | 192.03 | 12.63
The number of channels in MTTM
c = 32 | 93.06 / 59.36 | 94.66 / 75.07 | 5.02 | 88.22 | 16.60
c = 64 | 93.65 / 59.90 | 94.99 / 75.89 | 8.25 | 100.57 | 16.24
c = 128 | 94.30 / 60.71 | 95.48 / 76.52 | 20.06 | 141.85 | 15.07
c = 192 | 94.15 / 60.31 | 95.01 / 76.16 | 39.02 | 205.23 | 14.17
Table 8. F1(%)/IoU(%) values achieved using different training hyper-parameter m values.
m | 3 | 4 | 5 | 6
IRDST | 93.78/60.12 | 94.02/60.34 | 94.30/60.71 | 94.22/60.79
IDSMT | 93.95/74.40 | 94.72/75.33 | 95.48/76.52 | 96.03/76.42
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
