Memory-Based Temporal Transformer U-Net for Multi-Frame Infrared Small Target Detection
Highlights
- We propose a memory-based temporal Transformer U-Net (MTTU-Net) for multi-frame infrared small target detection, which employs a memory mechanism to adaptively extract spatio-temporal features from long sequences, thereby achieving better detection performance.
- MTTU-Net adopts a Transformer-based spatio-temporal feature interactive fusion approach that can effectively handle targets and backgrounds in various motion states.
- Our proposed MTTU-Net overcomes the limitations imposed by the time window paradigm, which restricts spatio-temporal feature extraction in existing algorithms.
- It also relieves the dependency on frame alignment that is typically required for target enhancement and background suppression, thereby adapting to more complex motion scenarios.
Abstract
1. Introduction
- (1) We propose MTTU-Net, a memory-based temporal Transformer U-Net for MISTD, which uses the proposed D-ConvLSTM to store and update temporal information in memory. This overcomes the limitations of the time window and allows adequate spatio-temporal features to be extracted adaptively from long sequences (more than 10 frames), improving detection performance.
- (2) We propose MTTM, which adopts a Transformer-based spatio-temporal feature interactive fusion approach dominated by the spatial features of the current frame and supplemented by the temporal features in memory, enabling it to handle targets and backgrounds in various motion states effectively.
- (3) In MTTM, we present TCTM and TSTM, which achieve target feature enhancement and global background perception through feature cross fusion in the channel and space dimensions, thereby reducing misdetection and false alarms, respectively.
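The interactive-fusion idea in contribution (2) can be sketched in NumPy as plain cross-attention: queries come from the current frame's spatial features, while keys and values come from temporal features held in memory, so the current frame dominates and the memory supplements. The shapes, projection matrices, and function name below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_cross_attention(spatial_tokens, memory_tokens, d):
    """Fuse current-frame spatial tokens (queries) with memory
    temporal tokens (keys/values) via scaled dot-product attention.
    spatial_tokens: (N, d), memory_tokens: (M, d)."""
    rng = np.random.default_rng(0)
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)
    Q = spatial_tokens @ Wq          # queries: the current frame dominates
    K = memory_tokens @ Wk           # keys: temporal memory supplements
    V = memory_tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))
    # residual connection keeps the output anchored to the current frame
    return spatial_tokens + attn @ V

x = np.random.default_rng(1).standard_normal((16, 32))   # 16 spatial tokens
m = np.random.default_rng(2).standard_normal((8, 32))    # 8 memory tokens
out = memory_cross_attention(x, m, 32)
print(out.shape)                     # one fused token per spatial location
```

The residual form makes the memory strictly supplementary: if the attention output is small, the current frame's features pass through unchanged.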
2. Related Work
2.1. Single-Frame Infrared Small Target Detection
2.2. Multi-Frame Infrared Small Target Detection
2.3. Memory Mechanism
3. Materials and Methods
- (1) To break through the limitations of the time-window paradigm, D-ConvLSTM is proposed to adaptively adjust the amount of temporal information supplied for different scenarios, as described in Section 3.2.3.
- (2) To handle challenging scenarios such as slow-moving targets and fast-moving backgrounds, the proposed MTTM adopts a Transformer-based feature interactive fusion method, dominated by the spatial features of the current frame and supplemented by the temporal features in memory, as described in Section 3.2.
- (3) To reduce misdetection and false alarms, MTTM integrates two components: TCTM fuses features in the channel dimension to enhance target features, and TSTM fuses features in the space dimension for global background perception, as described in Section 3.2.1 and Section 3.2.2, respectively.
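The memory mechanism behind point (1) can be illustrated with a minimal LSTM-style cell. The sketch below simplifies D-ConvLSTM's spatial convolutions to per-pixel (1×1) linear maps and omits its dual-output design; all names and sizes are assumptions for illustration. The key property is that the state (h, c) persists across arbitrarily many frames, so no fixed-length time window is needed.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MemoryCell:
    """LSTM-style memory update with 1x1 'convolutions' (plain linear
    maps per pixel). The state (h, c) carries temporal information
    across the whole sequence instead of a fixed window of frames."""
    def __init__(self, d, seed=0):
        rng = np.random.default_rng(seed)
        # one weight matrix per gate, acting on the concatenation [x, h]
        self.W = {g: rng.standard_normal((2 * d, d)) * 0.1
                  for g in ("i", "f", "o", "g")}

    def step(self, x, h, c):
        z = np.concatenate([x, h], axis=-1)
        i = sigmoid(z @ self.W["i"])     # input gate
        f = sigmoid(z @ self.W["f"])     # forget gate: what memory to keep
        o = sigmoid(z @ self.W["o"])     # output gate
        g = np.tanh(z @ self.W["g"])     # candidate memory
        c = f * c + i * g                # update long-term memory
        h = o * np.tanh(c)               # short-term output
        return h, c

d = 8
cell = MemoryCell(d)
h = np.zeros((4, d)); c = np.zeros((4, d))   # 4 pixels, d channels
rng = np.random.default_rng(1)
for t in range(20):                          # 20 frames: longer than any window
    h, c = cell.step(rng.standard_normal((4, d)), h, c)
print(h.shape)
```

Because the gates decide per step what to keep, the cell effectively adjusts how much past information flows into the current frame, which is the behavior the time-window paradigm cannot provide.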
3.1. Overall Pipeline
3.2. Memory-Based Temporal Transformer Module
3.2.1. Temporal Channel-Cross Transformer Module
3.2.2. Temporal Space-Cross Transformer Module
3.2.3. Dual-Output Convolutional LSTM
4. Results
4.1. Datasets
4.2. Evaluation Metrics
4.3. Implementation Details
4.4. Comparisons with Other Algorithms
4.4.1. Quantitative Comparison
4.4.2. Qualitative Comparison
5. Discussion
5.1. Effects of Different Components
5.2. Ablation Study in TCTM and TSTM
- (1) Similar to TCTM, both CFN and D-ConvLSTM in TSTM effectively improve algorithm performance. Note that, comparing Table 2 and Table 5, using TSTM without D-ConvLSTM (second row of Table 5) yields lower evaluation metrics than not using TSTM at all (fourth row of Table 2). This indicates that TSTM benefits our MTTU-Net only when it uses spatio-temporal features; with spatial features alone, the network becomes less robust against background clutter.
- (2) Pyramid pooling adds no model parameters, yet it reduces the computation of cross-attention in the space dimension by about 26 G, raising the running speed from 7.57 fps to 15.07 fps without performance degradation.
- (3) Compared with channel concatenation, batch concatenation reduces the channel number of the input tokens, so the parameters and computation of CFN and D-ConvLSTM in TSTM are significantly smaller than those in TCTM. In addition, batch concatenation maintains the channel independence of multi-level features, avoiding confusion in the cross-attention and enabling our MTTU-Net to achieve higher performance.
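The pyramid-pooling speed-up discussed above can be made concrete with a rough sketch: pooling the (H × W) key/value map to a small pyramid of tokens shrinks the attention score matrix from (HW × HW) to (HW × N_pool). The pyramid sizes and feature shapes below are assumptions chosen for illustration, not the paper's configuration.

```python
import numpy as np

def pool_tokens(feat, sizes=(1, 2, 3, 6)):
    """Average-pool an (H, W, C) feature map to each pyramid grid size
    and flatten, mimicking pyramid pooling of attention keys/values."""
    H, W, C = feat.shape
    tokens = []
    for s in sizes:
        for i in range(s):
            for j in range(s):
                patch = feat[i * H // s:(i + 1) * H // s,
                             j * W // s:(j + 1) * W // s]
                tokens.append(patch.mean(axis=(0, 1)))
    return np.stack(tokens)              # 1 + 4 + 9 + 36 = 50 tokens

H = W = 64; C = 32
feat = np.random.default_rng(0).standard_normal((H, W, C))
kv = pool_tokens(feat)
n_q, n_kv = H * W, kv.shape[0]
# the score matrix shrinks from (HW x HW) to (HW x 50)
full_cost = n_q * n_q * C
pooled_cost = n_q * n_kv * C
print(n_kv, full_cost // pooled_cost)    # pooled token count, cost reduction factor
```

Because pooling is parameter-free, the model size is unchanged while the quadratic term in the attention cost collapses to a small constant, consistent with the large FPS gain reported in Table 5.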
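The batch-versus-channel concatenation argument can also be checked with a short shape sketch, assuming L feature levels of identical shape: channel concatenation widens each token to L·C channels (so any per-channel projection grows quadratically in L), while batch concatenation keeps C fixed and the levels separate. All sizes here are illustrative.

```python
import numpy as np

L, B, C, H, W = 4, 2, 32, 16, 16                   # 4 feature levels (assumed)
feats = [np.zeros((B, C, H, W)) for _ in range(L)]

channel_cat = np.concatenate(feats, axis=1)        # (B, L*C, H, W)
batch_cat = np.concatenate(feats, axis=0)          # (L*B, C, H, W)
print(channel_cat.shape, batch_cat.shape)

# A per-token linear projection (d_in -> d_in) must match the channel
# width, so channel concatenation inflates its parameter count L^2-fold.
params_channel = (L * C) ** 2
params_batch = C ** 2
print(params_channel // params_batch)
```

Batch concatenation also never mixes channels of different levels inside one token, which is the "channel independence" property credited with avoiding confusion in the cross-attention.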
5.3. Effects of Different Input Forms in D-ConvLSTM
5.4. How MTTM Works
- (1) Comparing the visualized feature maps, MTTM is able to highlight the target regions and enhance target features through the attention mechanism, regardless of whether D-ConvLSTM is used. This verifies that the Transformer-based structure in MTTU-Net is effective.
- (2) Comparing the feature maps at different levels shows that the higher the level, the better the target feature enhancement. This proves that a larger receptive field is beneficial for extracting the spatial context needed to recognize small targets, and it also verifies that deep semantic information is crucial for ISTD.
- (3) Comparing the results with and without D-ConvLSTM demonstrates that utilizing spatio-temporal features effectively suppresses background interference and reduces false alarms, as shown in Figure 9(1). It also greatly enhances target features to avoid misdetection; for instance, the lower-left target in Figure 9(2) is accurately located even though it is almost immersed in the background due to thermal crossover.
5.5. Core Hyper-Parameter Analysis
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Correction Statement
References
- Zhao, M.; Li, W.; Li, L.; Hu, J.; Ma, P.; Tao, R. Single-frame infrared small-target detection: A survey. IEEE Geosci. Remote Sens. Mag. 2022, 10, 87–119.
- Kou, R.; Wang, C.; Peng, Z.; Zhao, Z.; Chen, Y.; Han, J.; Huang, F.; Yu, Y.; Fu, Q. Infrared small target segmentation networks: A survey. Pattern Recognit. 2023, 143, 109788.
- Wei, Y.; You, X.; Li, H. Multiscale patch-based contrast measure for small infrared target detection. Pattern Recognit. 2016, 58, 216–226.
- Han, J.; Liang, K.; Zhou, B.; Zhu, X.; Zhao, J.; Zhao, L. Infrared small target detection utilizing the multiscale relative local contrast measure. IEEE Geosci. Remote Sens. Lett. 2018, 15, 612–616.
- Zhao, B.; Xiao, S.; Lu, H.; Wu, D. Spatial-temporal local contrast for moving point target detection in space-based infrared imaging system. Infrared Phys. Technol. 2018, 95, 53–60.
- Gao, C.; Meng, D.; Yang, Y.; Wang, Y.; Zhou, X.; Hauptmann, A.G. Infrared patch-image model for small target detection in a single image. IEEE Trans. Image Process. 2013, 22, 4996–5009.
- Liu, T.; Yang, J.; Li, B.; Wang, Y.; An, W. Infrared small target detection via nonconvex tensor Tucker decomposition with factor prior. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5617317.
- Yan, S.; Chen, R.; Sang, H.; Zhou, Y.; Long, J.; Cai, N.; Xu, S.; Chen, J. Multihop anchor-free network with tolerance-adjustable measure for infrared tiny target detection. IEEE Trans. Instrum. Meas. 2025, 74, 5502011.
- Wu, T.; Li, B.; Luo, Y.; Wang, Y.; Xiao, C.; Liu, T.; Yang, J.; An, W.; Guo, Y. MTU-Net: Multilevel TransUNet for space-based infrared tiny ship detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5601015.
- Tong, X.; Zuo, Z.; Su, S.; Wei, J.; Sun, X.; Wu, P.; Zhao, Z. ST-Trans: Spatial-temporal Transformer for infrared small target detection in sequential images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5001819.
- Li, B.; Xiao, C.; Wang, L.; Wang, Y.; Lin, Z.; Li, M.; An, W.; Guo, Y. Dense nested attention network for infrared small target detection. IEEE Trans. Image Process. 2023, 32, 1745–1758.
- Xiao, X.; Lian, S.; Luo, Z.; Li, S. Infrared small target detection using directional derivative correlation filtering and a relative intensity contrast measure. Remote Sens. 2023, 61, 1921.
- Chen, S.; Ji, L.; Zhu, J.; Ye, M.; Yao, X. SSTNet: Sliced spatio-temporal network with cross-slice ConvLSTM for moving infrared dim-small target detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5000912.
- Li, N.; Yang, X.; Zhao, H. DBMSTN: A dual branch multiscale spatio-temporal network for dim-small target detection in infrared image. Pattern Recognit. 2025, 162, 111372.
- Zhou, F.; Fu, M.; Qian, Y.; Yang, J.; Da, Y. Sparse prior is not all you need: When differential directionality meets saliency coherence for infrared small target detection. IEEE Trans. Instrum. Meas. 2024, 73, 5039818.
- Ma, T.; Wang, H.; Liang, J.; Wang, Y.; Peng, J.; Kai, Z.; Liu, X. Temporal-spatial information fusion network for multiframe infrared small target detection. IEEE Trans. Instrum. Meas. 2025, 74, 4505219.
- Li, R.; An, W.; Xiao, C.; Li, B.; Wang, Y.; Li, M.; Guo, Y. Direction-coded temporal U-shape module for multiframe infrared small target detection. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 555–568.
- Du, J.; Lu, H.; Zhang, L.; Hu, M.; Chen, S.; Deng, Y. A spatial-temporal feature-based detection framework for infrared dim small target. IEEE Trans. Geosci. Remote Sens. 2022, 60, 3000412.
- Yan, P.; Hou, R.; Duan, X.; Yue, C.; Wang, X.; Cao, X. STDMANet: Spatio-temporal differential multiscale attention network for small moving infrared target detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5602516.
- Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.; Woo, W. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Proceedings of the Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015.
- Oh, S.; Lee, J.; Xu, N.; Kim, S. Video object segmentation using space-time memory networks. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9225–9234.
- Bertasius, G.; Wang, H.; Torresani, L. Is space-time attention all you need for video understanding? In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, Virtual, 18–24 July 2021.
- Shi, S.; Gu, J.; Xie, L.; Wang, X.; Yang, Y.; Dong, C. Rethinking alignment in video super-resolution transformers. In Proceedings of the 36th Conference on Neural Information Processing Systems, NeurIPS 2022, New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 36081–36093.
- Bai, X.; Zhou, F. Analysis of new top-hat transformation and the application for infrared dim small target detection. Pattern Recognit. 2010, 43, 2145–2156.
- Qin, Y.; Bruzzone, L.; Gao, C.; Li, B. Infrared small target detection based on facet kernel and random walker. IEEE Trans. Geosci. Remote Sens. 2019, 57, 7104–7118.
- Dai, Y.; Wu, Y. Reweighted infrared patch-tensor model with both nonlocal and local priors for single-frame small target detection. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2017, 10, 3752–3767.
- Hou, Q.; Zhang, L.; Tan, F.; Xi, Y.; Zheng, H.; Li, N. ISTDU-Net: Infrared small-target detection U-Net. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Wu, X.; Hong, D.; Chanussot, J. UIU-Net: U-Net in U-Net for infrared small object detection. IEEE Trans. Image Process. 2022, 32, 364–376.
- Liu, Q.; Liu, R.; Zheng, B.; Wang, H.; Fu, Y. Infrared small target detection with scale and location sensitivity. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–18 June 2024; pp. 17490–17499.
- Yuan, S.; Qin, H.; Yan, X.; Akhtar, N.; Mian, A. SCTransNet: Spatial-channel cross Transformer network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5002615.
- Yuan, S.; Qin, H.; Yan, X.; Akhtar, N.; Mian, A. Spatial-temporal tensor representation learning with priors for infrared small target detection. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 9598–9620.
- Luo, Y.; Li, X.; Chen, S. Feedback spatial-temporal infrared small target detection based on orthogonal subspace projection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5001919.
- Liu, X.; Li, X.; Li, L.; Su, X.; Chen, F. Dim and small target detection in multi-frame sequence using bi-Conv-LSTM and 3D-conv structure. IEEE Access 2021, 9, 135845–135855.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
- Cui, Y.; Song, T.; Wu, G.; Wang, L. A real-time and lightweight method for tiny airborne object detection. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 17–24 June 2023.
- Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430.
- Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video Swin Transformer. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 3192–3201.
- Duan, W.; Ji, L.; Chen, S.; Zhu, S.; Ye, M. Triple-domain feature learning with frequency-aware memory enhancement for moving infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5006014.
- Huang, Y.; Zhi, X.; Hu, J.; Yu, L.; Han, Q.; Chen, W. LMAFormer: Local motion aware Transformer for small moving infrared target detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5008117.
- Sukhbaatar, S.; Szlam, A.; Weston, J.; Fergus, R. End-to-end memory networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28.
- Graves, A.; Wayne, G.; Danihelka, I. Neural Turing machines. arXiv 2014, arXiv:1410.5401.
- Fu, Z.; Liu, Q.; Fu, Z.; Wang, Y. STMTrack: Template-free visual tracking with space-time memory networks. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 13774–13783.
- Tang, S.; Li, C.; Zhang, P.; Tang, R. SwinLSTM: Improving spatiotemporal prediction accuracy using Swin Transformer and LSTM. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 13470–13479.
- Xie, H.; Yao, H.; Zhou, S.; Zhou, S.; Sun, W. Efficient regional memory network for video object segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 1286–1295.
- Wang, H.; Cao, P.; Wang, J.; Zaiane, O.R. UCTransNet: Rethinking the skip connections in U-Net from a channel-wise perspective with Transformer. In Proceedings of the AAAI-22 Technical Tracks 3, Virtual, 22 February–1 March 2022; Volume 36, pp. 2441–2449.
- Sun, H.; Bai, J.; Yang, F.; Bai, X. Receptive-field and direction induced attention network for infrared dim small target detection with a large-scale dataset IRDST. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13.
- Zhu, Z.; Xu, M.; Bai, S.; Huang, T.; Bai, X. Asymmetric non-local neural networks for semantic segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
- Wang, Y.; Gao, Z.; Long, M. PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 10–15 July 2018; pp. 4904–4913.
- Feng, Z.; Zhang, W.; Sun, X.; Guo, L.; Liu, D. A Semi-Synthetic Dataset of Infrared Dim Small Moving Targets for Detection and Segmentation. 2025. Available online: https://www.scidb.cn/en/detail?dataSetId=36901b64578d4384a9144f57194c866e (accessed on 1 October 2025).
- Xiao, X.; Lian, S.; Luo, Z.; Li, S. Weighted Res-UNet for high-quality retina vessel segmentation. In Proceedings of the 2018 9th International Conference on Information Technology in Medicine and Education (ITME), Hangzhou, China, 19–21 October 2018; pp. 327–331.
| Schemes | Tasks | Algorithms | IRDST P (%) | IRDST R (%) | IRDST F1 (%) | IRDST IoU (%) | IRDST Fa () | IDSMT P (%) | IDSMT R (%) | IDSMT F1 (%) | IDSMT IoU (%) | IDSMT Fa () |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Model-driven | SISTD | Top-Hat [24] | 16.59 | 84.46 | 27.73 | 18.19 | 47.92 | 3.94 | 37.14 | 7.13 | 6.51 | 109.49 |
| | | FKRW [25] | 6.97 | 67.99 | 12.64 | 6.61 | 81.75 | 11.62 | 23.85 | 15.63 | 6.52 | 27.81 |
| | | MPCM [3] | 1.10 | 48.33 | 2.15 | 1.70 | 564.09 | 0.23 | 50.44 | 0.47 | 0.33 | 4830.45 |
| | | RIPT [26] | 1.88 | 74.99 | 3.68 | 1.22 | 1596.37 | 0.64 | 44.46 | 1.27 | 0.46 | 4023.38 |
| | MISTD | NFTDGSTV [7] | 0.22 | 5.98 | 0.43 | 0.08 | 1063.39 | 0.38 | 34.10 | 0.75 | 0.56 | 1409.45 |
| | | STRL-LBCM [32] | 33.56 | 7.32 | 12.02 | 1.92 | 2.24 | 11.32 | 19.67 | 14.37 | 4.83 | 28.15 |
| Data-driven | SISTD | ISTDU-Net [27] | 86.24 | 95.21 | 90.50 | 57.00 | 3.54 | 96.59 | 86.36 | 91.19 | 67.74 | 0.60 |
| | | DNANet [11] | 90.94 | 94.47 | 92.67 | 60.49 | 1.91 | 93.91 | 83.75 | 88.54 | 69.03 | 1.35 |
| | | RDIAN [47] | 68.01 | 95.62 | 79.48 | 46.39 | 11.22 | 81.82 | 85.88 | 83.80 | 63.08 | 3.89 |
| | | MSHNet [30] | 92.00 | 94.06 | 93.02 | 60.64 | 1.28 | 94.22 | 81.01 | 87.11 | 64.18 | 1.39 |
| | | SCTransNet [31] | 87.67 | 94.95 | 91.16 | 58.65 | 3.09 | 95.16 | 90.50 | 92.77 | 73.59 | 0.86 |
| | MISTD | TAD # [36] | 83.20 | 85.48 | 84.32 | - | - | 89.85 | 64.28 | 74.94 | - | - |
| | | SSTNet # [13] | 94.01 | 90.40 | 92.17 | - | - | 93.26 | 58.76 | 72.10 | - | - |
| | | Tridos # [39] | 90.87 | 95.44 | 93.10 | - | - | 92.13 | 73.23 | 81.60 | - | - |
| | | DTUM [17] | 84.86 | 88.17 | 86.48 | 51.12 | 4.28 | 71.63 | 90.33 | 79.90 | 69.92 | 4.68 |
| | | MTTU-Net (Ours) | 93.78 | 94.82 | 94.30 | 60.71 | 1.02 | 97.57 | 93.49 | 95.48 | 76.52 | 0.41 |
| ResU-Net | CCA | DS | TCTM | TSTM | IRDST F1 (%) | IRDST IoU (%) | IDSMT F1 (%) | IDSMT IoU (%) |
|---|---|---|---|---|---|---|---|---|
| ✓ | ✗ | ✗ | ✗ | ✗ | 90.68 | 55.27 | 89.26 | 66.48 |
| ✓ | ✓ | ✗ | ✗ | ✗ | 91.43 | 56.08 | 90.05 | 67.67 |
| ✓ | ✓ | ✓ | ✗ | ✗ | 91.72 | 56.54 | 91.85 | 69.36 |
| ✓ | ✓ | ✓ | ✓ | ✗ | 92.46 | 58.13 | 95.10 | 76.03 |
| ✓ | ✓ | ✓ | ✗ | ✓ | 92.03 | 57.61 | 93.13 | 74.09 |
| ✓ | ✓ | ✓ | ✓ | ✓ | 94.30 | 60.71 | 95.48 | 76.52 |
| Permutations | CCSS | SSCC | SCSC | CSCS |
|---|---|---|---|---|
| IRDST | 92.93/59.25 | 92.80/59.71 | 94.28/60.53 | 94.30/60.71 |
| IDSMT | 94.06/75.01 | 94.50/75.46 | 95.91/76.31 | 95.48/76.52 |
| CFN | D-ConvLSTM | IRDST F1 (%)/IoU (%) | IDSMT F1 (%)/IoU (%) | Params (M) ↓ | Flops (G) ↓ | Speed (FPS) ↑ |
|---|---|---|---|---|---|---|
| ✗ | ✗ | 93.16/58.08 | 93.33/74.52 | 12.80 | 122.73 | 15.52 |
| ✓ | ✗ | 93.41/58.54 | 93.83/74.69 | 14.27 | 126.30 | 15.35 |
| ✗ | ✓ | 94.15/60.29 | 95.25/76.11 | 18.59 | 132.84 | 15.23 |
| ✓ | ✓ | 94.30/60.71 | 95.48/76.52 | 20.06 | 141.85 | 15.07 |
| BC | CFN | D-ConvLSTM | PP | IRDST F1 (%)/IoU (%) | IDSMT F1 (%)/IoU (%) | Params (M) ↓ | Flops (G) ↓ | Speed (FPS) ↑ |
|---|---|---|---|---|---|---|---|---|
| ✓ | ✗ | ✗ | ✗ | 91.90/57.96 | 94.20/74.18 | 19.59 | 162.81 | 7.68 |
| ✓ | ✓ | ✗ | ✗ | 92.44/58.36 | 94.64/74.60 | 19.70 | 163.96 | 7.61 |
| ✓ | ✓ | ✓ | ✗ | 94.21/60.50 | 95.85/76.03 | 20.06 | 168.33 | 7.57 |
| ✓ | ✓ | ✓ | ✓ | 94.30/60.71 | 95.48/76.52 | 20.06 | 141.85 | 15.07 |
| ✗ | ✓ | ✓ | ✓ | 93.88/59.94 | 95.26/75.94 | 32.77 | 168.36 | 14.67 |
| Input Form | IRDST F1 (%)/IoU (%) | IDSMT F1 (%)/IoU (%) |
|---|---|---|
| [, ] | 92.55/58.41 | 94.13/74.09 |
| [,] | 94.30/60.71 | 95.48/76.52 |
| | 93.84/59.86 | 94.96/75.66 |
| Hyper-Param | IRDST F1 (%) | IRDST IoU (%) | IDSMT F1 (%) | IDSMT IoU (%) | Params (M) ↓ | Flops (G) ↓ | Speed (FPS) ↑ |
|---|---|---|---|---|---|---|---|
| The number of MTTMs | | | | | | | |
| | 93.84 | 59.82 | 94.94 | 75.55 | 12.83 | 116.77 | 16.86 |
| | 94.30 | 60.71 | 95.48 | 76.52 | 20.06 | 141.85 | 15.07 |
| | 94.22 | 60.26 | 95.94 | 75.95 | 27.30 | 166.94 | 13.80 |
| | 93.91 | 59.94 | 95.32 | 76.01 | 34.53 | 192.03 | 12.63 |
| The number of channels in MTTM | | | | | | | |
| | 93.06 | 59.36 | 94.66 | 75.07 | 5.02 | 88.22 | 16.60 |
| | 93.65 | 59.90 | 94.99 | 75.89 | 8.25 | 100.57 | 16.24 |
| | 94.30 | 60.71 | 95.48 | 76.52 | 20.06 | 141.85 | 15.07 |
| | 94.15 | 60.31 | 95.01 | 76.16 | 39.02 | 205.23 | 14.17 |
| m | 3 | 4 | 5 | 6 |
|---|---|---|---|---|
| IRDST | 93.78/60.12 | 94.02/60.34 | 94.30/60.71 | 94.22/60.79 |
| IDSMT | 93.95/74.40 | 94.72/75.33 | 95.48/76.52 | 96.03/76.42 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Feng, Z.; Zhang, W.; Liu, D.; Tao, X.; Su, A.; Yang, Y. Memory-Based Temporal Transformer U-Net for Multi-Frame Infrared Small Target Detection. Remote Sens. 2025, 17, 3801. https://doi.org/10.3390/rs17233801