SCT-Diff: Seamless Contextual Tracking via Diffusion Trajectory
Abstract
1. Introduction
- We propose SCT-Diff, a video-level diffusion tracking framework that reconstructs the tracking trajectory holistically. This enables bidirectional spatiotemporal perception and overcomes the limitations of static template matching and one-shot integration of temporal priors.
- We introduce a novel decoder architecture that incorporates lightweight Mamba-based vision and language experts to aggregate global context for both motion and appearance dynamics.
- We design a non-causal interaction mechanism that uses information from future frames to self-correct trajectory hypotheses, exploiting temporal propagation consistency to mitigate the risk of erroneous updates. Extensive results on large-scale visual object tracking (VOT) benchmarks demonstrate the effectiveness of the proposed method.
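As a rough illustration of what refining a whole trajectory at once can look like, the sketch below runs a deterministic DDIM-style reverse process (eta = 0, with clean-sample prediction) over every box of a clip simultaneously. It is a toy under stated assumptions, not the SCT-Diff implementation: the function `ddim_refine`, the noise schedule `alphas_bar`, and the `oracle` predictor are hypothetical names introduced here, and `oracle` merely stands in for a learned decoder. The point it shows is that the predictor receives the entire noisy trajectory at each step, so its estimate for one frame may depend on both earlier and later frames.

```python
import math
import random


def ddim_refine(x_T, predict_x0, alphas_bar):
    """Deterministic DDIM-style sampling (eta = 0) over a whole trajectory.

    x_T        : list of boxes [cx, cy, w, h], one per frame, initially noise
    predict_x0 : callable(x_t, step) -> estimate of the clean trajectory;
                 it sees all frames at once, so it can be non-causal
    alphas_bar : cumulative signal levels in (0, 1); index 0 is the least
                 noisy step, the last index the most noisy one (assumed)
    """
    x = [list(box) for box in x_T]
    for i in range(len(alphas_bar) - 1, -1, -1):
        a_t = alphas_bar[i]
        a_prev = alphas_bar[i - 1] if i > 0 else 1.0  # 1.0 means fully clean
        x0_hat = predict_x0(x, i)
        new_x = []
        for box, box0 in zip(x, x0_hat):
            row = []
            for v, v0 in zip(box, box0):
                # Noise implied by the current sample and the x0 estimate.
                eps_hat = (v - math.sqrt(a_t) * v0) / math.sqrt(1.0 - a_t)
                # Move to the previous (less noisy) level.
                row.append(math.sqrt(a_prev) * v0
                           + math.sqrt(1.0 - a_prev) * eps_hat)
            new_x.append(row)
        x = new_x
    return x


# Toy usage: a slowly drifting target and a placeholder "oracle" predictor.
target = [[0.5 + 0.01 * t, 0.5, 0.2, 0.2] for t in range(6)]


def oracle(x_t, step):
    # Hypothetical stand-in for a trained decoder.
    return target


random.seed(0)
noise = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
refined = ddim_refine(noise, oracle, [0.9, 0.5, 0.1])
```

With a perfect predictor the final step returns the predicted clean trajectory; with a learned predictor, intermediate steps let later estimates revise earlier ones.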
2. Related Work
2.1. Visual Object Tracking
2.2. Temporal Relation Modeling
2.3. Diffusion Model
3. Methodologies
3.1. Preliminaries
3.1.1. Spatiotemporal Tracking Framework
3.1.2. Diffusion Model Framework
3.2. SCT-Diff Framework
3.2.1. Trajectory Coordinate Tokenization
3.2.2. Diffusion Models for Trajectory Generation
3.2.3. Encoder
3.2.4. Decoder
3.3. Training
3.3.1. Training Loss
3.3.2. Two-Stage Training
3.4. Inference
Trajectory Refinement
Algorithm 1. Inference algorithm (decoder only)
4. Experiments
4.1. Implementation Details
4.2. Overall Performance
4.2.1. GOT-10k
4.2.2. LaSOT
4.2.3. TrackingNet
4.2.4. TNL2K
4.2.5. OTB-100, NFS, and TC-128
4.2.6. Qualitative Analysis
4.3. Ablation and Analysis
4.3.1. Video Clip Length
4.3.2. Time Window
4.3.3. Depth of Diffusion Layer
4.3.4. Vision Expert and Language Expert
4.3.5. Unified Position Reference
4.3.6. Training Strategy
4.3.7. Loss Combination
4.4. Limitations and Future Work
Limitations
Extension to Transportation Scenarios
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Kugarajeevan, J.; Kokul, T.; Ramanan, A.; Fernando, S. Transformers in single object tracking: An experimental survey. IEEE Access 2023, 11, 80297–80326.
2. Abdelaziz, O.; Shehata, M.; Mohamed, M. Beyond traditional visual object tracking: A survey. Int. J. Mach. Learn. Cybern. 2025, 16, 1435–1460.
3. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional siamese networks for object tracking. In Proceedings of the Computer Vision–ECCV 2016 Workshops, Amsterdam, The Netherlands, 11–14 October 2016; pp. 850–865.
4. Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8126–8135.
5. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8971–8980.
6. Lin, L.; Fan, H.; Zhang, Z.; Xu, Y.; Ling, H. Swintrack: A simple and strong baseline for transformer tracking. Adv. Neural Inf. Process. Syst. 2022, 35, 16743–16754.
7. Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning spatio-temporal transformer for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10448–10457.
8. Meng, W.; Duan, S.; Ma, S.; Hu, B. Motion-Perception Multi-Object Tracking (MPMOT): Enhancing Multi-Object Tracking Performance via Motion-Aware Data Association and Trajectory Connection. J. Imaging 2025, 11, 144.
9. Cui, Y.; Jiang, C.; Wang, L.; Wu, G. Mixformer: End-to-end tracking with iterative mixed attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13608–13618.
10. Song, Z.; Luo, R.; Yu, J.; Chen, Y.P.P.; Yang, W. Compact transformer tracker with correlative masked modeling. Proc. AAAI Conf. Artif. Intell. 2023, 37, 2321–2329.
11. Cai, W.; Liu, Q.; Wang, Y. Hiptrack: Visual tracking with historical prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 19258–19267.
12. Mayer, C.; Danelljan, M.; Bhat, G.; Paul, M.; Paudel, D.P.; Yu, F.; Van Gool, L. Transforming model prediction for tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8731–8740.
13. Zhang, L.; Gonzalez-Garcia, A.; Weijer, J.V.D.; Danelljan, M.; Khan, F.S. Learning the model update for siamese trackers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4010–4019.
14. Xie, J.; Zhong, B.; Mo, Z.; Zhang, S.; Shi, L.; Song, S.; Ji, R. Autoregressive queries for adaptive tracking with spatio-temporal transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 19300–19309.
15. Wei, X.; Bai, Y.; Zheng, Y.; Shi, D.; Gong, Y. Autoregressive Visual Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 9697–9706.
16. Bai, Y.; Zhao, Z.; Gong, Y.; Wei, X. Artrackv2: Prompting autoregressive tracker where to look and how to describe. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 19048–19057.
17. Chen, X.; Peng, H.; Wang, D.; Lu, H.; Hu, H. Seqtrack: Sequence to sequence learning for visual object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14572–14581.
18. Wang, X.; Chen, Z.; Tang, J.; Luo, B.; Wang, Y.; Tian, Y.; Wu, F. Dynamic attention guided multi-trajectory analysis for single object tracking. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 4895–4908.
19. Wang, H.; Liu, J.; Su, Y.; Yang, X. Trajectory guided robust visual object tracking with selective remedy. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 3425–3440.
20. Xu, L.; Diao, Z.; Wei, Y. Non-linear target trajectory prediction for robust visual tracking. Appl. Intell. 2022, 52, 8588–8602.
21. Prasannakumar, A.; Mishra, D. Deep Efficient Data Association for Multi-Object Tracking: Augmented with SSIM-Based Ambiguity Elimination. J. Imaging 2024, 10, 171.
22. Xie, F.; Chu, L.; Li, J.; Lu, Y.; Ma, C. Videotrack: Learning to track objects via video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22826–22835.
23. Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint feature learning and relation modeling for tracking: A one-stream framework. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 341–357.
24. Dhariwal, P.; Nichol, A. Diffusion models beat gans on image synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 8780–8794.
25. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv 2022, arXiv:2204.06125.
26. Saharia, C.; Ho, J.; Chan, W.; Salimans, T.; Fleet, D.J.; Norouzi, M. Image super-resolution via iterative refinement. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4713–4726.
27. Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4282–4291.
28. Gao, S.; Zhou, C.; Zhang, J. Generalized relation modeling for transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18686–18695.
29. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009.
30. Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. Adv. Neural Inf. Process. Syst. 2020, 33, 21002–21012.
31. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850.
32. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
33. Bhat, G.; Danelljan, M.; Gool, L.V.; Timofte, R. Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6182–6191.
34. Danelljan, M.; Bhat, G.; Shahbaz Khan, F.; Felsberg, M. Eco: Efficient convolution operators for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6638–6646.
35. Danelljan, M.; Bhat, G.; Khan, F.S.; Felsberg, M. Atom: Accurate tracking by overlap maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4660–4669.
36. Lan, J.P.; Cheng, Z.Q.; He, J.Y.; Li, C.; Luo, B.; Bao, X.; Xiang, W.; Geng, Y.; Xie, X. Procontext: Exploring progressive context transformer for tracking. In Proceedings of the ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
37. Li, X.; Ma, C.; Wu, B.; He, Z.; Yang, M.H. Target-aware deep tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1369–1378.
38. Wang, G.; Luo, C.; Sun, X.; Xiong, Z.; Zeng, W. Tracking by instance detection: A meta-learning approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6288–6297.
39. Yang, T.; Chan, A.B. Learning dynamic memory networks for object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 152–167.
40. Dai, K.; Zhang, Y.; Wang, D.; Li, J.; Lu, H.; Yang, X. High-performance long-term tracking with meta-updater. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6298–6307.
41. Wang, X.; Nie, G.; Li, B.; Zhao, Y.; Kang, M.; Liu, B. Hierarchical memory-guided long-term tracking with meta transformer inquiry network. Knowl.-Based Syst. 2023, 269, 110504.
42. Nie, G.; Wang, X.; Yan, Z.; Xu, X.; Liu, B. Temporal relation transformer for robust visual tracking with dual-memory learning. Appl. Soft Comput. 2024, 167, 112229.
43. Zhou, Z.; Zhou, X.; Chen, Z.; Guo, P.; Liu, Q.Y.; Zhang, W. Memory network with pixel-level spatio-temporal learning for visual object tracking. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 6897–6911.
44. He, K.; Zhang, C.; Xie, S.; Li, Z.; Wang, Z. Target-aware tracking with long-term context attention. Proc. AAAI Conf. Artif. Intell. 2023, 37, 773–780.
45. Oh, S.W.; Lee, J.Y.; Xu, N.; Kim, S.J. Video object segmentation using space-time memory networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9226–9235.
46. Wang, N.; Zhou, W.; Wang, J.; Li, H. Transformer meets tracker: Exploiting temporal context for robust visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1571–1580.
47. Bhat, G.; Danelljan, M.; Van Gool, L.; Timofte, R. Know your surroundings: Exploiting scene information for object tracking. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 205–221.
48. Sauer, A.; Aljalbout, E.; Haddadin, S. Tracking holistic object representations. arXiv 2019, arXiv:1907.12920.
49. Fu, Z.; Liu, Q.; Fu, Z.; Wang, Y. Stmtrack: Template-free visual tracking with space-time memory networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13774–13783.
50. Liao, Z.; Xu, X.; Xu, Z.; Ismail, A. Discriminative learning of online appearance modeling methods for visual tracking. J. Opt. 2024, 53, 1129–1136.
51. Zheng, Y.; Zhong, B.; Liang, Q.; Li, G.; Ji, R.; Li, X. Toward unified token learning for vision-language tracking. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 2125–2135.
52. Fan, W.C.; Chen, Y.C.; Chen, D.; Cheng, Y.; Yuan, L.; Wang, Y.C.F. Frido: Feature pyramid diffusion for complex scene image synthesis. Proc. AAAI Conf. Artif. Intell. 2023, 37, 579–587.
53. Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; Aberman, K. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22500–22510.
54. Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.L.; Ghasemipour, K.; Gontijo Lopes, R.; Karagol Ayan, B.; Salimans, T.; et al. Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural Inf. Process. Syst. 2022, 35, 36479–36494.
55. Singer, U.; Polyak, A.; Hayes, T.; Yin, X.; An, J.; Zhang, S.; Hu, Q.; Yang, H.; Ashual, O.; Gafni, O.; et al. Make-a-video: Text-to-video generation without text-video data. arXiv 2022, arXiv:2209.14792.
56. Yang, R.; Srivastava, P.; Mandt, S. Diffusion probabilistic modeling for video generation. Entropy 2023, 25, 1469.
57. Zhang, M.; Cai, Z.; Pan, L.; Hong, F.; Guo, X.; Yang, L.; Liu, Z. Motiondiffuse: Text-driven human motion generation with diffusion model. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 4115–4128.
58. Huang, R.; Zhao, Z.; Liu, H.; Liu, J.; Cui, C.; Ren, Y. Prodiff: Progressive fast diffusion model for high-quality text-to-speech. In Proceedings of the 30th ACM International Conference on Multimedia, Lisbon, Portugal, 10–14 October 2022; pp. 2595–2605.
59. Kim, S.; Kim, H.; Yoon, S. Guided-tts 2: A diffusion model for high-quality adaptive text-to-speech with untranscribed data. arXiv 2022, arXiv:2205.15370.
60. Levkovitch, A.; Nachmani, E.; Wolf, L. Zero-shot voice conditioning for denoising diffusion tts models. arXiv 2022, arXiv:2206.02246.
61. Wu, S.; Shi, Z. Stochastic Differential Equation is All You Need for Voice Generation. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4156409 (accessed on 5 January 2026).
62. Yang, D.; Yu, J.; Wang, H.; Wang, W.; Weng, C.; Zou, Y.; Yu, D. Diffsound: Discrete diffusion model for text-to-sound generation. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 1720–1733.
63. Austin, J.; Johnson, D.D.; Ho, J.; Tarlow, D.; Van Den Berg, R. Structured denoising diffusion models in discrete state-spaces. Adv. Neural Inf. Process. Syst. 2021, 34, 17981–17993.
64. Gong, S.; Li, M.; Feng, J.; Wu, Z.; Kong, L. Diffuseq: Sequence to sequence text generation with diffusion models. arXiv 2022, arXiv:2210.08933.
65. Li, X.; Thickstun, J.; Gulrajani, I.; Liang, P.S.; Hashimoto, T.B. Diffusion-lm improves controllable text generation. Adv. Neural Inf. Process. Syst. 2022, 35, 4328–4343.
66. Gong, S.; Li, M.; Feng, J.; Wu, Z.; Kong, L. Diffuseq-v2: Bridging discrete and continuous text spaces for accelerated seq2seq diffusion models. arXiv 2023, arXiv:2310.05793.
67. He, Y.; Cai, Z.; Gan, X.; Chang, B. DiffCap: Exploring continuous diffusion on image captioning. arXiv 2023, arXiv:2305.12144.
68. Luo, R.; Song, Z.; Ma, L.; Wei, J.; Yang, W.; Yang, M. Diffusiontrack: Diffusion model for multi-object tracking. Proc. AAAI Conf. Artif. Intell. 2024, 38, 3991–3999.
69. Ji, Y.; Chen, Z.; Xie, E.; Hong, L.; Liu, X.; Liu, Z.; Lu, T.; Li, Z.; Luo, P. Ddp: Diffusion model for dense visual prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 21741–21752.
70. Brempong, E.A.; Kornblith, S.; Chen, T.; Parmar, N.; Minderer, M.; Norouzi, M. Denoising pretraining for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4175–4186.
71. Chen, T.; Li, L.; Saxena, S.; Hinton, G.; Fleet, D.J. A generalist framework for panoptic segmentation of images and videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 909–919.
72. Graikos, A.; Malkin, N.; Jojic, N.; Samaras, D. Diffusion models as plug-and-play priors. Adv. Neural Inf. Process. Syst. 2022, 35, 14715–14728.
73. Kim, B.; Oh, Y.; Ye, J.C. Diffusion adversarial representation learning for self-supervised vessel segmentation. arXiv 2022, arXiv:2209.14566.
74. Wolleb, J.; Sandkühler, R.; Bieder, F.; Valmaggia, P.; Cattin, P.C. Diffusion models for implicit image segmentation ensembles. In Proceedings of the International Conference on Medical Imaging with Deep Learning, Zurich, Switzerland, 6–8 July 2022; pp. 1336–1348.
75. Chen, S.; Sun, P.; Song, Y.; Luo, P. Diffusiondet: Diffusion model for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 19830–19843.
76. Xie, F.; Wang, Z.; Ma, C. Diffusiontrack: Point set diffusion model for visual object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 19113–19124.
77. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752.
78. Chen, T.; Zhang, R.; Hinton, G. Analog bits: Generating discrete data using diffusion models with self-conditioning. arXiv 2022, arXiv:2208.04202.
79. Nichol, A.Q.; Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8162–8171.
80. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
81. Fan, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Bai, H.; Xu, Y.; Liao, C.; Ling, H. Lasot: A high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5374–5383.
82. Huang, L.; Zhao, X.; Huang, K. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1562–1577.
83. Muller, M.; Bibi, A.; Giancola, S.; Alsubaihi, S.; Ghanem, B. Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 300–317.
84. Wang, X.; Shu, X.; Zhang, Z.; Jiang, B.; Wang, Y.; Tian, Y.; Wu, F. Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13763–13773.
85. Wu, Y.; Lim, J.; Yang, M.H. Online object tracking: A benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2411–2418.
86. Kiani Galoogahi, H.; Fagg, A.; Huang, C.; Ramanan, D.; Lucey, S. Need for speed: A benchmark for higher frame rate object tracking. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1125–1134.
87. Liang, P.; Blasch, E.; Ling, H. Encoding color information for visual tracking: Algorithms and benchmark. IEEE Trans. Image Process. 2015, 24, 5630–5644.
88. Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. arXiv 2020, arXiv:2010.02502.
89. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
90. Nam, H.; Han, B. Learning Multi-Domain Convolutional Neural Networks for Visual Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
91. Gao, S.; Zhou, C.; Ma, C.; Wang, X.; Yuan, J. Aiatrack: Attention in attention for transformer visual tracking. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 146–164.
92. Shi, L.; Zhong, B.; Liang, Q.; Li, N.; Zhang, S.; Li, X. Explicit Visual Prompts for Visual Object Tracking. Proc. AAAI Conf. Artif. Intell. 2024, 38, 4838–4846.
93. Wang, X.; Nie, G.; Meng, J.; Yan, Z. MIMTrack: In-Context Tracking via Masked Image Modeling. Proc. AAAI Conf. Artif. Intell. 2025, 39, 7979–7987.
94. Li, K.; Li, X.; Wang, Y.; He, Y.; Wang, Y.; Wang, L.; Qiao, Y. Videomamba: State space model for efficient video understanding. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2024; pp. 237–255.
95. Yang, C.; Han, X.; Han, T.; Su, Y.; Gao, J.; Zhang, H.; Wang, Y.; Chau, L.P. Signeye: Traffic sign interpretation from vehicle first-person view. IEEE Trans. Intell. Transp. Syst. 2025, 26, 19413–19425.
96. Guo, Y.; Feng, W.; Yin, F.; Liu, C.L. SignParser: An end-to-end framework for traffic sign understanding. Int. J. Comput. Vis. 2024, 132, 805–821.










| Type | Method | GOT-10k AO | GOT-10k SR0.5 | GOT-10k SR0.75 | TrackingNet AUC | TrackingNet PNorm | TrackingNet P | LaSOT AUC | LaSOT PNorm | LaSOT P | TNL2K AUC | TNL2K P |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Discriminative | MDNet [90] | 29.9 | 30.3 | 9.9 | 60.6 | 70.5 | 56.5 | 39.7 | 46.0 | 37.3 | - | - |
| | ATOM [35] | 55.6 | 63.4 | 40.2 | 70.3 | 77.1 | 64.8 | 51.5 | 57.6 | 50.5 | 40.1 | 39.2 |
| | SiamRPN++ [27] | 51.7 | 61.6 | 32.5 | 73.3 | 80.0 | 69.4 | 49.6 | 56.9 | 49.1 | 41.3 | 41.2 |
| | DiMP [33] | 61.1 | 71.7 | 49.2 | 74.0 | 80.1 | 68.7 | 56.9 | 65.0 | 56.7 | 44.7 | 43.4 |
| | TrDiMP [46] | 67.1 | 77.7 | 58.3 | 78.4 | 83.3 | 73.1 | 63.9 | - | 61.4 | - | - |
| | TransT [4] | 67.1 | 76.8 | 60.9 | 81.4 | 86.7 | 80.3 | 64.9 | 73.8 | 69.0 | 50.7 | 51.7 |
| | STARK [7] | 68.8 | 78.1 | 64.1 | 82.0 | 86.9 | - | 67.1 | 77.0 | - | - | - |
| | AiATrack [91] | 69.6 | 80.0 | 63.2 | 82.7 | 87.8 | 80.4 | 69.0 | 79.4 | 73.8 | - | - |
| | SwinTrack-T [6] | 71.3 | 81.9 | 64.5 | 81.1 | - | 78.4 | 67.2 | - | 70.8 | 55.9 | 57.1 |
| | MixFormer-22k [9] | 70.7 | 80.0 | 67.8 | 83.1 | 88.1 | 81.6 | 69.2 | 78.7 | 74.7 | - | - |
| | OSTrack [23] | 71.0 | 80.4 | 68.2 | 83.1 | 87.8 | 82.0 | 69.1 | 78.7 | 75.2 | 55.9 | - |
| | GRM [28] | 73.4 | 82.9 | 70.4 | 84.0 | 88.7 | 83.3 | 69.9 | 79.3 | 75.8 | - | - |
| | EVPTrack [92] | 73.3 | 83.6 | 70.7 | - | - | - | 70.4 | 80.9 | 77.2 | - | - |
| Generative | ARTrack [15] | 73.5 | 82.2 | 70.9 | 84.2 | 88.7 | 83.5 | 70.4 | 79.5 | 76.6 | 57.5 | - |
| | SeqTrack-B [17] | 74.7 | 84.7 | 71.8 | 83.3 | 88.3 | 82.2 | 69.9 | 79.7 | 76.3 | 54.9 | - |
| | DiffusionTrack [76] | 74.8 | 85.4 | 72.0 | 83.8 | 88.2 | 82.1 | 70.8 | 79.8 | 76.7 | 56.4 | 57.3 |
| | MIMTrack [93] | 72.6 | 83.2 | 69.3 | 83.1 | 87.7 | 80.9 | 69.1 | 78.8 | 75.7 | 57.9 | 57.7 |
| | SCT-Diff | 75.4 | 86.7 | 73.3 | 84.0 | 88.8 | 83.4 | 71.1 | 81.0 | 77.5 | 58.5 | 58.9 |
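
The GOT-10k columns are overlap statistics: AO is the average overlap (mean IoU) between predicted and ground-truth boxes, and SR0.5 and SR0.75 are the fractions of frames whose IoU exceeds 0.5 and 0.75, so SR0.75 can never exceed SR0.5. The following is a minimal single-sequence sketch of these definitions, assuming boxes in (x, y, w, h) format; the function names are illustrative, and the official benchmark additionally aggregates over sequences and classes, which is omitted here.

```python
def iou(a, b):
    """IoU of two boxes given as (x, y, w, h) with a top-left origin."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    # Width and height of the intersection rectangle (zero if disjoint).
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def overlap_metrics(preds, gts):
    """Return (AO, SR0.5, SR0.75) for one sequence of per-frame boxes."""
    ious = [iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    ao = sum(ious) / n
    sr50 = sum(v > 0.5 for v in ious) / n
    sr75 = sum(v > 0.75 for v in ious) / n
    return ao, sr50, sr75


# Toy sequence: the prediction gradually loses height against a fixed target,
# giving per-frame IoUs of 1.0, 0.8, 0.6 and 0.4.
gts = [(0, 0, 10, 10)] * 4
preds = [(0, 0, 10, 10), (0, 0, 10, 8), (0, 0, 10, 6), (0, 0, 10, 4)]
ao, sr50, sr75 = overlap_metrics(preds, gts)
```

For this toy sequence AO is about 0.7, SR0.5 is 0.75 and SR0.75 is 0.5.
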
| Method | SiamRPN++ [27] | DiMP [33] | TransT [4] | STARK [7] | ProContEXT [36] | AiATrack [91] | MixFormer [9] | OSTrack [23] | GRM [28] | ARTrack [15] | SCT-Diff |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NFS | 50.2 | 61.8 | 65.3 | 65.2 | 70.0 | 67.9 | 65.4 | 64.7 | 65.6 | 63.5 | 71.4 |
| TC128 | 57.7 | 61.2 | 59.6 | 60.0 | 58.1 | 58.7 | 60.1 | 54.3 | 54.9 | 55.6 | 63.1 |
| Number | AO | SR0.5 | SR0.75 |
|---|---|---|---|
| 1 | 72.7 | 83.0 | 71.0 |
| 4 | 73.6 | 85.0 | 71.6 |
| 6 | 75.4 | 86.7 | 73.3 |
| 8 | 74.7 | 86.1 | 73.1 |
| 12 | 74.1 | 85.3 | 72.1 |
| 16 | 73.2 | 84.0 | 71.2 |
| Positional Reference | AO | SR0.5 | SR0.75 |
|---|---|---|---|
| Bidirectional | 75.4 | 86.7 | 73.3 |
| Unidirectional | 73.2 | 83.4 | 70.1 |
| V.Expert | L.Expert | AO ↑ | SR0.5 | SR0.75 |
|---|---|---|---|---|
| ✓ | 72.2 | 82.2 | 70.3 | |
| ✓ | 71.7 | 82.5 | 69.8 | |
| ✓ | ✓ | 75.4 | 86.7 | 73.3 |
| Positional Reference | AO | SR0.5 | SR0.75 |
|---|---|---|---|
| 0-th-frame reference | 75.4 | 86.7 | 73.3 |
| Random reference | 72.5 | 83.4 | 70.9 |
| Mean reference | 70.1 | 80.3 | 67.6 |
| Training Strategy | AO | SR0.5 | SR0.75 |
|---|---|---|---|
| w/pre-train | 75.4 | 86.7 | 73.3 |
| w/o pre-train | 73.8 | 83.1 | 71.6 |
| Random interval | 74.4 | 85.9 | 72.5 |
| Continuous scanning | 70.8 | 81.2 | 68.6 |
| CE | GIoU | L1 | AO | SR0.5 | SR0.75 |
|---|---|---|---|---|---|
| ✓ | 72.2 | 81.1 | 66.0 | ||
| ✓ | ✓ | 75.4 | 86.7 | 73.3 | |
| ✓ | ✓ | ✓ | 74.6 | 85.8 | 73.2 |
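
The loss ablation above names GIoU and L1 terms alongside cross-entropy (CE). For reference, the sketch below implements the standard definitions of the two box-regression terms for a single pair of boxes, assuming an (x1, y1, x2, y2) format with positive area. How SCT-Diff weights and combines the terms is not stated in this text, so no weights are shown; the function names are illustrative.

```python
def giou(a, b):
    """Generalized IoU of two boxes (x1, y1, x2, y2); value lies in (-1, 1]."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = area_a + area_b - inter
    # Area of the smallest axis-aligned box enclosing both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    # IoU minus the fraction of the enclosing box not covered by the union.
    return inter / union - (hull - union) / hull


def giou_loss(pred, target):
    """1 - GIoU: zero for a perfect match, up to 2 for distant boxes."""
    return 1.0 - giou(pred, target)


def l1_loss(pred, target):
    """Mean absolute error over the four box coordinates."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


same = giou_loss((0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0))
apart = giou((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0))
```

Unlike plain IoU, GIoU stays informative for non-overlapping boxes: in the disjoint example the value is negative rather than zero, so a loss built on it still reflects how far apart the boxes are.
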
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Nie, G.; Wang, X.; Zhang, D.; Wang, H. SCT-Diff: Seamless Contextual Tracking via Diffusion Trajectory. J. Imaging 2026, 12, 38. https://doi.org/10.3390/jimaging12010038

