Siamese Tracking Network with Spatial-Semantic-Aware Attention and Flexible Spatiotemporal Constraint
Abstract
1. Introduction
2. Related Work
2.1. Siamese-Based Trackers
2.2. Trackers with the Attentional Mechanism
2.3. Trackers with Spatiotemporal Constraint
Tracker Type | Related Tracker | Year | Key Characteristic |
---|---|---|---|
Siamese-based trackers | SiamFC [5] | 2016 | Similarity learning |
Siamese-based trackers | SiamRPN [7] | 2018 | Region proposal network |
Siamese-based trackers | DSiam [18] | 2017 | Use multilayer features |
Siamese-based trackers | SiamMCF [20] | 2018 | Use multilayer features |
Siamese-based trackers | SiamBAN [8] | 2020 | Anchor-free strategy |
Siamese-based trackers | SiamCAR [10] | 2020 | Anchor-free strategy |
Trackers with the attentional mechanism | DVAT [26] | 2010 | Local and semi-local attention regions |
Trackers with the attentional mechanism | RTT [27] | 2016 | Recurrent neural networks |
Trackers with the attentional mechanism | RASNet [28] | 2018 | Residual Attentional Siamese Network |
Trackers with the attentional mechanism | SCSAtt [29] | 2020 | Stacked channel-spatial attention |
Trackers with a spatiotemporal constraint | MOSSE [30] | 2010 | Cosine window constraint |
Trackers with a spatiotemporal constraint | KCF [31] | 2014 | Hamming window |
Trackers with a spatiotemporal constraint | NA [32] | 2020 | Noise-aware framework |
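The "cosine window constraint" listed in the table can be illustrated with a minimal NumPy sketch. It shows the general idea of a fixed spatiotemporal prior: responses far from the previous target position (assumed here to be the center of the map) are down-weighted. The additive blending rule, the weight `influence=0.3`, and the toy response map are assumptions chosen for illustration; this is not the flexible constraint proposed in this paper, nor the exact formulation of any tracker in the table.

```python
import numpy as np

def cosine_window(height, width):
    """2-D Hann (raised-cosine) window, peaking at the map center."""
    return np.outer(np.hanning(height), np.hanning(width))

def apply_window(response, influence=0.3):
    """Blend a response map with a cosine window.

    `influence` is a hypothetical blending weight in [0, 1]; larger values
    penalize candidates far from the previous target position more strongly.
    """
    window = cosine_window(*response.shape)
    return (1.0 - influence) * response + influence * window

# Toy response map: the true target sits next to the center (8, 8),
# while a slightly stronger distractor sits near a corner.
response = np.zeros((17, 17))
response[8, 9] = 1.0
response[1, 15] = 1.05

raw_peak = np.unravel_index(np.argmax(response), response.shape)
penalized = apply_window(response)
peak = np.unravel_index(np.argmax(penalized), penalized.shape)

print("peak without window:", raw_peak)
print("peak with window:   ", peak)
```

In this toy case the raw maximum falls on the distractor, while the windowed maximum falls on the candidate near the previous position. A fixed window of this kind has no way to adapt its shape, which is the sort of limitation a flexible constraint is meant to address.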
3. The Proposed Method
3.1. Spatial-Semantic-Aware Attention Model
3.1.1. Spatial-Aware Attention Model
3.1.2. Semantic-Aware Attention Model
3.2. Flexible Spatiotemporal Constraint
3.3. Adaptive Weight Template Updating
4. Experiments
4.1. Settings and Datasets
4.2. Results on OTB100
4.3. Attribute-Based Comparison
4.4. Results on UAV123
4.5. Results on NFS
4.6. Results on VOT2016
4.7. Results on TC128
4.8. Visual Evaluation
4.9. Ablation Study
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Shirzadeh, M.; Asl, H.J.; Amirkhani, A.; Jalali, A.A. Vision-based control of a quadrotor utilizing artificial neural networks for tracking of moving targets. Eng. Appl. Artif. Intell. 2017, 58, 34–48.
- Fernandez-Sanjurjo, M.; Bosquet, B.; Mucientes, M.; Brea, V.M. Real-time visual detection and tracking system for traffic monitoring. Eng. Appl. Artif. Intell. 2019, 85, 410–420.
- Zhang, J.; Liu, J.; Wang, Z. Convolutional neural network for crowd counting on metro platforms. Symmetry 2021, 13, 703.
- He, Z.; He, H. Unsupervised multi-object detection for video surveillance using memory-based recurrent attention networks. Symmetry 2018, 10, 375.
- Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional siamese networks for object tracking. In Proceedings of the Computer Vision—ECCV 2016 Workshops, Amsterdam, The Netherlands, 8–10 and 15–16 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 850–865.
- Zhang, Z.; Peng, H. Deeper and wider siamese networks for real-time visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4591–4600.
- Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8971–8980.
- Chen, Z.; Zhong, B.; Li, G.; Zhang, S.; Ji, R. Siamese box adaptive network for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6668–6677.
- Li, L.; Liu, Y.; Chen, Z. SiamCenter: An Anchor-free Siamese Network for Object Tracking. In Proceedings of the 2020 9th International Conference on Computing and Pattern Recognition, New York, NY, USA, 30 October–1 November 2020; pp. 460–466.
- Guo, D.; Wang, J.; Cui, Y.; Wang, Z.; Chen, S. SiamCAR: Siamese fully convolutional classification and regression for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6269–6277.
- Xu, Y.; Wang, Z.; Li, Z.; Yuan, Y.; Yu, G. Siamfc++: Towards robust and accurate visual tracking with target estimation guidelines. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12549–12556.
- Chen, D.; Tang, F.; Dong, W.; Yao, H.; Xu, C. SiamCPN: Visual tracking with the Siamese center-prediction network. Comput. Vis. Media 2021, 7, 253–265.
- Peng, J.; Jiang, Z.; Gu, Y.; Wu, Y.; Wang, Y.; Tai, Y.; Wang, C.; Lin, W. Siamrcr: Reciprocal classification and regression for visual object tracking. arXiv 2021, arXiv:2105.11237.
- Zhang, J.; Miao, M.; Zhang, H.; Wang, J.; Zhang, J.; Qiu, Z. Siamese reciprocal classification and residual regression for robust object tracking. Digit. Signal Process. 2022, 123, 103451.
- Kaiser, J.; Schafer, R. On the use of the I0-sinh window for spectrum analysis. IEEE Trans. Acoust. Speech Signal Process. 1980, 28, 105–107.
- Mohanty, N.C. Signal Processing: Signals, Filtering, and Detection; Springer: Berlin/Heidelberg, Germany, 2012.
- Bergen, S.W.; Antoniou, A. Design of ultraspherical window functions with prescribed spectral characteristics. Eurasip J. Adv. Signal Process. 2004, 2004, 196503.
- Guo, Q.; Feng, W.; Zhou, C.; Huang, R.; Wan, L.; Wang, S. Learning dynamic siamese network for visual object tracking. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1763–1771.
- Held, D.; Thrun, S.; Savarese, S. Learning to track at 100 fps with deep regression networks. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 749–765.
- Morimitsu, H. Multiple context features in siamese networks for visual object tracking. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28.
- Huang, L.; Zhao, X.; Huang, K. Bridging the gap between detection and tracking: A unified approach. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3999–4009.
- Yang, Y.; Jiao, L.; Liu, X.; Liu, F.; Yang, S.; Li, L.; Chen, P.; Li, X.; Huang, Z. Dual wavelet attention networks for image classification. IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 1899–1910.
- Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3156–3164.
- Guo, D.; Shao, Y.; Cui, Y.; Wang, Z.; Zhang, L.; Shen, C. Graph attention tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9543–9552.
- Fan, J.; Wu, Y.; Dai, S. Discriminative spatial attention for robust tracking. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2010; pp. 480–493.
- Cui, Z.; Xiao, S.; Feng, J.; Yan, S. Recurrently target-attending tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1449–1458.
- Wang, Q.; Teng, Z.; Xing, J.; Gao, J.; Hu, W.; Maybank, S. Learning attentions: Residual attentional siamese network for high performance online visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4854–4863.
- Rahman, M.M.; Fiaz, M.; Jung, S.K. Efficient visual tracking with stacked channel-spatial attention learning. IEEE Access 2020, 8, 100857–100869.
- Bolme, D.S.; Beveridge, J.R.; Draper, B.A.; Lui, Y.M. Visual object tracking using adaptive correlation filters. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 2544–2550.
- Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 583–596.
- Li, S.; Zhao, S.; Cheng, B.; Chen, J. Noise-aware framework for robust visual tracking. IEEE Trans. Cybern. 2020, 52, 1179–1192.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Ma, C.; Huang, J.B.; Yang, X.; Yang, M.H. Hierarchical convolutional features for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3074–3082.
- Corbetta, M.; Shulman, G.L. Control of goal-directed and stimulus-driven attention in the brain. Nat. Rev. Neurosci. 2002, 3, 201–215.
- Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722.
- Wu, Y.; Lim, J.; Yang, M.H. Object Tracking Benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1834–1848.
- Mueller, M.; Smith, N.; Ghanem, B. A benchmark and simulator for uav tracking. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 445–461.
- Kiani Galoogahi, H.; Fagg, A.; Huang, C.; Ramanan, D.; Lucey, S. Need for speed: A benchmark for higher frame rate object tracking. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1125–1134.
- Kristan, M.; Matas, J.; Leonardis, A.; Felsberg, M.; Cehovin, L.; Fernandez, G.; Vojir, T.; Hager, G.; Nebehay, G.; Pflugfelder, R. The visual object tracking vot2016 challenge results. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Las Vegas, NV, USA, 27–30 June 2016; pp. 1–23.
- Liang, P.; Blasch, E.; Ling, H. Encoding color information for visual tracking: Algorithms and benchmark. IEEE Trans. Image Process. 2015, 24, 5630–5644.
- Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-aware siamese networks for visual object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 101–117.
- Li, P.; Chen, B.; Ouyang, W.; Wang, D.; Yang, X.; Lu, H. Gradnet: Gradient-guided network for visual object tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6162–6171.
- Danelljan, M.; Hager, G.; Shahbaz Khan, F.; Felsberg, M. Convolutional features for correlation filter based visual tracking. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Santiago, Chile, 7–13 December 2015; pp. 58–66.
- Danelljan, M.; Hager, G.; Shahbaz Khan, F.; Felsberg, M. Learning spatially regularized correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4310–4318.
- Danelljan, M.; Häger, G.; Khan, F.S.; Felsberg, M. Discriminative scale space tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1561–1575.
- Danelljan, M.; Bhat, G.; Khan, F.S.; Felsberg, M. Atom: Accurate tracking by overlap maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4660–4669.
- Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4282–4291.
- Cao, Z.; Huang, Z.; Pan, L.; Zhang, S.; Liu, Z.; Fu, C. Towards Real-World Visual Tracking with Temporal Contexts. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 15834–15849.
- Bao, J.; Chen, K.; Sun, X.; Zhao, L.; Diao, W.; Yan, M. SiamTHN: Siamese Target Highlight Network for Visual Tracking. IEEE Trans. Circuits Syst. Video Technol. 2023.
- Ni, X.; Yuan, L.; Lv, K. Efficient Single-Object Tracker Based on Local-Global Feature Fusion. IEEE Trans. Circuits Syst. Video Technol. 2023.
- Li, Y.; Zhu, J. A scale adaptive kernel correlation filter tracker with feature integration. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2014; pp. 254–265.
- Nam, H.; Han, B. Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4293–4302.
- Danelljan, M.; Bhat, G.; Shahbaz Khan, F.; Felsberg, M. Eco: Efficient convolution operators for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6638–6646.
- Danelljan, M.; Robinson, A.; Shahbaz Khan, F.; Felsberg, M. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 472–488.
- Bhat, G.; Johnander, J.; Danelljan, M.; Khan, F.S.; Felsberg, M. Unveiling the power of deep tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 483–498.
- Cheng, S.; Zhong, B.; Li, G.; Liu, X.; Tang, Z.; Li, X.; Wang, J. Learning to filter: Siamese relation network for robust tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4421–4431.
- Lukezic, A.; Matas, J.; Kristan, M. D3s-a discriminative single shot segmentation tracker. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7133–7142.
- Yang, T.; Xu, P.; Hu, R.; Chai, H.; Chan, A.B. ROAM: Recurrently optimizing tracking model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6718–6727.
- Hu, Q.; Zhou, L.; Wang, X.; Mao, Y.; Zhang, J.; Ye, Q. Spstracker: Sub-peak suppression of response map for robust object tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 10989–10996.
- Fan, N.; Liu, Q.; Li, X.; Zhou, Z.; He, Z. Siamese residual network for efficient visual tracking. Inf. Sci. 2023, 624, 606–623.
- Zhang, H.; Liang, J.; Zhang, J.; Zhang, T.; Lin, Y.; Wang, Y. Attention-Driven Memory Network for Online Visual Tracking. IEEE Trans. Neural Netw. Learn. Syst. 2023.
- Zhang, J.; Ma, S.; Sclaroff, S. MEEM: Robust tracking via multiple experts using entropy minimization. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part VI 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 188–203.
- Hare, S.; Golodetz, S.; Saffari, A.; Vineet, V.; Cheng, M.M.; Hicks, S.L.; Torr, P.H. Struck: Structured output tracking with kernels. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 2096–2109.
Iteration Number | Success Rate (%) | FPS |
---|---|---|
100 | 68.4 | 10.2 |
200 | 68.6 | 9.3 |
300 | 68.7 | 8.5 |
400 | 69.0 | 8.1 |
500 | 69.2 | 7.9 |
600 | 69.3 | 7.3 |
700 | 69.0 | 7.0 |
Tracker | Precision | Success Rate |
---|---|---|
ATOM | 0.856 | 0.643 |
SiamRPN++ | 0.840 | 0.642 |
Ours | 0.849 | 0.648 |
SiamTHN | 0.836 | 0.635 |
LGFF | 0.834 | 0.632 |
SiamBAN | 0.833 | 0.631 |
DaSiamRPN | 0.781 | 0.569 |
SiamRPN | 0.768 | 0.557 |
TCTrack++ | 0.731 | 0.519 |
ECO | 0.741 | 0.525 |
SRDCF | 0.676 | 0.464 |
SAMF | 0.592 | 0.395 |
Tracker | MDNet | ECO | C-COT | UPDT | ATOM | SiamBAN | Ours | LGFF |
---|---|---|---|---|---|---|---|---|
AUC | 0.422 | 0.466 | 0.488 | 0.537 | 0.584 | 0.594 | 0.602 | 0.610 |
Tracker | EAO | Accuracy | Robustness |
---|---|---|---|
SiamRPN | 0.344 | 0.56 | 1.08 |
C-COT | 0.331 | 0.53 | 0.85 |
MDNet | 0.257 | 0.54 | 1.2 |
SiamRN | 0.277 | 0.55 | 1.37 |
D3S | 0.493 | 0.660 | 0.13 |
SiamBAN | 0.505 | 0.632 | 0.149 |
SPS | 0.459 | 0.625 | 0.158 |
ROAM | 0.441 | 0.559 | 0.131 |
SiamRNE | 0.300 | 0.540 | 1.120 |
SiamTHN | 0.510 | 0.625 | 0.126 |
Ours | 0.515 | 0.636 | 0.140 |
Tracker | DP | OP | Params (M) | FLOPs (M) | FPS |
---|---|---|---|---|---|
SiamBase | 0.894 | 0.682 | 53.932 | 5569.01 | 11.2 |
SiamSA | 0.897 | 0.686 | 54.619 | 6232.32 | 8.5 |
SiamSAST | 0.899 | 0.688 | 54.619 | 6232.32 | 8.4 |
SiamSDP | 0.903 | 0.692 | 59.801 | 6495.10 | 7.9 |
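The accuracy/cost trade-off in the ablation table above can be made explicit with a little arithmetic. The sketch below copies the table values verbatim and reports each variant relative to SiamBase; the function name `relative_to_base` and the choice of derived quantities are illustrative assumptions, not part of the paper.

```python
# (DP, OP, Params in M, FLOPs in M, FPS), copied from the ablation table.
ablation = {
    "SiamBase": (0.894, 0.682, 53.932, 5569.01, 11.2),
    "SiamSA":   (0.897, 0.686, 54.619, 6232.32, 8.5),
    "SiamSAST": (0.899, 0.688, 54.619, 6232.32, 8.4),
    "SiamSDP":  (0.903, 0.692, 59.801, 6495.10, 7.9),
}

def relative_to_base(name, base="SiamBase"):
    """Gains and overheads of one variant relative to the baseline row."""
    dp, op, params, flops, fps = ablation[name]
    b_dp, b_op, b_params, b_flops, b_fps = ablation[base]
    return {
        "dp_gain": round(dp - b_dp, 3),
        "op_gain": round(op - b_op, 3),
        "params_increase_pct": round(100 * (params / b_params - 1), 1),
        "flops_increase_pct": round(100 * (flops / b_flops - 1), 1),
        "fps_drop_pct": round(100 * (1 - fps / b_fps), 1),
    }

for variant in ("SiamSA", "SiamSAST", "SiamSDP"):
    print(variant, relative_to_base(variant))
```

For the full model (SiamSDP), this works out to a gain of 0.009 in DP and 0.010 in OP, for roughly 11% more parameters, 17% more FLOPs, and about a 29% lower frame rate than SiamBase.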
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhang, H.; Wang, P.; Zhang, J.; Wang, F.; Song, X.; Zhou, H. Siamese Tracking Network with Spatial-Semantic-Aware Attention and Flexible Spatiotemporal Constraint. Symmetry 2024, 16, 61. https://doi.org/10.3390/sym16010061