An Enhanced Siamese Network-Based Visual Tracking Algorithm with a Dual Attention Mechanism
Abstract
1. Introduction
- We improved the AlexNetV2 backbone network by reducing the output stride to increase feature map resolution while employing depthwise separable convolutions to significantly reduce the total number of parameters.
- We added a channel and spatial attention mechanism to enhance the network’s focus on key semantic features and suppress background noise, thereby improving target perception and localization accuracy in complex scenes.
- We incorporated IoU loss into the overall loss function to more effectively optimize both the position and size of the predicted bounding boxes, thereby enhancing localization accuracy.
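As a rough illustration of the parameter savings claimed in the first bullet, the sketch below compares a standard convolution with its depthwise separable factorization (depthwise k×k per channel, then pointwise 1×1). The 96→256 channel counts and the 3×3 kernel size are illustrative assumptions, not values taken from the paper:

```python
def conv_params(c_in, c_out, k):
    """Parameters of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def ds_conv_params(c_in, c_out, k):
    """Depthwise separable: one k x k filter per input channel,
    followed by a 1 x 1 pointwise convolution across channels."""
    return c_in * k * k + c_in * c_out

standard = conv_params(96, 256, 3)      # 221,184 parameters
separable = ds_conv_params(96, 256, 3)  # 864 + 24,576 = 25,440 parameters
print(standard, separable, separable / standard)  # roughly an 8.7x reduction
```

The ratio approaches 1/k² + 1/c_out for large channel counts, which is why the factorization cuts the parameter budget so sharply at 3×3 and above.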
2. Algorithm Framework
2.1. Lightweight Backbone Network Design for Feature Extraction
2.2. Attention Module
2.3. Loss Function
3. The Tracking Process
4. Experimental Results and Analysis
4.1. Evaluation Metrics
4.2. Ablation Experiments
4.3. Experiments on OTB2015
4.4. Experiments on VOT2018
4.5. Experiments on VOT2016
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Keawboontan, T.; Thammawichai, M. Toward real-time UAV multi-target tracking using joint detection and tracking. IEEE Access 2023, 11, 65238–65254.
- Liu, D.; Zhu, X.; Bao, W.; Fei, B.; Wu, J. SMART: Vision-based method of cooperative surveillance and tracking by multiple UAVs in the urban environment. IEEE Trans. Intell. Transp. Syst. 2022, 23, 24941–24956.
- Jiao, L.; Wang, D.; Bai, Y.; Chen, P.; Liu, F. Deep learning in visual tracking: A review. IEEE Trans. Neural Netw. Learn. Syst. 2021, 34, 5497–5516.
- Chen, X.; He, K. Exploring simple Siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021.
- Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional Siamese networks for object tracking. In Proceedings of the Computer Vision–ECCV 2016 Workshops, Amsterdam, The Netherlands, 8–10 and 15–16 October 2016.
- Zhang, Z.; Peng, H. Deeper and wider Siamese networks for real-time visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4591–4600.
- Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-aware Siamese networks for visual object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 101–117.
- Dong, X.; Shen, J. Triplet loss in Siamese network for object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 459–474.
- Voigtlaender, P.; Luiten, J.; Torr, P.H.; Leibe, B. Siam R-CNN: Visual tracking by re-detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6578–6588.
- He, A.; Luo, C.; Tian, X.; Zeng, W. A twofold Siamese network for real-time object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4834–4843.
- Bolme, D.S.; Beveridge, J.R.; Draper, B.A.; Lui, Y.M. Visual object tracking using adaptive correlation filters. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010.
- Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 583–596.
- Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. Exploiting the circulant structure of tracking-by-detection with kernels. In Proceedings of the Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012.
- Danelljan, M.; Bhat, G.; Shahbaz Khan, F.; Felsberg, M. ECO: Efficient convolution operators for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
- Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with Siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
- Fan, Q.; Zhuo, W.; Tang, C.-K.; Tai, Y.-W. Few-shot object detection with attention-RPN and multi-relation detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
- Xu, Y.; Wang, Z.; Li, Z.; Yuan, Y.; Yu, G. SiamFC++: Towards robust and accurate visual tracking with target estimation guidelines. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020.
- Shan, Y.; Hu, W. Review of visual object tracking algorithms of adaptive direction and scale. Comput. Eng. Appl. 2020, 56, 13–23.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. Available online: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf (accessed on 17 May 2025).
- Zaremba, W.; Sutskever, I.; Vinyals, O. Recurrent neural network regularization. arXiv 2014, arXiv:1409.2329.
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144.
- Li, X.; Hu, X.; Yang, J. Spatial group-wise enhance: Improving semantic feature learning in convolutional networks. arXiv 2019, arXiv:1905.09646.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
- Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
- Borji, A.; Itti, L. State-of-the-art in visual attention modeling. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 185–207.
- Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000.
- Lin, L.; Fan, H.; Zhang, Z.; Xu, Y.; Ling, H. SwinTrack: A simple and strong baseline for transformer tracking. Adv. Neural Inf. Process. Syst. 2022, 35, 16743–16754.
- Ismail Fawaz, H.; Lucas, B.; Forestier, G.; Pelletier, C.; Schmidt, D.F.; Weber, J.; Webb, G.I.; Idoumghar, L.; Muller, P.-A. InceptionTime: Finding AlexNet for time series classification. Data Min. Knowl. Discov. 2020, 34, 1936–1962.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
- Khan, Z.Y.; Niu, Z. CNN with depthwise separable convolutions and combined kernels for rating prediction. Expert Syst. Appl. 2021, 170, 114528.
- Mnih, V.; Heess, N.; Graves, A.; Kavukcuoglu, K. Recurrent models of visual attention. Adv. Neural Inf. Process. Syst. 2014, 27. Available online: https://proceedings.neurips.cc/paper_files/paper/2014/file/3e456b31302cf8210edd4029292a40ad-Paper.pdf (accessed on 17 May 2025).
- Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial transformer networks. Adv. Neural Inf. Process. Syst. 2015, 28. Available online: https://proceedings.neurips.cc/paper_files/paper/2015/file/33ceb07bf4eeb3da587e268d663aba1a-Paper.pdf (accessed on 17 May 2025).
- Ren, J.; Zhang, M.; Yu, C.; Liu, Z. Balanced MSE for imbalanced visual regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7926–7935.
- Li, W.; Shang, R.; Ju, Z.; Feng, J.; Xu, S.; Zhang, W. Ellipse IoU loss: Better learning for rotated bounding box regression. IEEE Geosci. Remote Sens. Lett. 2023, 21, 1–5.
- Wu, Y.; Lim, J.; Yang, M.-H. Online object tracking: A benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2411–2418.
- Kristan, M.; Leonardis, A.; Matas, J.; Felsberg, M.; Pflugfelder, R. The sixth visual object tracking VOT2018 challenge results. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018.
- Roffo, G.; Melzi, S. The visual object tracking VOT2016 challenge results. In Computer Vision–ECCV 2016 Workshops: Amsterdam, The Netherlands, 8–10 and 15–16 October 2016, Proceedings, Part II; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 777–823.
- Huang, L.; Zhao, X.; Huang, K. GOT-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1562–1577.
- Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
Stage | Layer | Kernel Size | Template Size | Search Size | Channel
---|---|---|---|---|---
Input | – | – | – | – | –
C1 | Conv1_1-BN-ReLU | – | – | – | 96
 | MaxPool | – | – | – | –
C2 | Conv2_1 | – | – | – | 96
 | Conv2_2-BN-ReLU | – | – | – | 256
 | MaxPool | – | – | – | –
 | MixAttention | – | – | – | 256
C3 | Conv3_1-BN-ReLU | – | – | – | 384
 | MixAttention | – | – | – | 384
C4 | Conv4_1-BN-ReLU | – | – | – | 384
 | MixAttention | – | – | – | 384
C5 | Conv5_1-BN-ReLU | – | – | – | 256
 | Conv5_2 | – | – | – | 256
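The MixAttention rows above insert a channel-plus-spatial (dual) attention block between convolution stages. The paper's exact MixAttention design is not reproduced in this excerpt, so the sketch below follows the general CBAM-style pattern it cites: channel gating from pooled statistics through a shared MLP, then spatial gating from channel-pooled maps. The 1×1 mixing weights standing in for CBAM's 7×7 convolution, and all shapes and weight values, are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    # x: (C, H, W). Squeeze with global average and max pooling,
    # pass both through a shared two-layer MLP, then gate the channels.
    avg = x.mean(axis=(1, 2))                       # (C,)
    mx = x.max(axis=(1, 2))                         # (C,)
    gate = sigmoid(w2 @ np.maximum(w1 @ avg, 0.0)
                   + w2 @ np.maximum(w1 @ mx, 0.0)) # (C,)
    return x * gate[:, None, None]

def spatial_attention(x, w):
    # Pool across channels to a 2-channel map, mix with per-channel
    # weights (a 1x1 simplification of CBAM's 7x7 convolution),
    # then gate every spatial location.
    avg = x.mean(axis=0)                            # (H, W)
    mx = x.max(axis=0)                              # (H, W)
    gate = sigmoid(w[0] * avg + w[1] * mx)          # (H, W)
    return x * gate[None, :, :]

rng = np.random.default_rng(0)
C, H, W, r = 8, 5, 5, 2                             # r: channel reduction ratio
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
y = spatial_attention(channel_attention(x, w1, w2), np.array([0.5, 0.5]))
assert y.shape == x.shape                           # attention preserves shape
```

Because both gates are sigmoids in (0, 1), the block can only rescale features, which is what lets it suppress background responses without changing the feature-map geometry that the correlation step depends on.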
Group Number | Backbone | Loss Function | New Component | Precision | Success |
---|---|---|---|---|---|
1 | AlexNetV2 | Balanced Loss | Baseline | 0.609 | 0.462 |
2 | AlexNetV2 | Balanced Loss | MixAttention | 0.580 | 0.453 |
3 | AlexNetV2 | Balanced Loss | IoU Loss | 0.611 | 0.460 |
4 | AlexNetV2 | Balanced Loss | MixAttention + IoU Loss | 0.631 | 0.468 |
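The IoU loss ablated in groups 3 and 4 above can be sketched as follows. The corner box format (x1, y1, x2, y2) and the plain 1 − IoU form are assumptions for illustration; the Distance-IoU variant cited in the references additionally penalizes center distance:

```python
def iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2) corner coordinates.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def iou_loss(pred, target):
    # L_IoU = 1 - IoU: zero for a perfect box, one for no overlap.
    # Unlike per-coordinate L1/L2, it couples position and size
    # errors in a single scale-invariant term.
    return 1.0 - iou(pred, target)

print(iou_loss((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 - 1/7 ≈ 0.857
```

This coupling is the reason the loss optimizes "both the position and size of the predicted bounding boxes" at once, as the contribution list puts it.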
Group | Model | Backbone | Loss Function | Params (M) | FLOPs (M) | Inference Latency (ms) | FPS (frame/s) | Precision | Success |
---|---|---|---|---|---|---|---|---|---|
A | SiamFC | AlexNetV1 | Balanced Loss | 2.33 | 3179 | 2.28 | 439.30 | 0.589 | 0.360 |
B | SiamFC | AlexNetV2 | Balanced Loss | 1.95 | 6828 | 2.30 | 435.48 | 0.609 | 0.462 |
C | SiamFC | AlexNetV3 | Balanced Loss | 14.92 | 17864 | 2.74 | 428.30 | 0.610 | 0.425 |
D | SiamFC | AlexNetV3 | Focal Loss | 14.92 | 18801 | 3.48 | 287.30 | 0.461 | 0.347 |
E | Ours | MixAttention | Balanced Loss + IoU Loss | 3.21 | 10848 | 3.15 | 346.42 | 0.631 | 0.468 |
F | KCF | Gaussian Kernel | – | 0.05 | 200 | 1.16 | 480.78 | 0.573 | 0.368 |
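For reference, the Precision and Success columns follow the standard OTB protocol: precision is the fraction of frames whose predicted center lies within 20 pixels of the ground-truth center, and success is the area under the curve of frame fractions exceeding each overlap threshold in [0, 1]. A minimal sketch, where the sample error and overlap values are illustrative:

```python
import numpy as np

def precision(center_errors, threshold=20.0):
    # Fraction of frames whose predicted center lies within
    # `threshold` pixels of the ground truth (OTB default: 20 px).
    e = np.asarray(center_errors)
    return float((e <= threshold).mean())

def success_auc(overlaps, n_thresholds=21):
    # Success plot: fraction of frames whose IoU exceeds each
    # threshold on a grid over [0, 1]; the reported score is the
    # mean height of this curve (its area).
    o = np.asarray(overlaps)
    ts = np.linspace(0.0, 1.0, n_thresholds)
    return float(np.mean([(o > t).mean() for t in ts]))

errs = [5.0, 12.0, 25.0, 40.0]   # per-frame center errors in pixels
ious = [0.9, 0.6, 0.3, 0.0]      # per-frame overlap with ground truth
print(precision(errs))           # 2 of 4 frames within 20 px -> 0.5
print(success_auc(ious))
```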
Algorithm | Accuracy | Robustness | EAO |
---|---|---|---|
SiamFC-AlexNetV2-BalancedLoss | 0.418 | 0.304 | 0.531 |
SiamFC-AlexNetV1-BalancedLoss | 0.420 | 0.327 | 0.524 |
SiamFC-AlexNetV3-BalancedLoss | 0.411 | 0.343 | 0.512 |
SiamFC-AlexNetV3-FocalLoss | 0.404 | 0.384 | 0.453 |
Ours | 0.416 | 0.287 | 0.532 |
KCF | 0.401 | 0.425 | 0.403 |
Algorithm | Accuracy | Robustness | EAO | FPS (frame/s) |
---|---|---|---|---|
SiamFC-AlexNetV2-BalancedLoss | 0.417 | 0.318 | 0.531 | 36.5 |
SiamFC-AlexNetV1-BalancedLoss | 0.430 | 0.288 | 0.574 | 53.9 |
SiamFC-AlexNetV3-BalancedLoss | 0.402 | 0.354 | 0.518 | 41.2 |
SiamFC-AlexNetV3-FocalLoss | 0.404 | 0.377 | 0.499 | 36.7 |
Ours | 0.426 | 0.258 | 0.585 | 48.8 |
KCF | 0.391 | 0.421 | 0.404 | 35.7 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Cai, X.; Feng, S.; Masood, V.; Ying, S.; Zhou, B.; Jia, W.; Yang, J.; Wei, C.; Feng, Y. An Enhanced Siamese Network-Based Visual Tracking Algorithm with a Dual Attention Mechanism. Electronics 2025, 14, 2579. https://doi.org/10.3390/electronics14132579