Real-Time Visual Tracking with Variational Structure Attention Network
Abstract
1. Introduction
- We propose a structure–attention network that minimizes the structure distortion caused by the boundary effect and helps learn a representative structure of the target during online training of the correlation filter.
- We propose a novel reconstruction loss combined with two denoising criteria for training the structure–attention network. This allows the network to capture robust structural features of the target under the boundary effect without losing detailed information about the target (see the loss sketch after this list).
- Experimental evaluations on standard benchmark datasets demonstrate that our method achieves better or comparable performance to state-of-the-art trackers in accuracy while running at real-time speed.
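The exact reconstruction loss and denoising criteria are defined in Section 3; the snippet below is only a minimal sketch of a denoising variational objective of the kind these contributions describe, assuming a Gaussian latent space, mean-squared-error reconstruction, and a hypothetical input-corruption callable `corrupt` (none of these names are taken from the paper).

```python
import torch
import torch.nn.functional as F

def denoising_vae_loss(encoder, decoder, x, corrupt, beta=1.0):
    """Sketch of a denoising variational objective: encode a corrupted input,
    reconstruct the clean input, and regularize the latent code with a KL term.
    `encoder`, `decoder`, and `corrupt` are hypothetical callables, not the
    paper's exact modules."""
    x_noisy = corrupt(x)                    # denoising criterion: corrupt the input
    mu, log_var = encoder(x_noisy)          # Gaussian latent parameters
    std = torch.exp(0.5 * log_var)
    z = mu + std * torch.randn_like(std)    # reparameterization trick
    x_hat = decoder(z)                      # reconstruct the clean target patch
    recon = F.mse_loss(x_hat, x, reduction='mean')
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + beta * kl
```

For example, additive Gaussian noise is one possible corruption: `loss = denoising_vae_loss(enc, dec, patch, lambda t: t + 0.1 * torch.randn_like(t))`.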
2. Related Works
3. Proposed Method
3.1. Variational Auto-Encoder
3.2. Structure Attention Network
3.3. Pre-Training
3.4. Online Tracking
4. Experimental Results
4.1. Implementation Details
4.2. Evaluation Methodology
4.3. Evaluation on OTB2013
4.4. Evaluation on OTB2015
4.5. Evaluation on TempleColor-128
4.6. Ablation Study
4.7. Qualitative Evaluation
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
Layer | Input Channels | Output Channels | Filter Size | Stride | Padding |
---|---|---|---|---|---|
Conv1 | 32 | 64 | 4 × 4 | 2 | 1 |
BatchNorm-ReLU | 64 | 64 | - | - | - |
Conv2 | 64 | 128 | 4 × 4 | 2 | 1 |
BatchNorm-ReLU | 128 | 128 | - | - | - |
Conv3 | 128 | 64 | 4 × 4 | 2 | 1 |
BatchNorm-ReLU | 64 | 64 | - | - | - |
Fc1 | 13 × 13 × 64 | 512 | - | - | - |
Fc2 | 13 × 13 × 64 | 512 | - | - | - |
Fc3 | 512 | 13 × 13 × 64 | - | - | - |
T-Conv1 | 64 | 128 | 4 × 4 | 2 | 1 |
BatchNorm-ReLU | 128 | 128 | - | - | - |
T-Conv2 | 128 | 64 | 4 × 4 | 2 | 1 |
BatchNorm-ReLU | 64 | 64 | - | - | - |
T-Conv3 | 64 | 32 | 4 × 4 | 2 | 1 |
BatchNorm-ReLU | 32 | 32 | - | - | - |
Conv4 | 32 | 3 | 1 × 1 | 1 | - |
Tanh | 3 | 3 | - | - | - |
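A minimal PyTorch sketch of the encoder, latent layers, and decoder listed in this table is given below. The 32-channel input and the 104 × 104 spatial resolution are assumptions chosen so that three stride-2 convolutions produce the 13 × 13 × 64 map fed to Fc1 and Fc2 (the mean and log-variance branches of the latent code); the paper's feature extraction and attention wiring are not reproduced here.

```python
import torch
import torch.nn as nn

class StructureAttentionVAE(nn.Module):
    """Sketch of the encoder/decoder laid out in the table above.
    The 32-channel, 104x104 input is an assumption so that three
    stride-2 convolutions reach the 13x13x64 map fed to Fc1/Fc2."""

    def __init__(self, latent_dim=512):
        super().__init__()
        def block(cin, cout):      # Conv + BatchNorm-ReLU rows of the table
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
        def t_block(cin, cout):    # T-Conv + BatchNorm-ReLU rows of the table
            return nn.Sequential(
                nn.ConvTranspose2d(cin, cout, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
        self.encoder = nn.Sequential(block(32, 64), block(64, 128), block(128, 64))
        self.fc_mu = nn.Linear(13 * 13 * 64, latent_dim)      # Fc1
        self.fc_logvar = nn.Linear(13 * 13 * 64, latent_dim)  # Fc2
        self.fc_dec = nn.Linear(latent_dim, 13 * 13 * 64)     # Fc3
        self.decoder = nn.Sequential(
            t_block(64, 128), t_block(128, 64), t_block(64, 32),
            nn.Conv2d(32, 3, kernel_size=1, stride=1), nn.Tanh())  # Conv4 + Tanh

    def forward(self, x):
        h = self.encoder(x).flatten(1)
        mu, log_var = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        h_dec = self.fc_dec(z).view(-1, 64, 13, 13)
        return self.decoder(h_dec), mu, log_var
```

For example, `recon, mu, log_var = StructureAttentionVAE()(torch.randn(1, 32, 104, 104))` returns a 3-channel 104 × 104 reconstruction together with the latent statistics.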
Attribute | Ours | CFNet | SRDCF | SiamFC | SiamDCF | ADNet | DSST | TRACA | CNN-SVM | ACFN
---|---|---|---|---|---|---|---|---|---|---
IV | 0.791 | 0.680 | 0.765 | 0.713 | 0.711 | 0.834 | 0.703 | 0.813 | 0.761 | 0.761 |
SV | 0.797 | 0.719 | 0.737 | 0.729 | 0.744 | 0.766 | 0.640 | 0.758 | 0.761 | 0.751 |
OCC | 0.801 | 0.681 | 0.710 | 0.701 | 0.762 | 0.693 | 0.585 | 0.750 | 0.704 | 0.724 |
DEF | 0.740 | 0.672 | 0.707 | 0.670 | 0.714 | 0.796 | 0.531 | 0.744 | 0.766 | 0.746 |
MB | 0.752 | 0.602 | 0.731 | 0.677 | 0.721 | 0.719 | 0.543 | 0.720 | 0.720 | 0.686 |
FM | 0.802 | 0.682 | 0.730 | 0.713 | 0.729 | 0.692 | 0.556 | 0.710 | 0.712 | 0.715 |
IPR | 0.833 | 0.749 | 0.721 | 0.724 | 0.757 | 0.767 | 0.680 | 0.788 | 0.791 | 0.756 |
OPR | 0.804 | 0.723 | 0.719 | 0.739 | 0.744 | 0.794 | 0.639 | 0.802 | 0.776 | 0.754 |
OV | 0.742 | 0.528 | 0.586 | 0.661 | 0.732 | 0.597 | 0.467 | 0.685 | 0.626 | 0.656 |
BC | 0.788 | 0.728 | 0.772 | 0.685 | 0.733 | 0.794 | 0.697 | 0.795 | 0.766 | 0.755 |
LR | 0.897 | 0.805 | 0.757 | 0.899 | 0.798 | 0.913 | 0.673 | 0.850 | 0.914 | 0.810 |
Attribute | Ours | CFNet | SRDCF | SiamFC | SiamDCF | ADNet | DSST | TRACA | CNN-SVM | ACFN
---|---|---|---|---|---|---|---|---|---|---
IV | 0.602 | 0.541 | 0.599 | 0.560 | 0.562 | 0.612 | 0.550 | 0.608 | 0.529 | 0.558 |
SV | 0.597 | 0.550 | 0.561 | 0.552 | 0.568 | 0.563 | 0.475 | 0.554 | 0.490 | 0.547 |
OCC | 0.603 | 0.533 | 0.549 | 0.536 | 0.585 | 0.518 | 0.454 | 0.561 | 0.507 | 0.531 |
DEF | 0.544 | 0.500 | 0.533 | 0.498 | 0.537 | 0.555 | 0.420 | 0.550 | 0.538 | 0.527 |
MB | 0.621 | 0.503 | 0.577 | 0.539 | 0.596 | 0.565 | 0.460 | 0.573 | 0.565 | 0.550 |
FM | 0.634 | 0.546 | 0.581 | 0.556 | 0.593 | 0.550 | 0.460 | 0.561 | 0.534 | 0.551 |
IPR | 0.611 | 0.564 | 0.534 | 0.550 | 0.568 | 0.559 | 0.500 | 0.571 | 0.540 | 0.536 |
OPR | 0.605 | 0.541 | 0.542 | 0.552 | 0.571 | 0.571 | 0.472 | 0.586 | 0.542 | 0.538 |
OV | 0.576 | 0.423 | 0.460 | 0.507 | 0.565 | 0.479 | 0.385 | 0.547 | 0.488 | 0.493 |
BC | 0.594 | 0.565 | 0.583 | 0.520 | 0.563 | 0.588 | 0.535 | 0.591 | 0.551 | 0.539 |
LR | 0.598 | 0.588 | 0.513 | 0.621 | 0.523 | 0.573 | 0.381 | 0.501 | 0.378 | 0.514 |
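The attribute codes in the two tables above are the standard OTB ones (IV illumination variation, SV scale variation, OCC occlusion, DEF deformation, MB motion blur, FM fast motion, IPR/OPR in-plane and out-of-plane rotation, OV out of view, BC background clutter, LR low resolution), and the scores appear to be per-attribute precision and success values under the OTB protocol. As a hedged reference for how such numbers are conventionally computed, the sketch below evaluates precision at a 20-pixel center-error threshold and the success AUC over IoU thresholds; the (x, y, w, h) box format and the 20-pixel threshold are assumptions of the standard protocol, not values stated in these tables.

```python
import numpy as np

def otb_scores(pred_boxes, gt_boxes, prec_thresh=20.0):
    """Sketch of the standard OTB metrics: precision at a center-error
    threshold and success AUC over IoU thresholds.
    Boxes are (x, y, w, h) arrays with one row per frame."""
    pred, gt = np.asarray(pred_boxes, float), np.asarray(gt_boxes, float)

    # Center location error per frame -> precision at the given threshold.
    pc = pred[:, :2] + pred[:, 2:] / 2.0
    gc = gt[:, :2] + gt[:, 2:] / 2.0
    cle = np.linalg.norm(pc - gc, axis=1)
    precision = np.mean(cle <= prec_thresh)

    # Intersection-over-union per frame.
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    iou = inter / np.maximum(union, 1e-12)

    # Success AUC: mean success rate over IoU thresholds in [0, 1].
    thresholds = np.linspace(0, 1, 21)
    success_auc = np.mean([np.mean(iou > t) for t in thresholds])
    return precision, success_auc
```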