TSNet: Token Sparsification for Efficient Video Transformer
Abstract
1. Introduction
2. Related Work
3. Methods
3.1. Overview
3.2. Token Sparsification
3.3. Objective Function
4. Experiments
4.1. Dataset and Backbone
4.2. Implementation Details
4.3. Main Results
4.4. Ablation Studies
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
- Qiu, Z.; Yao, T.; Mei, T. Learning spatio-temporal representation with pseudo-3d residual networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5533–5541. [Google Scholar]
- Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6450–6459. [Google Scholar]
- Lin, J.; Gan, C.; Han, S. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7083–7093. [Google Scholar]
- Li, Y.; Ji, B.; Shi, X.; Zhang, J.; Kang, B.; Wang, L. Tea: Temporal excitation and aggregation for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 909–918. [Google Scholar]
- Zolfaghari, M.; Singh, K.; Brox, T. Eco: Efficient convolutional network for online video understanding. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 695–712. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Zhou, D.; Kang, B.; Jin, X.; Yang, L.; Lian, X.; Jiang, Z.; Hou, Q.; Feng, J. Deepvit: Towards deeper vision transformer. arXiv 2021, arXiv:2103.11886. [Google Scholar]
- Liu, Z.; Luo, S.; Li, W.; Lu, J.; Wu, Y.; Sun, S.; Li, C.; Yang, L. Convtransformer: A convolutional transformer network for video frame synthesis. arXiv 2020, arXiv:2011.10185. [Google Scholar]
- Neimark, D.; Bar, O.; Zohar, M.; Asselmann, D. Video transformer network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3163–3172. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
- Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video swin transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3202–3211. [Google Scholar]
- Fan, H.; Xiong, B.; Mangalam, K.; Li, Y.; Yan, Z.; Malik, J.; Feichtenhofer, C. Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 6824–6835. [Google Scholar]
- Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 6836–6846. [Google Scholar]
- Bertasius, G.; Wang, H.; Torresani, L. Is space-time attention all you need for video understanding? In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; Volume 2, p. 4. [Google Scholar]
- Li, K.; Wang, Y.; Zhang, J.; Gao, P.; Song, G.; Liu, Y.; Li, H.; Qiao, Y. UniFormer: Unifying Convolution and Self-Attention for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12581–12600. [Google Scholar] [CrossRef]
- Li, K.; Wang, Y.; Peng, G.; Song, G.; Liu, Y.; Li, H.; Qiao, Y. UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning. arXiv 2022, arXiv:2201.04676. [Google Scholar]
- Pan, B.; Panda, R.; Jiang, Y.; Wang, Z.; Feris, R.; Oliva, A. IA-RED2: Interpretability-Aware Redundancy Reduction for Vision Transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 24898–24911. [Google Scholar]
- Tang, Y.; Han, K.; Wang, Y.; Xu, C.; Guo, J.; Xu, C.; Tao, D. Patch slimming for efficient vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12165–12174. [Google Scholar]
- Meng, L.; Li, H.; Chen, B.-C.; Lan, S.; Wu, Z.; Jiang, Y.-G.; Lim, S.-N. Adavit: Adaptive vision transformers for efficient image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12309–12318. [Google Scholar]
- Rao, Y.; Zhao, W.; Liu, B.; Lu, J.; Zhou, J.; Hsieh, C.-J. Dynamicvit: Efficient vision transformers with dynamic token sparsification. Adv. Neural Inf. Process. Syst. 2021, 34, 13937–13949. [Google Scholar]
- Rao, Y.; Liu, Z.; Zhao, W.; Zhou, J.; Lu, J. Dynamic spatial sparsification for efficient vision transformers and convolutional neural networks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10883–10897. [Google Scholar] [CrossRef]
- Xu, Y.; Zhang, Z.; Zhang, M.; Sheng, K.; Li, K.; Dong, W.; Zhang, L.; Xu, C.; Sun, X. Evo-vit: Slow-fast token evolution for dynamic vision transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; pp. 2964–2972. [Google Scholar]
- Yin, H.; Vahdat, A.; Alvarez, J.M.; Mallya, A.; Kautz, J.; Molchanov, P. A-vit: Adaptive tokens for efficient vision transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10809–10818. [Google Scholar]
- Jang, E.; Gu, S.; Poole, B. Categorical reparameterization with gumbel-softmax. arXiv 2016, arXiv:1611.01144. [Google Scholar]
- Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar]
- Wu, C.; Zaheer, M.; Hu, H.; Manmatha, R.; Smola, A.J.; Krähenbühl, P. Compressed video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6026–6035. [Google Scholar]
- Feichtenhofer, C. X3d: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 203–213. [Google Scholar]
- Kondratyuk, D.; Yuan, L.; Li, Y.; Zhang, L.; Tan, M.; Brown, M.; Gong, B. Movinets: Mobile video networks for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16020–16030. [Google Scholar]
- Korbar, B.; Tran, D.; Torresani, L. Scsampler: Sampling salient clips from video for efficient action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6232–6242. [Google Scholar]
- Bhardwaj, S.; Srinivasan, M.; Khapra, M.M. Efficient video classification using fewer frames. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 354–363. [Google Scholar]
- Wu, Z.; Xiong, C.; Ma, C.-Y.; Socher, R.; Davis, L.S. Adaframe: Adaptive frame selection for fast video recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1278–1287. [Google Scholar]
- Wang, J.; Yang, X.; Li, H.; Liu, L.; Wu, Z.; Jiang, Y.-G. Efficient video transformers with spatial-temporal token selection. In Proceedings of the European Conference on Computer Vision, Tel-Aviv, Israel, 23–27 October 2022; pp. 69–86. [Google Scholar]
- Park, S.H.; Tack, J.; Heo, B.; Ha, J.-W.; Shin, J. K-centered patch sampling for efficient video recognition. In Proceedings of the European Conference on Computer Vision, Tel-Aviv, Israel, 23–27 October 2022; pp. 160–176. [Google Scholar]
- Ryoo, M.; Piergiovanni, A.; Arnab, A.; Dehghani, M.; Angelova, A. Tokenlearner: Adaptive space-time tokenization for videos. Adv. Neural Inf. Process. Syst. 2021, 34, 12786–12797. [Google Scholar]
- Loshchilov, I.; Hutter, F. Fixing weight decay regularization in adam. arXiv 2017, arXiv:1711.05101. [Google Scholar]
- Loshchilov, I.; Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar]
- Goyal, P.; Dollár, P.; Girshick, R.; Noordhuis, P.; Wesolowski, L.; Kyrola, A.; Tulloch, A.; Jia, Y.; He, K. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv 2017, arXiv:1706.02677. [Google Scholar]
| Method | GFLOPs | Top-1 (%) | Top-5 (%) |
|---|---|---|---|
| Baseline | 167 | 80.8 | 94.7 |
| Random/0.8 | 144.4 (−13.4%) | 79.3 | 93.7 |
| Random/0.5 | 99.2 (−40%) | 29.7 | 44.8 |
| Random/0.3 | 74 (−55.5%) | 1.1 | 3.5 |
| TSNet/0.8 | 144.4 (−13.4%) | 80.5 | 94.3 |
| TSNet/0.5 | 99.2 (−40%) | 79.5 | 93.8 |
| TSNet/0.3 | 74 (−55.5%) | 79.1 | 93.7 |
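The Random/ρ and TSNet/ρ rows above keep a fraction ρ of the spatio-temporal tokens, either at random or with the learned sparsification module. As a rough illustration of how score-based token sparsification of this kind typically works, the sketch below follows the DynamicViT-style recipe [21] with a straight-through Gumbel-Softmax mask [25]: a lightweight head scores each token, a hard but differentiable keep/drop mask is sampled during training, and at inference only the top ρ·N tokens are propagated. It is a minimal sketch under assumed names and sizes (TokenSparsifier, score_net, the two-channel head, the 392-token example), not the authors' exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenSparsifier(nn.Module):
    """Score tokens and keep a fraction of them (illustrative sketch only)."""

    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Lightweight prediction head: per-token logits for (drop, keep).
        self.score_net = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim // 4),
            nn.GELU(),
            nn.Linear(dim // 4, 2),
        )

    def forward(self, tokens):
        # tokens: (B, N, C) spatio-temporal tokens from the video backbone.
        logits = self.score_net(tokens)                                  # (B, N, 2)
        if self.training:
            # Training: straight-through Gumbel-Softmax gives a hard yet
            # differentiable keep/drop decision per token (channel 1 = "keep").
            mask = F.gumbel_softmax(logits, tau=1.0, hard=True)[..., 1]  # (B, N)
            return tokens * mask.unsqueeze(-1), mask
        # Inference: keep the top keep_ratio fraction of tokens by score.
        num_keep = max(1, int(tokens.shape[1] * self.keep_ratio))
        keep_idx = logits[..., 1].topk(num_keep, dim=1).indices          # (B, num_keep)
        kept = torch.gather(
            tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
        )
        mask = torch.zeros(tokens.shape[:2], device=tokens.device)
        mask.scatter_(1, keep_idx, 1.0)
        return kept, mask


# Example: 8 frames × 49 patches = 392 tokens per clip, keeping ratio 0.5.
sparsifier = TokenSparsifier(dim=256, keep_ratio=0.5).eval()
clip_tokens = torch.randn(2, 392, 256)
kept, mask = sparsifier(clip_tokens)
print(kept.shape, int(mask.sum(dim=1)[0]))   # torch.Size([2, 196, 256]) 196
```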
| Method | #Frames | GFLOPs | Top-1 (%) | Top-5 (%) |
|---|---|---|---|---|
| X3D-M [28] | 16 × 3 × 10 | 186 | 76.0 | 92.3 |
| X3D-XL [28] | 16 × 3 × 10 | 1452 | 79.1 | 93.9 |
| VTN [10] | 250 × 1 × 1 | 3992 | 78.6 | 93.7 |
| TimeSformer [15] | 96 × 3 × 1 | 5110 | 79.7 | 94.7 |
| MViT-B16 [13] | 16 × 1 × 5 | 353 | 78.4 | 93.5 |
| ViViT-L [14] | 16 × 3 × 4 | 17352 | 80.6 | 94.7 |
| VideoSwin-T [12] | 32 × 3 × 4 | 1056 | 78.8 | 93.6 |
| TSNet(VideoSwin-T)/0.8 | 32 × 3 × 4 | 871.9 (−17.4%) | 78.3 | 92.9 |
| TSNet(VideoSwin-T)/0.5 | 32 × 3 × 4 | 628.1 (−40.5%) | 77.0 | 91.4 |
| TSNet(VideoSwin-T)/0.3 | 32 × 3 × 4 | 484.8 (−54%) | 76.2 | 90.7 |
| UniFormer-S [17] | 16 × 1 × 4 | 167 | 80.8 | 94.7 |
| TSNet(UniFormer-S)/0.8 | 16 × 1 × 4 | 144.4 (−13.4%) | 80.5 | 94.3 |
| TSNet(UniFormer-S)/0.5 | 16 × 1 × 4 | 99.2 (−40%) | 79.5 | 93.8 |
| TSNet(UniFormer-S)/0.3 | 16 × 1 × 4 | 74 (−55.5%) | 79.1 | 93.7 |
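As a reading aid for the comparison above: the #Frames column follows the common frames × spatial crops × temporal clips convention, and the GFLOPs column appears to report the total cost over all evaluated views rather than a per-view cost. The snippet below works this out for three rows of the table; the helper total_gflops and the derived per-view numbers (table total divided by view count) are illustrative, not values taken from the paper's text.

```python
# Total inference cost = per-view GFLOPs × (spatial crops × temporal clips).
def total_gflops(per_view: float, crops: int, clips: int) -> float:
    """Total cost when each video is scored as crops × clips views."""
    return per_view * crops * clips


# Per-view costs here are simply the table totals divided by the number of views.
print(round(total_gflops(6.2, 3, 10), 1))    # X3D-M,       16 × 3 × 10 -> 186.0
print(round(total_gflops(88.0, 3, 4), 1))    # VideoSwin-T, 32 × 3 × 4  -> 1056.0
print(round(total_gflops(41.75, 1, 4), 1))   # UniFormer-S, 16 × 1 × 4  -> 167.0
```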
| Method | GFLOPs | Top-1 (%) | Top-5 (%) |
|---|---|---|---|
| UniFormer-S | 167 | 80.8 | 94.7 |
| MSA_only/0.5 | 121.4 (−27.3%) | 79.8 | 93.3 |
| FFN_only/0.5 | 140.5 (−15.8%) | 80.2 | 94.1 |
| MSA/FFN/0.5 | 98.4 (−41.0%) | 79.3 | 93.6 |
| TSNet(UniFormer-S)/0.5 (Ours) | 99.2 (−40%) | 79.5 | 93.8 |
| Depth | Top-1 (%) | Top-5 (%) |
|---|---|---|
| 1 | 79.4 | 93.3 |
| 5 | 79.5 | 93.5 |
| 9 | 80.5 | 94.5 |
| 11 | 80.7 | 94.6 |
| Method | GFLOPs | Top-1 (%) | Top-5 (%) |
|---|---|---|---|
| | 134.8 (−19.3%) | 79.0 | 92.7 |
| | 134.2 (−19.6%) | 79.5 | 93.1 |
| | 132.4 (−20.7%) | 79.3 | 93.5 |
| | 134.1 (−19.7%) | 79.6 | 93.5 |
| | 134.9 (−19.2%) | 80.0 | 93.9 |