Spatial–Temporal Analysis-Based Video Quality Assessment: A Two-Stream Convolutional Network Approach
Abstract
1. Introduction
- (1) Different from traditional video quality assessment (VQA) methods that evaluate a video as a whole, and considering that video carries information in different dimensions, we take spatial and temporal residual and distorted maps as separate inputs. We thus construct a new two-stream framework for VQA that is more consistent with human visual perception of video.
- (2) Different from traditional image-content-based feature extraction, and considering that the human visual system (HVS) perceives temporal and spatial information with different mechanisms and complexity, we design two convolutional feature-extraction branches, one for spatial and one for temporal information, so that the model can extract temporal and spatial distortion features more accurately.
- (3) Different from feature-extraction strategies guided by an attention module, and considering the impact of distortion and other visual factors on quality assessment, we guide the feature extraction of each residual map with its corresponding distorted map. We also design a new spatial–temporal feature-fusion model so that features from different dimensions can jointly represent the degree of video distortion. (A rough architectural sketch in code follows this list.)
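Since the full architectural details are given in Section 3, the following PyTorch sketch is only a minimal reading of contributions (1)–(3): two convolutional branches, one per information dimension, each fed a residual map concatenated with its distorted map, followed by spatial–temporal fusion. All layer widths, the single-channel grayscale frames, and the exact pairing of residual and distorted maps per branch are our assumptions, not the authors' design.

```python
import torch
import torch.nn as nn

class ConvBranch(nn.Module):
    """A small convolutional feature extractor; depths and widths are placeholders."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # global pooling -> (B, 64, 1, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.features(x).flatten(1)    # (B, 64)

class TwoStreamVQA(nn.Module):
    """Hypothetical two-stream VQA sketch: one branch per information dimension,
    each guided by the distorted map, followed by spatial-temporal fusion."""
    def __init__(self):
        super().__init__()
        self.spatial_branch = ConvBranch(in_channels=2)   # spatial residual + distorted frame
        self.temporal_branch = ConvBranch(in_channels=2)  # temporal residual + distorted frame
        self.fusion = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
        self.regressor = nn.Linear(64, 1)                 # per-frame quality score

    def forward(self, dist, ref, prev_dist):
        # dist, ref, prev_dist: (B, 1, H, W) grayscale frames (an assumption).
        spatial_res = ref - dist          # spatial residual map (reference minus distorted)
        temporal_res = dist - prev_dist   # temporal residual map (consecutive-frame difference)
        fs = self.spatial_branch(torch.cat([spatial_res, dist], dim=1))
        ft = self.temporal_branch(torch.cat([temporal_res, dist], dim=1))
        fused = self.fusion(torch.cat([fs, ft], dim=1))   # spatial-temporal feature fusion
        return self.regressor(fused).squeeze(1)           # (B,) predicted scores
```

A dummy pass such as `TwoStreamVQA()(torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64))` returns two frame-level scores; per-video scores would then be pooled across the sampled frames.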
2. Related Works
2.1. Traditional Quality Assessment Methods
2.2. Deep Learning Methods
3. Proposed TSCNN-VQA Method
3.1. Preprocessing for Distorted Video
3.2. Spatial and Temporal Feature Extraction
3.2.1. Spatial Feature Extraction
3.2.2. Temporal Feature Extraction
3.3. Spatial–Temporal Feature Fusion
3.4. Attention Module
3.5. Quality Score Prediction
4. Experimental Results
4.1. Datasets and Training Protocols
4.2. Ablation Experiments for TSCNN-VQA
4.2.1. The Effect of Frame Sample Number on Performance and Complexity
4.2.2. The Effect of Different Structures of the Feature Extraction on Performance
4.2.3. The Effect of Attention Module Location on Performance
4.2.4. Effect of Spatial Residual Map on Performance
4.3. Comparison with the State-of-the-Art Quality Assessment Methods
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Li, J.; Li, X. Study on no-reference video quality assessment method incorporating dual deep learning networks. Multimed. Tools Appl. 2023, 82, 3081–3100.
2. Ding, K.; Ma, K.; Wang, S.; Simoncelli, E.P. Image quality assessment: Unifying structure and texture similarity. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2567–2581.
3. Fang, Y.; Yan, J.; Du, R.; Zuo, Y.; Wen, W.; Zeng, Y.; Li, L. Blind quality assessment for tone-mapped images by analysis of gradient and chromatic statistics. IEEE Trans. Multimed. 2020, 23, 955–966.
4. Ahn, S.; Choi, Y.; Yoon, K. Deep learning-based distortion sensitivity prediction for full-reference image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 344–353.
5. Xu, M.; Chen, J.; Wang, H.; Liu, S.; Li, G.; Bai, Z. C3DVQA: Full-reference video quality assessment with 3D convolutional neural network. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 4447–4451.
6. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. Adv. Neural Inf. Process. Syst. 2014, 27, 568–576.
7. Qian, J.; Wu, D.; Li, L.; Cheng, D.; Wang, X. Image quality assessment based on multi-scale representation of structure. Digit. Signal Process. 2014, 33, 125–133.
8. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
9. Mittal, A.; Moorthy, A.K.; Bovik, A.C. No-reference image quality assessment in the spatial domain. IEEE Trans. Image Process. 2012, 21, 4695–4708.
10. Wang, Z.; Li, Q. Video quality assessment using a statistical model of human visual speed perception. J. Opt. Soc. Am. A 2007, 24, B61–B69.
11. Moorthy, A.K.; Bovik, A.C. Efficient video quality assessment along temporal trajectories. IEEE Trans. Circuits Syst. Video Technol. 2010, 20, 1653–1658.
12. Seshadrinathan, K.; Soundararajan, R.; Bovik, A.C.; Cormack, L.K. A subjective study to evaluate video quality assessment algorithms. In Proceedings of the Human Vision and Electronic Imaging XV, San Jose, CA, USA, 18–21 January 2010; Volume 7527, pp. 128–137.
13. Seshadrinathan, K.; Bovik, A.C. Motion tuned spatio-temporal quality assessment of natural videos. IEEE Trans. Image Process. 2009, 19, 335–350.
14. Wang, Y.; Jiang, T.; Ma, S.; Gao, W. Novel spatio-temporal structural information based video quality metric. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 989–998.
15. Aydin, T.O.; Čadík, M.; Myszkowski, K.; Seidel, H.-P. Video quality assessment for computer graphics applications. ACM Trans. Graph. 2010, 29, 1–12.
16. He, L.; Lu, W.; Jia, C.; Hao, L. Video quality assessment by compact representation of energy in 3D-DCT domain. Neurocomputing 2017, 269, 108–116.
17. Vu, P.V.; Vu, C.T.; Chandler, D.M. A spatiotemporal most-apparent-distortion model for video quality assessment. In Proceedings of the 2011 18th IEEE International Conference on Image Processing (ICIP), Brussels, Belgium, 11–14 September 2011; pp. 2505–2508.
18. Manasa, K.; Channappayya, S.S. An optical flow-based full reference video quality assessment algorithm. IEEE Trans. Image Process. 2016, 25, 2480–2492.
19. Yan, P.; Mou, X. Video quality assessment based on motion structure partition similarity of spatiotemporal slice images. J. Electron. Imaging 2018, 27, 033019.
20. Kim, W.; Kim, J.; Ahn, S.; Kim, J.; Lee, S. Deep video quality assessor: From spatio-temporal visual sensitivity to a convolutional neural aggregation network. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 219–234.
21. Chen, J.; Wang, H.; Xu, M.; Li, G.; Liu, S. Deep neural networks for end-to-end spatiotemporal video quality prediction and aggregation. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6.
22. Daly, S.J. Visible differences predictor: An algorithm for the assessment of image fidelity. In Proceedings of the Human Vision, Visual Processing, and Digital Display III, San Jose, CA, USA, 10–13 February 1992; Volume 1666, pp. 2–15.
23. Kim, J.; Nguyen, A.D.; Lee, S. Deep CNN-based blind image quality predictor. IEEE Trans. Neural Netw. Learn. Syst. 2018, 30, 11–24.
24. You, J.; Ebrahimi, T.; Perkis, A. Attention driven foveated video quality assessment. IEEE Trans. Image Process. 2013, 23, 200–213.
25. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
26. Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 60.
27. Seshadrinathan, K.; Soundararajan, R.; Bovik, A.C.; Cormack, L.K. Study of subjective and objective quality assessment of video. IEEE Trans. Image Process. 2010, 19, 1427–1441.
28. Vu, P.V.; Chandler, D.M. ViS3: An algorithm for video quality assessment via analysis of spatial and spatiotemporal slices. J. Electron. Imaging 2014, 23, 013016.
29. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
30. Sheikh, H.R.; Bovik, A.C. Image information and visual quality. IEEE Trans. Image Process. 2006, 15, 430–444.
31. Li, Z.; Aaron, A.; Katsavounidis, I.; Moorthy, A.; Manohara, M. Toward a Practical Perceptual Video Quality Metric. Netflix Technology Blog. Available online: https://medium.com/netflix-techblog/toward-a-practical-perceptual-video-quality-metric-653f208b9652 (accessed on 6 June 2016).
Effect of the number of sampled frames on performance and complexity (Section 4.2.1):

| Sampled Frames | PLCC | SROCC | Time (s) |
|---|---|---|---|
| 6 | 0.913 | 0.927 | 372 |
| 12 | 0.942 | 0.953 | 732 |
| 24 | 0.920 | 0.928 | 1483 |
| 48 | 0.901 | 0.908 | 2862 |
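All tables in this section report PLCC (Pearson linear correlation coefficient) and SROCC (Spearman rank-order correlation coefficient) between predicted and subjective scores; higher is better. The following minimal sketch shows how the two metrics are computed; note that in VQA practice PLCC is often measured after a nonlinear logistic mapping of predictions onto the subjective scale, which this sketch omits.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def plcc_srocc(predicted, mos):
    """PLCC: Pearson linear correlation between predictions and subjective scores.
    SROCC: Spearman rank-order correlation (prediction monotonicity)."""
    predicted = np.asarray(predicted, dtype=float)
    mos = np.asarray(mos, dtype=float)
    plcc, _ = pearsonr(predicted, mos)
    srocc, _ = spearmanr(predicted, mos)
    return plcc, srocc

# Example with made-up scores (hypothetical values, not from the paper):
# plcc, srocc = plcc_srocc([55.2, 40.1, 72.8, 61.0], [60.0, 38.5, 70.2, 58.9])
```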
Effect of different feature-extraction structures on performance (Section 4.2.2):

| Scheme | PLCC | SROCC |
|---|---|---|
| Scheme 1 | 0.875 | 0.918 |
| Scheme 2 | 0.913 | 0.916 |
| Scheme 3 | 0.939 | 0.936 |
| Proposed | 0.942 | 0.953 |
Effect of attention module location on performance (Section 4.2.3); ➀–➄ denote candidate insertion points within the network:

| Location | PLCC | SROCC |
|---|---|---|
| None | 0.904 | 0.909 |
| ➀ | 0.923 | 0.930 |
| ➁ | 0.912 | 0.917 |
| ➂ | 0.673 | 0.651 |
| ➃ | 0.773 | 0.801 |
| ➄ | 0.643 | 0.748 |
| ➀➁ | 0.902 | 0.892 |
| ➂➃ | 0.874 | 0.833 |
| ➂➄ | 0.838 | 0.894 |
| ➃➄ | 0.914 | 0.922 |
| ➂➃➄ | 0.908 | 0.915 |
| ➀➁➄ | 0.942 | 0.953 |
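The exact positions ➀–➄ are defined in Section 3.4 and are not reproduced here. Given that the references include squeeze-and-excitation networks (Hu et al.), one plausible reading, which is our assumption rather than something the table confirms, is that the attention module is SE-style channel attention; a minimal sketch of that mechanism:

```python
import torch
import torch.nn as nn

class SEAttention(nn.Module):
    """Squeeze-and-excitation channel attention (Hu et al.): pool each channel
    to a scalar, pass the channel descriptor through a bottleneck MLP, and
    rescale the input's channels by the resulting sigmoid weights."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze: (B, C) channel descriptor
        return x * w.view(b, c, 1, 1)     # excite: reweight feature channels
```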
Effect of the spatial residual map on performance (Section 4.2.4):

| Scheme | PLCC | SROCC |
|---|---|---|
| Scheme 4 | 0.873 | 0.884 |
| Scheme 5 | 0.914 | 0.921 |
| Scheme 6 | 0.901 | 0.902 |
| Proposed | 0.942 | 0.953 |
Comparison with state-of-the-art quality assessment methods on the LIVE and CSIQ VQA datasets (Section 4.3):

| Method | LIVE VQA (PLCC) | LIVE VQA (SROCC) | CSIQ VQA (PLCC) | CSIQ VQA (SROCC) |
|---|---|---|---|---|
| PSNR | 0.727 | 0.740 | 0.599 | 0.611 |
| SSIM | 0.788 | 0.721 | 0.763 | 0.762 |
| VIF | 0.760 | 0.686 | 0.728 | 0.726 |
| MOVIE | 0.861 | 0.848 | 0.630 | 0.625 |
| ST-MAD | 0.857 | 0.839 | 0.767 | 0.777 |
| VMAF | 0.812 | 0.816 | 0.657 | 0.638 |
| DeepVQA | 0.895 | 0.915 | 0.914 | 0.912 |
| C3DVQA | 0.912 | 0.926 | 0.904 | 0.915 |
| Method in [21] | 0.905 | 0.920 | 0.938 | 0.943 |
| Proposed | 0.942 | 0.953 | 0.943 | 0.948 |
He, J.; Wang, Z.; Liu, Y.; Song, Y. Spatial–Temporal Analysis-Based Video Quality Assessment: A Two-Stream Convolutional Network Approach. Electronics 2024, 13, 1874. https://doi.org/10.3390/electronics13101874