RVM+: An AI-Driven Vision Sensor Framework for High-Precision, Real-Time Video Portrait Segmentation with Enhanced Temporal Consistency and Optimized Model Design
Abstract
1. Introduction
- (1) Development of the RVM+ model: By integrating ConvGRU into the RVM framework, this study advances video segmentation technology, particularly in handling dynamic content and temporal changes (a minimal ConvGRU sketch follows this list).
- (2) Optimization through knowledge distillation: The study introduces an approach to model optimization using knowledge distillation, effectively reducing model size and computational requirements, which is critical for real-time processing (an illustrative sketch appears under Section 2.3).
- (3) Comprehensive performance evaluation: Extensive testing on a variety of challenging datasets demonstrates the enhanced segmentation capability of the RVM+ model, setting new accuracy benchmarks against existing segmentation models.
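Contribution (1) hinges on the ConvGRU, a GRU whose gates are computed with 2D convolutions so the hidden state keeps its spatial layout. The paper gives its exact configuration in Section 2.2.2; the PyTorch sketch below is only a minimal, generic ConvGRU cell for orientation, and its joint gate convolution, kernel size, and channel layout are illustrative assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Minimal ConvGRU cell: GRU gating via 2D convolutions, so the
    hidden state h keeps its spatial layout (B, C, H, W)."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        padding = kernel_size // 2
        # Update (z) and reset (r) gates, computed jointly from [x, h].
        self.gates = nn.Conv2d(2 * channels, 2 * channels, kernel_size, padding=padding)
        # Candidate state, computed from [x, r * h].
        self.candidate = nn.Conv2d(2 * channels, channels, kernel_size, padding=padding)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_new = torch.tanh(self.candidate(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_new

# Usage: carry h across frames so the decoder sees temporal context.
cell = ConvGRUCell(channels=16)
h = torch.zeros(1, 16, 64, 64)
for feat in torch.randn(8, 1, 16, 64, 64):  # 8 frames of features
    h = cell(feat, h)
```

Carrying `h` across frames rather than re-initializing it per frame is what gives the recurrent decoder its temporal consistency.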
2. Materials and Methods
2.1. Datasets and Preprocessing
2.2. Design of the Model Network Structure
2.2.1. General Framework of the RVM
2.2.2. Structure of the ConvGRU
2.2.3. Structure of the Proposed RVM+
2.2.4. Loss Function
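The loss actually used by RVM+ is defined in this subsection. As a hedged illustration only, the sketch below combines the two ingredients typical of RVM-style video matting losses: a per-frame alpha reconstruction term and a temporal-coherence term that matches frame-to-frame changes of the predicted matte to those of the ground truth. Both the choice of L1/L2 terms and the equal weighting are assumptions, not the authors' formulation.

```python
import torch
import torch.nn.functional as F

def matting_loss(pred_alpha: torch.Tensor, gt_alpha: torch.Tensor) -> torch.Tensor:
    """pred_alpha, gt_alpha: (B, T, 1, H, W) alpha sequences in [0, 1]."""
    recon = F.l1_loss(pred_alpha, gt_alpha)          # per-frame accuracy
    d_pred = pred_alpha[:, 1:] - pred_alpha[:, :-1]  # predicted frame-to-frame change
    d_gt = gt_alpha[:, 1:] - gt_alpha[:, :-1]        # ground-truth change
    temporal = F.mse_loss(d_pred, d_gt)              # coherence over time
    return recon + temporal
```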
2.3. Model Knowledge Distillation Strategy
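As a rough illustration of the response-based distillation idea this section's strategy builds on, where a compact student mimics a larger teacher's outputs in addition to the ground truth, the sketch below shows one common form of such a loss for an alpha-matte student. The weight `w_soft` and the L1/MSE pairing are illustrative assumptions, not the paper's exact strategy.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_alpha: torch.Tensor,
                      teacher_alpha: torch.Tensor,
                      gt_alpha: torch.Tensor,
                      w_soft: float = 0.5) -> torch.Tensor:
    """Student fits the ground truth (hard term) and the frozen
    teacher's soft predictions (soft term)."""
    hard = F.l1_loss(student_alpha, gt_alpha)
    soft = F.mse_loss(student_alpha, teacher_alpha.detach())  # no teacher gradients
    return hard + w_soft * soft
```

Because only the soft term depends on the teacher, the teacher can be run once offline and its predictions cached, keeping training cost close to that of the student alone.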
3. Results
3.1. Evaluation Metrics
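The results below are reported as MIoU, SAD, dtSSD, and MSE. Exact scalings of these metrics vary across matting benchmarks, so the NumPy sketch below implements common conventions (SAD summed and divided by 1000; dtSSD as the RMS mismatch of frame-to-frame alpha changes); the normalizations are assumptions and not necessarily those used to produce the tables.

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, thr: float = 0.5) -> float:
    """Mean IoU over foreground and background of the binarized masks."""
    p, g = pred >= thr, gt >= thr
    ious = []
    for pc, gc in ((p, g), (~p, ~g)):  # foreground class, background class
        union = np.logical_or(pc, gc).sum()
        ious.append(np.logical_and(pc, gc).sum() / max(union, 1))
    return float(np.mean(ious))

def sad(pred: np.ndarray, gt: np.ndarray) -> float:
    """Sum of absolute differences, conventionally reported /1000."""
    return float(np.abs(pred - gt).sum() / 1000.0)

def mse(pred: np.ndarray, gt: np.ndarray) -> float:
    return float(np.mean((pred - gt) ** 2))

def dtssd(pred: np.ndarray, gt: np.ndarray) -> float:
    """Temporal stability for (T, H, W) alpha sequences: RMS error
    between predicted and true frame-to-frame changes."""
    d_pred, d_gt = np.diff(pred, axis=0), np.diff(gt, axis=0)
    return float(np.sqrt(np.mean((d_pred - d_gt) ** 2)))
```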
3.2. Experimental Results
3.2.1. Ablation Study
3.2.2. Quantitative and Qualitative Results
4. Discussion
5. Conclusions
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
| Name | Description |
|---|---|
| VideoMatte240K | Includes 484 high-resolution alpha masks and foreground video clips, totaling 240,709 frames. The alpha masks and foregrounds were extracted from green-screen stock footage by the University of Washington. Of the video clips, 384 are in 4K resolution and 100 in HD resolution. |
| YouTubeVIS | A large-scale video instance segmentation dataset, filtered here to the 2985 video clips that contain humans. |
| PhotoMatte85 | Includes 85 half-length portraits without backgrounds. |
| Adobe Image Matting | Provides 214 examples of portrait segmentation. |
| Supervisely Person Dataset | Consists of 5711 images, including 4079 labeled examples of the human body. |
| VMD | Includes 39 sets of alpha masks and foreground video clips in three categories (full body, half body, and close-ups of people), totaling 16,698 images. |
| Distinction-646 | We screened the 372 images in the dataset that contain people. |
| BackgroundVideos (BG) | We screened the 3279 videos in the dataset that do not contain people. |
| Dataset | Method | MIoU | SAD | dtSSD | MSE |
|---|---|---|---|---|---|
| VS-test | RVM | 0.959 | 4.21 | 2.02 | 0.037 |
| VS-test | RVM+ | 0.974 | 5.81 | 1.78 | 0.026 |
| VMHD_TS | RVM | 0.833 | 10.36 | 3.12 | 0.047 |
| VMHD_TS | RVM+ | 0.852 | 6.57 | 1.90 | 0.011 |
| PhotoMatte85-TS | RVM | 0.871 | 6.52 | 3.12 | 0.047 |
| PhotoMatte85-TS | RVM+ | 0.895 | 0.971 | 1.84 | 0.014 |
| Dataset | MIoU | SAD | dtSSD | MSE | bps | FPS |
|---|---|---|---|---|---|---|
| VS-test | 0.974 | 5.81 | 1.78 | 0.026 | 3.32 M | 32.3 |
| VMHD_TS | 0.852 | 6.57 | 1.90 | 0.011 | 4.15 M | 26.7 |
| PhotoMatte85-TS | 0.895 | 0.971 | 1.84 | 0.014 | 3.72 M | 28.6 |
| Processing | MIoU | SAD | dtSSD | MSE |
|---|---|---|---|---|
| Before knowledge distillation | 0.974 | 5.81 | 1.78 | 0.026 |
| After knowledge distillation | 0.962 | 6.57 | 1.90 | 0.022 |
Metric columns are grouped by test set: VS = VS-test, VMHD = VMHD_TS, PM85 = PhotoMatte85-TS.

| Model | MIoU (VS) | SAD (VS) | dtSSD (VS) | MSE (VS) | MIoU (VMHD) | SAD (VMHD) | dtSSD (VMHD) | MSE (VMHD) | MIoU (PM85) | SAD (PM85) | dtSSD (PM85) | MSE (PM85) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BGMv2 | 0.906 | 25.19 | 4.51 | 0.151 | 0.784 | 28.50 | 5.18 | 0.043 | 0.761 | 26.56 | 4.68 | 0.051 |
| DeepLabV3 | 0.884 | 27.13 | 6.08 | 0.213 | 0.766 | 30.25 | 6.23 | 0.047 | 0.725 | 35.27 | 5.60 | 0.071 |
| MODNet | 0.942 | 16.34 | 2.69 | 0.088 | 0.846 | 12.39 | 3.38 | 0.039 | 0.843 | 10.64 | 3.25 | 0.037 |
| ConnectNet | 0.955 | 14.47 | 2.09 | 0.035 | 0.867 | 17.98 | 2.74 | 0.042 | 0.881 | 15.88 | 2.56 | 0.027 |
| RVM+ | 0.974 | 5.81 | 1.78 | 0.026 | 0.852 | 6.57 | 1.90 | 0.011 | 0.895 | 0.971 | 1.84 | 0.014 |