Video Coding Based on Ladder Subband Recovery and ResGroup Module
Abstract
1. Introduction
1.1. Current Research Status of Video Coding
1.2. Discrete Wavelet Transform and Deep Learning-Based Video Coding
1.3. Contributions of This Study
- Innovative LadderConv Framework: This paper introduces a framework that recovers wavelet subbands step by step, restoring low-frequency and high-frequency information in stages with particular attention to detailed features such as edges and textures, thereby improving the coding quality of video frames. The framework first processes the high-frequency subbands, which carry less information, and strengthens the interaction between subbands during recovery, finally producing a high-quality reconstructed image through the inverse wavelet transform. To further improve detail recovery and reconstruction accuracy, LadderConv incorporates spatial and channel attention mechanisms: spatial attention focuses on the important spatial regions of the image, highlighting key information, while channel attention weights the feature channels to strengthen their expressive power, together ensuring the quality and accuracy of the reconstructed image.
- Combination of Wavelet Transform and Step-by-Step Recovery: By pairing the wavelet transform with step-by-step recovery, the LadderConv framework exploits both low-frequency and high-frequency information, progressively restoring image details while preserving overall reconstruction quality. The low-frequency and high-frequency subbands are combined organically, so the integrity of the low-frequency content is maintained while the recovery of fine detail improves.
- Innovative ResGroup Module: To optimize the performance of the LadderConv framework in video coding, particularly for detail recovery and high-frequency feature extraction, this paper designs an innovative ResGroup module. Through multi-layer convolutions, feature-map compression and recovery, and feature-fusion strategies, the ResGroup module enhances the network's expressiveness while reducing computational complexity. The module captures features at multiple levels, from low to high, and preserves rich feature information through residual connections, further improving the model's reconstruction performance.
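The wavelet machinery the contributions above build on can be illustrated with a minimal one-level 2-D Haar transform in NumPy. This is a didactic sketch only, not the paper's implementation: the learned ladder-style recovery steps and attention modules are omitted, and the function names (`haar_dwt2`, `haar_idwt2`) are our own. It shows how a frame splits into the four subbands (LL plus the detail subbands LH, HL, HH) and how the inverse transform recombines them into a full-resolution reconstruction.

```python
import numpy as np

def haar_dwt2(x):
    """One level of a 2-D Haar DWT: split an image into LL, LH, HL, HH subbands."""
    a = (x[0::2, :] + x[1::2, :]) / 2.0   # vertical average
    d = (x[0::2, :] - x[1::2, :]) / 2.0   # vertical difference
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0  # low-low: coarse content
    hl = (a[:, 0::2] - a[:, 1::2]) / 2.0  # horizontal detail
    lh = (d[:, 0::2] + d[:, 1::2]) / 2.0  # vertical detail
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0  # diagonal detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse transform: recombine the four subbands into the full-resolution image."""
    a = np.empty((ll.shape[0], 2 * ll.shape[1]))
    d = np.empty_like(a)
    a[:, 0::2], a[:, 1::2] = ll + hl, ll - hl
    d[:, 0::2], d[:, 1::2] = lh + hh, lh - hh
    x = np.empty((2 * a.shape[0], a.shape[1]))
    x[0::2, :], x[1::2, :] = a + d, a - d
    return x

frame = np.random.rand(8, 8)
ll, lh, hl, hh = haar_dwt2(frame)
# The detail subbands (LH, HL, HH) carry edge/texture information, while LL
# carries the coarse content. A ladder-style decoder would refine the subbands
# in sequence before this inverse transform produces the reconstruction.
recon = haar_idwt2(ll, lh, hl, hh)
assert np.allclose(recon, frame)  # the Haar pair is perfectly invertible
```

Because the forward/inverse pair is lossless, any improvement a decoder makes to the subbands (e.g. sharpening the high-frequency detail) carries through directly to the reconstructed frame.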
2. Proposed Video Coding Framework
2.1. Video Coding Framework Based on DWT and ResGroup Module
2.1.1. Core Encoder
2.1.2. Hyper-Prior
2.1.3. Core Decoder
2.2. LadderConv Module
2.3. ResGroup Module
2.4. Loss Function
3. Experiments
3.1. Comparisons of Compression Performance
3.2. Comparisons of R-D Performance with Baseline
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Sullivan, G.J.; Ohm, J.R.; Han, W.J.; Wiegand, T. Overview of the high efficiency video coding (HEVC) standard. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1649–1668. [Google Scholar] [CrossRef]
- Chen, Y.; Mukherjee, D.; Han, J.; Grange, A.; Xu, Y.; Liu, Z.; Parker, S.; Chen, C.; Su, H.; Joshi, U.; et al. An overview of core coding tools in the AV1 video codec. In Proceedings of the 2018 Picture Coding Symposium (PCS), San Francisco, CA, USA, 24–27 June 2018; pp. 41–45. [Google Scholar]
- Bross, B.; Wang, Y.K.; Ye, Y.; Liu, S.; Chen, J.; Sullivan, G.J.; Ohm, J.R. Overview of the versatile video coding (VVC) Standard and its applications. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 3736–3764. [Google Scholar] [CrossRef]
- Jacob, A.; Pawar, V.; Vishwakarma, V.; Mane, A. Deep learning approach to video compression. In Proceedings of the 2019 IEEE Bombay Section Signature Conference (IBSSC), Mumbai, India, 26–28 July 2019; pp. 1–5. [Google Scholar]
- Rippel, O.; Mertens, P.; Merhof, D. Modeling the distribution of normal data in pre-trained deep features for anomaly detection. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 6726–6733. [Google Scholar]
- Jedari, B.; Premsankar, G.; Illahi, G.; Di Francesco, M.; Mehrabi, A.; Ylä-Jääski, A. Video caching, analytics, and delivery at the wireless edge: A survey and future directions. IEEE Commun. Surv. Tutorials 2020, 23, 431–471. [Google Scholar] [CrossRef]
- Lu, G.; Zhang, X.; Ouyang, W.; Chen, L.; Gao, Z.; Xu, D. An end-to-end learning framework for video compression. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3292–3308. [Google Scholar] [CrossRef] [PubMed]
- Bao, W.; Lai, W.S.; Zhang, X.; Gao, Z.; Yang, M.H. MEMC-Net: Motion estimation and motion compensation driven neural network for video interpolation and enhancement. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 933–948. [Google Scholar] [CrossRef] [PubMed]
- Kyslov, F.; Paniconi, M.; Jiang, J.; Wang, Y.; Tsai, C.Y. Optimizing AV1 encoder for real-time communication. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 941–945. [Google Scholar]
- Chen, S.; Liu, Q.; Yang, Y. Adaptive multi-modality residual network for compression distorted multi-view depth video enhancement. IEEE Access 2020, 8, 97072–97081. [Google Scholar] [CrossRef]
- Medeiros, A.; Di Maio, A.; Braun, T.; Neto, A. TENET: Adaptive service chain orchestrator for MEC-enabled low-latency 6DoF virtual reality. IEEE Trans. Netw. Serv. Manag. 2023, 21, 1894–1911. [Google Scholar] [CrossRef]
- Blattmann, A.; Dockhorn, T.; Kulal, S.; Mendelevitch, D.; Kilian, K.; Lorenz, D.; Levi, Y.; English, Z.; Voleti, V.; Letts, A.; et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv 2023, arXiv:2311.15127. [Google Scholar]
- Chai, W.; Guo, X.; Wang, G.; Lu, Y. Stablevideo: Text-driven consistency-aware diffusion video editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 23040–23050. [Google Scholar]
- Yang, Z.; Teng, J.; Zheng, W.; Ding, M.; Huang, S.; Xu, J.; Yang, Y.; Hong, W.; Zhang, X.; Feng, G.; et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv 2024, arXiv:2408.06072. [Google Scholar]
- Zhang, X.; Duan, Z.; Gong, D.; Liu, L. Training-free motion-guided video generation with enhanced temporal consistency using motion consistency loss. arXiv 2025, arXiv:2501.07563. [Google Scholar]
- Cai, Y.; Liu, Y.; Gao, E.; Jiang, T.; Zhang, Z.; Hengel, A.; Shi, J.Q. On the value of cross-modal misalignment in multimodal representation learning. arXiv 2025, arXiv:2504.10143. [Google Scholar]
- Liu, Y.; Gong, D.; Cai, Y.; Gao, E.; Zhang, Z.; Huang, B.; Gong, M.; Hengel, A.; Shi, J.Q. I predict therefore I am: Is next token prediction enough to learn human-interpretable concepts from data? arXiv 2025, arXiv:2503.08980. [Google Scholar]
- Li, Q.; Shen, L.; Guo, S.; Lai, Z. Wavelet integrated CNNs for noise-robust image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7243–7252. [Google Scholar]
- Mishra, D.; Singh, S.K.; Singh, R.K. Wavelet-based deep auto encoder-decoder (WDAED)-based image compression. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 1452–1462. [Google Scholar] [CrossRef]
- Jia, G.; Zheng, M.; Hu, C.; Ma, X.; Xu, Y.; Liu, L.; Deng, Y.; He, R. Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Trans. Biom. Behav. Identity Sci. 2021, 3, 308–319. [Google Scholar] [CrossRef]
- Fu, M.; Liu, H.; Yu, Y.; Chen, J.; Wang, K. DW-GAN: A discrete wavelet transform GAN for nonhomogeneous dehazing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 203–212. [Google Scholar]
- Dong, C.; Ma, H.; Liu, D.; Woods, J.W. Wavelet-based learned scalable video coding. In Proceedings of the 2022 IEEE International Symposium on Circuits and Systems (ISCAS), Austin, TX, USA, 27 May–1 June 2022; pp. 3190–3194. [Google Scholar]
- Cheng, Z.; Sun, H.; Takeuchi, M.; Katto, J. Learned image compression with discretized gaussian mixture likelihoods and attention modules. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 7936–7945. [Google Scholar]
- Fu, H.; Liang, F.; Lin, J.; Li, B.; Akbari, M.; Liang, J.; Zhang, G.; Liu, D.; Tu, C.; Han, J. Learned image compression with Gaussian-Laplacian-Logistic mixture model and concatenated residual modules. IEEE Trans. Image Process. 2023, 32, 2063–2076. [Google Scholar] [CrossRef] [PubMed]
- Zhu, Y.; Yang, Y.; Cohen, T. Transformer-based transform coding. In Proceedings of the International Conference on Learning Representations, Virtual Event, 25–29 April 2022. [Google Scholar]
- He, D.; Zheng, Y.; Sun, B.; Wang, Y.; Qin, H. Checkerboard context model for efficient learned image compression. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14766–14775. [Google Scholar]
- Qian, Y.; Lin, M.; Tan, Z.; Jin, R. Entroformer: A transformer-based entropy model for learned image compression. In Proceedings of the International Conference on Learning Representations, Virtual Event, 25–29 April 2022. [Google Scholar]
- Zou, R.; Song, C.; Zhang, Z. The devil is in the details: Window-based attention for image compression. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 17471–17480. [Google Scholar]
- Xie, Y.; Cheng, K.L.; Chen, Q. Enhanced invertible encoding for learned image compression. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, China, 20–24 October 2021; pp. 162–170. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Wei, L.; Zhang, A.; Liu, L.; Wang, J.; Wang, S. Video Coding Based on Ladder Subband Recovery and ResGroup Module. Entropy 2025, 27, 734. https://doi.org/10.3390/e27070734