Lightweight CNN–Transformer Hybrid Network for Efficient Face Super-Resolution
Abstract
1. Introduction
- To address the high computational cost and large parameter counts of existing CNN–Transformer face super-resolution methods, we propose HCTIUNet, a lightweight hybrid network that combines CNN-based local feature extraction with Transformer-based global dependency modeling in a single framework.
- To improve local–global feature interaction under a lightweight design, we construct an inverted U-shaped architecture composed of lightweight CNN–Transformer interaction blocks. This structure enables progressive multi-scale feature exchange and enhances the representation of facial textures and structural information.
- To alleviate the limited feature representation capability of existing lightweight SR models, we design a lightweight CNN–Transformer interaction block, in which the CNN branch extracts local facial details while the Transformer branch captures long-range contextual dependencies, thereby achieving complementary local and global feature modeling.
- To enhance the reconstruction of fine facial structures and reduce the loss of detail caused by lightweight processing, we introduce multi-scale feature fusion and global feature refinement modules, which strengthen the representation of key facial regions and improve reconstruction quality.
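
The complementary local and global branches described in the contributions above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`depthwise_conv3x3`, `channel_attention`, `interaction_block`), the depthwise 3×3 filter used for the CNN branch, the attention computed across channels for the Transformer branch, and the fusion by residual addition are all assumptions chosen for illustration.

```python
import numpy as np


def depthwise_conv3x3(x, kernels):
    """Local branch (assumed form): one 3x3 filter per channel, zero padding.

    x has shape (C, H, W); kernels has shape (C, 3, 3). As is usual in deep
    learning, the kernel is not flipped (cross-correlation).
    """
    c, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x, dtype=float)
    for dy in range(3):
        for dx in range(3):
            weight = kernels[:, dy, dx, None, None]          # shape (C, 1, 1)
            out += weight * padded[:, dy:dy + h, dx:dx + w]  # shifted view
    return out


def channel_attention(x):
    """Global branch (assumed form): attention over channels.

    Each channel is flattened into one token of length H*W, so the attention
    map is C x C and every output channel mixes information from all spatial
    positions. The cost grows with C^2 rather than (H*W)^2.
    """
    c, h, w = x.shape
    tokens = x.reshape(c, h * w)
    norm = np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8
    q = k = tokens / norm                         # cosine-normalised queries/keys
    scores = q @ k.T                              # (C, C) channel similarities
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return (weights @ tokens).reshape(c, h, w)


def interaction_block(x, kernels):
    """Fuse both branches with a residual connection (hypothetical fusion rule)."""
    return x + depthwise_conv3x3(x, kernels) + channel_attention(x)


rng = np.random.default_rng(0)
features = rng.standard_normal((4, 8, 8))
filters = rng.standard_normal((4, 3, 3)) * 0.1
print(interaction_block(features, filters).shape)
```

The output keeps the input shape, so blocks of this kind can be stacked at each scale of a U-shaped network. A trained model would use learned projections for the queries, keys, and values, which this sketch omits.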
2. Related Work
2.1. CNN-Based Face Super-Resolution Methods
2.2. Transformer-Based and CNN–Transformer Hybrid Methods
2.3. Prior-Guided FSR Methods
2.4. Lightweight Super-Resolution Methods
3. Materials and Methods
3.1. Overall Architecture of HCTIUNet
3.2. IUNet
3.3. Feature Refinement
3.4. Loss Function
4. Experimental Results and Analysis
4.1. Datasets
4.2. Parameter Settings
4.3. Experimental Results and Comparison
4.4. Ablation Study
5. Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| HCTIUNet | Hybrid CNN–Transformer Inverted U-Net Architecture |
| IUNet | Inverted UNet |
| MDTA | Multi-head Transpositional Self-Attention |
| LPCTB | Lightweight Processing of CNN–Transformer Block |
| LFIEB | Local Face semantic Information Extraction Block |
| MCT | Multi-scale channel Transformer |
| MFEU | Multi-scale Fusion Enhancement Unit |
| GFRB | Global Feature Refinement Block |
| LFEU | Local Feature Extraction Unit |
Appendix A
(a) FFHQ Dataset

| Methods | Scale | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|---|
| RCAN | | 26.93 ± 0.22 | 0.767 ± 0.015 | 0.236 ± 0.024 |
| SCTANet | | 27.69 ± 0.14 | 0.782 ± 0.006 | 0.206 ± 0.010 |
| SISN | | 26.12 ± 0.18 | 0.771 ± 0.008 | 0.228 ± 0.013 |
| VDSR | | 26.45 ± 0.12 | 0.760 ± 0.008 | 0.237 ± 0.011 |
| RFDN | | 25.72 ± 0.21 | 0.676 ± 0.017 | 0.251 ± 0.016 |
| MSFSR | | 25.33 ± 0.24 | 0.660 ± 0.018 | 0.250 ± 0.021 |
| XLSR | | 25.25 ± 0.25 | 0.622 ± 0.020 | 0.235 ± 0.023 |
| HCTIUNet | | 27.55 ± 0.15 | 0.765 ± 0.009 | 0.225 ± 0.009 |

(b) CelebA Dataset

| Methods | Scale | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|---|
| RCAN | | 27.08 ± 0.18 | 0.766 ± 0.017 | 0.209 ± 0.018 |
| SCTANet | | 27.75 ± 0.12 | 0.780 ± 0.004 | 0.215 ± 0.013 |
| SISN | | 26.24 ± 0.15 | 0.752 ± 0.010 | 0.233 ± 0.010 |
| VDSR | | 26.80 ± 0.14 | 0.772 ± 0.009 | 0.244 ± 0.017 |
| RFDN | | 25.70 ± 0.19 | 0.653 ± 0.014 | 0.265 ± 0.021 |
| MSFSR | | 25.16 ± 0.22 | 0.660 ± 0.016 | 0.250 ± 0.018 |
| XLSR | | 25.25 ± 0.20 | 0.624 ± 0.014 | 0.240 ± 0.017 |
| HCTIUNet | | 27.63 ± 0.14 | 0.761 ± 0.006 | 0.212 ± 0.015 |

(c) Helen Dataset

| Methods | Scale | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|---|
| RCAN | | 26.84 ± 0.22 | 0.735 ± 0.015 | 0.243 ± 0.031 |
| SCTANet | | 27.76 ± 0.04 | 0.775 ± 0.006 | 0.186 ± 0.008 |
| SISN | | 26.15 ± 0.13 | 0.753 ± 0.008 | 0.243 ± 0.013 |
| VDSR | | 26.56 ± 0.16 | 0.766 ± 0.007 | 0.231 ± 0.015 |
| RFDN | | 25.45 ± 0.23 | 0.682 ± 0.021 | 0.269 ± 0.019 |
| MSFSR | | 25.88 ± 0.14 | 0.671 ± 0.023 | 0.238 ± 0.020 |
| XLSR | | 25.51 ± 0.18 | 0.636 ± 0.025 | 0.267 ± 0.022 |
| HCTIUNet | | 27.53 ± 0.09 | 0.777 ± 0.005 | 0.213 ± 0.010 |
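
The "mean ± standard deviation" PSNR entries in the tables above can be related to the standard PSNR definition, PSNR = 10 log10(peak² / MSE), with a short sketch. The helper names `psnr` and `mean_and_std`, the grayscale nested-list image format, and the use of the sample standard deviation are assumptions for illustration; the evaluation protocol actually used for the tables (colour space, cropping, number of runs) is not stated here.

```python
import math
import statistics


def psnr(reference, restored, peak=255.0):
    """PSNR in dB between two equally sized grayscale images (nested lists)."""
    squared_errors = [
        (r - s) ** 2
        for ref_row, res_row in zip(reference, restored)
        for r, s in zip(ref_row, res_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


def mean_and_std(values):
    """Mean and sample standard deviation (n - 1 denominator, an assumption)."""
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical per-run scores, only to show the reporting format.
runs = [27.4, 27.6, 27.5]
mean, std = mean_and_std(runs)
print(f"{mean:.2f} ± {std:.2f}")
```

Higher PSNR means lower pixel-wise error, which is why the PSNR and SSIM columns are marked ↑ while LPIPS, a perceptual distance, is marked ↓.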
References
- Baker, S.; Kanade, T. Limits on super-resolution and how to break them. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 1167–1183.
- Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a deep convolutional network for image super-resolution. In Computer Vision–ECCV; Springer International Publishing: Cham, Switzerland, 2014; pp. 184–199.
- Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2018; pp. 286–301.
- Bao, Q.; Liu, Y.; Gang, B.; Yang, W.; Liao, Q. SCTANet: A spatial attention-guided CNN-transformer aggregation network for deep face image super-resolution. IEEE Trans. Multimed. 2023, 25, 8554–8565.
- Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2016; pp. 1646–1654.
- Zhu, S.; Liu, S.; Loy, C.C.; Tang, X. Deep cascaded bi-network for face hallucination. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2016; pp. 614–630.
- Lu, T.; Wang, H.; Xiong, Z.; Jiang, J.; Zhang, Y.; Zhou, H.; Wang, Z. Face hallucination using region-based deep convolutional networks. In 2017 IEEE International Conference on Image Processing (ICIP); IEEE: Piscataway, NJ, USA, 2017; pp. 1657–1661.
- Chen, Y.; Tai, Y.; Liu, X.; Shen, C.; Yang, J. FSRNET: End-to-end learning face super-resolution with facial priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 2492–2501.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 10012–10022.
- Yoo, J.; Kim, T.; Lee, S.; Kim, S.H.; Lee, H.; Kim, T.H. Enriched CNN-transformer feature aggregation networks for super resolution. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; IEEE: Piscataway, NJ, USA, 2023; pp. 4956–4965.
- Zhao, T.; Zhang, C. SAAN: Semantic attention adaptation network for face super resolution. In 2020 IEEE International Conference on Multimedia and Expo (ICME); IEEE: Piscataway, NJ, USA, 2020; pp. 1–6.
- Wang, Q.; Gao, Q.; Wu, L.; Sun, G.; Jiao, L. Adversarial Multi-Path Residual Network for image super-resolution. IEEE Trans. Image Process. 2021, 30, 6648–6658.
- Li, W.; Wang, M.; Zhang, K.; Li, J.; Li, X.; Zhang, Y.; Gao, G.; Ma, Z. Survey on deep face restoration: From non-blind to blind and beyond. arXiv 2023, arXiv:2309.15490.
- Zhang, C.; Liu, Z. Face super-resolution with progressive embedding of multi-scale face priors. In IEEE International Joint Conference on Biometrics; IEEE: Piscataway, NJ, USA, 2022; pp. 1–8.
- Liu, J.; Tang, J.; Wu, G. Residual feature distillation network for lightweight image super-resolution. In ECCV Workshops; Springer International Publishing: Cham, Switzerland, 2020; pp. 41–55.
- Ayazoglu, M. Extremely lightweight quantization robust real-time single-image super resolution for mobile devices. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 2472–2479.
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In MICCAI; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241.
- Chen, C.; Gong, D.; Wang, H.; Li, Z.; Wong, K.Y.K. Learning spatial attention for face super-resolution. IEEE Trans. Image Process. 2020, 30, 1219–1231.
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
- Zhang, Y.; Wu, Y.; Chen, L. MSFSR: A multi-stage face super-resolution with accurate facial representation via enhanced facial boundaries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops; IEEE: Piscataway, NJ, USA, 2020; pp. 504–505.












| Methods | Scale | FFHQ (PSNR ↑/SSIM ↑/LPIPS ↓) | CelebA (PSNR ↑/SSIM ↑/LPIPS ↓) | Helen (PSNR ↑/SSIM ↑/LPIPS ↓) |
|---|---|---|---|---|
| RCAN | | 26.93/0.767/0.236 | 27.08/0.766/0.209 | 26.84/0.735/0.243 |
| SCTANet | | 27.69/0.782/0.206 | 27.75/0.780/0.215 | 27.76/0.775/0.186 |
| SISN | | 26.12/0.771/0.228 | 26.24/0.752/0.233 | 26.15/0.753/0.243 |
| VDSR | | 26.45/0.760/0.237 | 26.80/0.772/0.244 | 26.56/0.766/0.231 |
| RFDN | | 25.72/0.676/0.251 | 25.70/0.653/0.266 | 25.45/0.682/0.269 |
| MSFSR | | 25.33/0.660/0.250 | 25.16/0.624/0.240 | 25.88/0.671/0.238 |
| XLSR | | 25.25/0.622/0.235 | 25.48/0.627/0.259 | 25.51/0.636/0.267 |
| HCTIUNet | | 27.55/0.765/0.225 | 27.63/0.761/0.212 | 27.53/0.777/0.213 |

| Model | Params | FLOPs | Inference Time |
|---|---|---|---|
| RCAN | 15.9 M | 4.1 G | 0.069 s |
| SCTANet | 26.9 M | 11.2 G | 0.056 s |
| SISN | 20.4 M | 9.3 G | 0.087 s |
| VDSR | 17.53 M | 9.9 G | 0.071 s |
| RFDN | 500 K | 120.3 M | 0.027 s |
| MSFSR | 6.34 M | 1.2 G | 0.370 s |
| XLSR | 701 K | 180 M | 0.010 s |
| HCTIUNet | 10.5 M | 9.9 G | 0.021 s |
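
The efficiency figures in the table above can be turned into relative savings with a few lines of arithmetic. The sketch below copies the parameter counts and inference times for three of the listed models; the helper names `param_reduction` and `speedup` are hypothetical.

```python
# Values copied from the efficiency table above
# (parameters in millions, inference time in seconds).
MODELS = {
    "RCAN": {"params_m": 15.9, "time_s": 0.069},
    "SCTANet": {"params_m": 26.9, "time_s": 0.056},
    "HCTIUNet": {"params_m": 10.5, "time_s": 0.021},
}


def param_reduction(ours, other):
    """Fraction of parameters saved by `ours` relative to `other`."""
    return 1.0 - MODELS[ours]["params_m"] / MODELS[other]["params_m"]


def speedup(ours, other):
    """How many times faster `ours` runs than `other`."""
    return MODELS[other]["time_s"] / MODELS[ours]["time_s"]


for name in ("RCAN", "SCTANet"):
    print(
        f"vs {name}: {param_reduction('HCTIUNet', name):.1%} fewer parameters, "
        f"{speedup('HCTIUNet', name):.2f}x faster"
    )
```

With these table values, HCTIUNet uses about 61% fewer parameters than SCTANet and about 34% fewer than RCAN, while the much smaller RFDN and XLSR models remain far lighter still.
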

| Model | LFIEB | MCT | GFRB | PSNR/SSIM | Params/FLOPs |
|---|---|---|---|---|---|
| BaseLine | ✓ | × | × | 26.78/0.761 | 6.2 M/6.5 G |
| IUNet-G | ✓ | × | ✓ | 26.91/0.763 | 7.3 M/8.1 G |
| IUNet-M | ✓ | ✓ | × | 27.44/0.770 | 9.2 M/9.6 G |
| HCTIUNet | ✓ | ✓ | ✓ | 27.57/0.775 | 10.5 M/10.1 G |
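
To read the ablation table above as a cost-benefit comparison, the PSNR gain and added parameters of each variant can be computed relative to the baseline row. This is plain arithmetic on the table values; the function name `gains_over_baseline` is hypothetical.

```python
# (model, PSNR in dB, parameters in millions), copied from the ablation table.
ABLATION = [
    ("BaseLine", 26.78, 6.2),
    ("IUNet-G", 26.91, 7.3),   # baseline + GFRB
    ("IUNet-M", 27.44, 9.2),   # baseline + MCT
    ("HCTIUNet", 27.57, 10.5),  # baseline + MCT + GFRB
]


def gains_over_baseline(rows):
    """Map each variant to (PSNR gain in dB, extra parameters in millions)."""
    _, base_psnr, base_params = rows[0]
    return {
        name: (round(psnr - base_psnr, 2), round(params - base_params, 1))
        for name, psnr, params in rows[1:]
    }


for name, (d_psnr, d_params) in gains_over_baseline(ABLATION).items():
    print(f"{name}: +{d_psnr:.2f} dB PSNR for +{d_params:.1f} M parameters")
```

On these numbers, adding MCT accounts for most of the improvement (+0.66 dB), while GFRB adds a further 0.13 dB in both the baseline and the full configuration.
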

| Model | Scale | PSNR/SSIM | Params | FLOPs |
|---|---|---|---|---|
| BaseLine | | 26.83/0.766 | 6.2 M | 6.5 G |
| IUNet-T | | 27.22/0.772 | 7.8 M | 8.1 G |
| HCTIUNet | | 27.54/0.775 | 10.5 M | 10.0 G |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Liu, A.-L.; Xu, Y.-H.; Zhou, W. Lightweight CNN–Transformer Hybrid Network for Efficient Face Super-Resolution. Appl. Sci. 2026, 16, 6221. https://doi.org/10.3390/app16126221

