4.1. Datasets and Experimental Settings
The CelebA-HQ dataset is a publicly available dataset for computer vision research, comprising 30,000 face images at varying resolutions. The dataset provides images at five distinct scales, catering to diverse application requirements. For this study, we curated a subset of 15,000 images at two of these resolutions, with a portion of the images designated as the test set.
The model was trained on a server equipped with eight NVIDIA RTX 3090 Ti GPUs (NVIDIA, Santa Clara, CA, USA), running Ubuntu 18.04 and PyTorch 2.1.0. All other hyperparameters are listed in
Table 1.
During the initial phase of generator training, a larger learning rate (Lr) is employed to converge rapidly toward a local minimum of the loss function. Because face image SR requires the restoration of high-frequency details, the Lr is progressively reduced during the later training stages to prevent the parameters from oscillating around the optimum and to ensure stable convergence; the generator's Lr therefore follows a staged decay schedule.
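Purely as an illustration of such a staged schedule, the sketch below uses PyTorch's standard MultiStepLR scheduler; the initial rate, milestone epochs, and decay factor are hypothetical placeholders rather than the settings used in this work.

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

# All values below are hypothetical placeholders, not the settings of this work.
generator = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)   # stand-in generator
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4)  # larger initial Lr

# A larger Lr is used early for fast convergence; the scheduler then halves it
# at the milestone epochs so parameters do not oscillate around the optimum.
scheduler = MultiStepLR(optimizer, milestones=[50, 100, 150], gamma=0.5)

for epoch in range(200):
    # ... forward pass, loss computation, and backpropagation for one epoch ...
    optimizer.step()
    scheduler.step()   # apply the staged learning-rate reduction
```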
The generator employs a loss function that integrates three components: a perceptual loss, an L1 loss, and a multi-scale adversarial loss. Among these, L1 loss emphasizes pixel-level errors; however, insufficient weight allocation to this term may fail to adequately constrain the generator’s output. Conversely, the multi-scale adversarial loss drives the generator to produce realistic facial images, but excessive weighting may induce color artifacts in localized regions. To address this, the training process dynamically adjusts loss weights through a phase-wise focus on distinct objectives (structural fidelity followed by detail refinement), thereby balancing L1 and multi-scale adversarial loss contributions for high-fidelity facial reconstruction. The mathematical formulations are defined as follows:
\[
L_{G} = \lambda_{per} L_{per} + \lambda_{1} L_{1} + \lambda_{adv} L_{adv},
\]
where \( \lambda_{per} \), \( \lambda_{1} \), and \( \lambda_{adv} \) denote the weights of the perceptual loss, the L1 loss, and the multi-scale adversarial loss, respectively, and are held at their initial values during the early training epochs. Once the number of epochs exceeds a preset threshold, \( \lambda_{1} \) progressively increases while \( \lambda_{adv} \) decreases during training.
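To make the phase-wise weighting concrete, the following sketch shows one possible way to combine the three terms; the threshold epoch, the weight values, and the weight schedule are illustrative assumptions, not the exact settings of this work.

```python
import torch
import torch.nn.functional as F

def generator_loss(sr, hr, feat_sr, feat_hr, disc_logits, epoch, threshold=100):
    """Illustrative combination of the perceptual, L1, and multi-scale
    adversarial losses with phase-wise weights (all values hypothetical).

    disc_logits: list of raw discriminator outputs, one per scale.
    feat_sr / feat_hr: feature maps from a fixed feature extractor (e.g., VGG).
    """
    l1 = F.l1_loss(sr, hr)                      # pixel-level fidelity
    perceptual = F.l1_loss(feat_sr, feat_hr)    # feature-space similarity
    # Multi-scale adversarial term: average the generator's adversarial
    # loss over all discriminator scales.
    adversarial = torch.stack([
        F.binary_cross_entropy_with_logits(d, torch.ones_like(d))
        for d in disc_logits
    ]).mean()

    # Phase-wise weighting: after the threshold epoch, the L1 weight is
    # progressively increased and the adversarial weight reduced.
    t = min(max(epoch - threshold, 0) / threshold, 1.0)
    w_l1 = 1.0 + t                    # grows from 1.0 toward 2.0
    w_adv = 5e-3 * (1.0 - 0.8 * t)    # shrinks toward 1e-3
    w_per = 1.0

    return w_per * perceptual + w_l1 * l1 + w_adv * adversarial
```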
4.2. Comparative Experiments
To validate the SR performance on face images, our proposed model was compared against representative approaches, including Bicubic interpolation, SRCNN, EDSR, SRGAN, ESRGAN, Real-ESRGAN, and SwinIR. All baseline models were retrained from scratch on the same dataset without using pretrained weights from the original releases. The architectures of the compared models strictly followed the specifications in the original papers, with minor hyperparameter adjustments made to accommodate hardware constraints on our server (e.g., GPU memory limitations). Quantitative evaluation was conducted using the widely adopted PSNR, SSIM [
31], and LPIPS [
32]. Higher PSNR and SSIM values indicate greater similarity between the super-resolved images and the original HR images, whereas lower LPIPS values indicate higher perceptual similarity.
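For reference, these metrics can be computed with standard open-source implementations; the minimal sketch below assumes float32 RGB images in [0, 1] and uses the publicly available scikit-image and lpips packages.

```python
import numpy as np
import torch
import lpips                                             # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net='alex')                       # learned perceptual metric

def evaluate(sr: np.ndarray, hr: np.ndarray):
    """Minimal illustrative sketch; sr, hr: float32 RGB images in [0, 1], shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(hr, sr, data_range=1.0)
    ssim = structural_similarity(hr, sr, data_range=1.0, channel_axis=-1)
    # LPIPS expects NCHW tensors scaled to [-1, 1]; lower values mean
    # higher perceptual similarity.
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None] * 2 - 1
    with torch.no_grad():
        lp = lpips_fn(to_tensor(sr), to_tensor(hr)).item()
    return psnr, ssim, lp
```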
Figure 5 presents a comparative analysis of SR results for face images. The first column shows the original HR images, followed by the super-resolved outputs of the seven benchmark models: Bicubic interpolation, SRCNN, EDSR, SRGAN, ESRGAN, Real-ESRGAN, and SwinIR. The final column displays the results of the proposed model.
Bicubic interpolation, a traditional interpolation-based method, generates new pixel values from a 16-pixel neighborhood. While computationally efficient, it introduces significant distortions in high-frequency face regions and achieves the lowest performance among all methods. SRCNN, the first CNN designed for SR, employs only three convolutional layers, which limits its receptive field and its ability to restore complex facial textures. EDSR adopts a deep residual architecture, substantially improving reconstruction quality; however, its high computational complexity and its tendency to produce over-smoothed super-resolved face images remain critical limitations. SRGAN pioneered the integration of GANs into image SR and outperforms conventional interpolation-based approaches, but its generated high-frequency textures are overly stylized and repetitive, with noticeable blurring around facial regions such as the mouth. ESRGAN enhances the generator architecture through improved feature fusion, yet it still suffers from structural distortions on complex face images. Real-ESRGAN employs spectral normalization and a U-Net discriminator to boost discriminative capability, but the absence of perceptual loss constraints leads to blurred details in critical regions such as the eyes. SwinIR introduces the Swin Transformer into the SR task, improving computational efficiency through window-based self-attention while retaining strong modeling of both local and global information; its results surpass those of the preceding methods. In contrast, our proposed improved model achieves the best reconstruction of both global structures and fine details, in line with human visual perception.
To conduct the comparative experiments rigorously, this study performs a quantitative analysis of the proposed model against the seven baseline models. To assess model stability, we ran three independent tests of all models on three different test sets and report the means and standard deviations of PSNR, SSIM, and LPIPS. These metrics are presented in
Table 2. Among these, the Bicubic model exhibits the lowest values, consistent with the qualitative comparisons shown in
Figure 5. As the first GAN-based SR model, SRGAN demonstrates relatively inferior performance owing to its simple architecture; its metric values surpass only the interpolation- and CNN-based baselines while trailing behind more advanced GAN-based approaches such as ESRGAN. Our improvements to the generator and discriminator, together with the phased and refined training strategy, yield the highest PSNR and SSIM values and the lowest LPIPS among all evaluated models. Moreover, in terms of the standard deviations of PSNR, SSIM, and LPIPS, our method is also the most stable across the three independent tests.
This study implemented four key improvements to the model architecture: the incorporation of an Edge-guided Enhancement Block (EEB) and dilated convolution block, the redesign of standard residual blocks into Multi-scale Hybrid Attention Residual Blocks (MHARB), and the enhancement of the standard discriminator to a multi-scale discriminator. To systematically evaluate the contribution of each component in face image SR tasks, an ablation study was conducted. Qualitative comparisons of the ablation study are visualized in
Figure 6, with quantitative metrics provided in
Table 3.
The EEB employs a detail-adaptive enhancement strategy to effectively reduce noise while precisely restoring high-frequency textures. The experimental results show that removing the EEB leads to severe distortions in the high-frequency details (e.g., ocular regions) of complex face images, causing visual discomfort. The multi-scale discriminator, by extracting features at multiple scales, balances the generation quality of global structures and local details, thereby better constraining the generator to produce realistic face images; its removal results in blurred artifacts in regions such as the eyes. The MHARB enhances the generator's modeling capacity for critical facial features via dual-branch convolution fused with CBAM. Likewise, removing the dilated convolution modules or reverting to the original residual blocks leads to a decline in PSNR and SSIM values.
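As a rough illustration of the MHARB design described above (parallel convolution branches with different receptive fields, fused and reweighted by CBAM-style channel and spatial attention inside a residual connection), a simplified sketch follows; the kernel sizes, channel counts, and layer arrangement are assumptions rather than the exact configuration of the proposed block.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Simplified CBAM: channel attention followed by spatial attention."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)        # channel attention
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))               # spatial attention

class MHARB(nn.Module):
    """Illustrative Multi-scale Hybrid Attention Residual Block: two parallel
    conv branches (3x3 and 5x5), fused and refined by CBAM, with a residual
    connection. Layer sizes here are hypothetical, not the paper's exact ones."""
    def __init__(self, channels=64):
        super().__init__()
        self.branch3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.branch5 = nn.Conv2d(channels, channels, 5, padding=2)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)
        self.cbam = CBAM(channels)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        y = torch.cat([self.act(self.branch3(x)),
                       self.act(self.branch5(x))], dim=1)       # multi-scale branches
        return x + self.cbam(self.fuse(y))                      # residual connection
```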
The generator model incorporates 16 improved residual blocks following standard convolutional layers. This study assesses how the number of residual blocks affects SR performance through comparative experiments. The results in
Figure 7 and
Table 4 reveal that incrementally increasing residual blocks gradually enhances global color fidelity and texture sharpness in super-resolved face images. However, when N exceeds 16, the quantitative metrics decline instead of improving. Furthermore, excessive residual blocks lead to unacceptably high computational overhead. Therefore,
N = 16 is selected as the optimal configuration.
During model training, the weights of the multi-scale adversarial loss function and L1 loss function were dynamically adjusted. By progressively reducing the multi-scale adversarial loss’s weight and increasing the L1 loss’s weight, the reconstruction of high-frequency components in face images was enhanced. In
Figure 8, the eyes in the second column of facial images exhibit noticeable scar-like artifacts, showing significant discrepancies compared to the original HR counterparts.
Table 5 illustrates the performance improvements in PSNR, SSIM, and LPIPS. The results indicate an enhancement of 1.56 dB for PSNR, 0.0398 for SSIM, and 2.1 for LPIPS.
Furthermore, during model training, a progressive reduction strategy was applied to the generator’s Lr, achieving superior SR results.
Figure 9 demonstrates that the decreasing Lr strategy better preserves texture details in reconstructed images.
Table 6 illustrates the performance improvements. The PSNR shows an enhancement of 2.67 dB; LPIPS is decreased by 0.49, while the SSIM rises by 0.0306, confirming the effectiveness of the proposed Lr adaptation.
To further demonstrate the generalization ability of the proposed method to different image distributions, we conducted a cross-dataset evaluation on the DIV2K dataset. DIV2K is a widely used benchmark for image super-resolution, featuring high-resolution images with rich texture details and complex degradation patterns. The experiment retains the same hyperparameters and training settings as described in
Section 4.1. The evaluation follows the same protocol (PSNR, SSIM, and LPIPS) as for CelebA-HQ.
As shown in
Figure 10 and
Table 7, the proposed model achieved competitive performance on DIV2K compared to other methods, demonstrating its robustness to diverse image characteristics.