Article

FACDIM: A Face Image Super-Resolution Method That Integrates Conditional Diffusion Models with Prior Attributes

School of Electronic and Information Engineering, Liaoning Technical University, Huludao 125105, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(10), 2070; https://doi.org/10.3390/electronics14102070
Submission received: 8 April 2025 / Revised: 6 May 2025 / Accepted: 9 May 2025 / Published: 20 May 2025
(This article belongs to the Special Issue AI-Driven Image Processing: Theory, Methods, and Applications)

Abstract

Facial image super-resolution seeks to reconstruct high-quality details from low-resolution inputs, yet traditional methods, such as interpolation, convolutional neural networks (CNNs), and generative adversarial networks (GANs), often fall short, suffering from insufficient realism, loss of high-frequency details, and training instability. Furthermore, many existing models inadequately incorporate facial structural attributes and semantic information, leading to semantically inconsistent generated images. To overcome these limitations, this study introduces an attribute-prior conditional diffusion implicit model that enhances the controllability of super-resolution generation and improves detail restoration capabilities. Methodologically, the framework consists of four components: a pre-super-resolution module, a facial attribute extraction module, a global feature encoder, and an enhanced conditional diffusion implicit model. Specifically, low-resolution images are subjected to preliminary super-resolution and attribute extraction, followed by adaptive group normalization to integrate feature vectors. Additionally, residual convolutional blocks are incorporated into the diffusion model to utilize attribute priors, complemented by self-attention mechanisms and skip connections to optimize feature transmission. Experiments conducted on the CelebA and FFHQ datasets demonstrate that the proposed model achieves an increase of 2.16 dB in PSNR and 0.08 in SSIM under an 8× magnification factor compared to SR3, with the generated images displaying more realistic textures. Moreover, manual adjustment of attribute vectors allows for directional control over generation outcomes (e.g., modifying facial features or lighting conditions), ensuring alignment with anthropometric characteristics. This research provides a flexible and robust solution for high-fidelity face super-resolution, offering significant advantages in detail preservation and user controllability.

1. Introduction

Image Super-Resolution (ISR) is a core technology in computer vision, and Face Image Super-Resolution (FISR), an important branch of ISR, aims to restore high-quality high-resolution (HR) face images from low-resolution (LR) inputs. It plays a key role in scenarios such as medical image analysis, security surveillance, and digital identity authentication [1]. However, the challenges of FISR stem from its ill-posed inverse nature: infinitely many high-resolution solutions correspond to a single low-resolution input, and the loss of high-frequency details caused by image degradation makes it difficult to balance pixel accuracy and semantic authenticity in the reconstruction results [2]. Although deep learning has driven rapid development in this field, existing methods still face bottlenecks such as insufficient realism, low computational efficiency, and weak controllability of generation, urgently requiring breakthroughs that combine new generative paradigms with domain prior knowledge [3,4,5].
Convolutional Neural Networks (CNNs) have achieved significant success in image super-resolution due to their powerful feature extraction and non-linear mapping capabilities. Early CNN-based methods, such as SRCNN [6], pioneered the application of deep learning to image super-resolution tasks. Subsequent research has led to substantial improvements in network architecture and training strategies, resulting in more advanced models, like VDSR [7], DRCN [8], and EDSR [9]. To better restore facial texture details, Chen proposed the Spatial Attention Residual Network (SPARNet) [10], which incorporates a Facial Attention Unit (FAU) using spatial attention mechanisms. The use of convolutional layers enables adaptive focus on important facial structural regions while minimizing attention on areas lacking distinctive features, thereby improving the reconstruction of facial details. Chen further developed SPARNetHD by combining SPARNet with a multi-scale discriminator, enabling the generation of higher-quality facial images at 512 × 512 resolution with good generalization on low-quality facial images. G. Gao introduced the CNN-Transformer Collaborative Network (CTCNet) to improve network performance by considering global and local facial features [11]. CTCNet features a multi-scale connected encoder–decoder structure and a Local–Global Feature Collaboration Module (LGCM) containing a Facial Structure Attention Unit (FSAU) and Transformer modules. This design aims to enhance the consistency of local facial details and global facial structure recovery. However, traditional CNN-based super-resolution methods have limitations in facial image super-resolution tasks. They are mostly generic image super-resolution models that do not fully utilize the unique structure of facial images and facial prior knowledge. These methods often use pixel-level Mean Squared Error (MSE) or Peak Signal-to-Noise Ratio (PSNR) as optimization objectives. While these metrics may improve objectively, the resulting facial images can appear overly smooth and lack realistic high-frequency details.
Generative Adversarial Networks (GANs) are widely used in facial super-resolution tasks to enhance the perceptual quality and realism of reconstructed facial images. Through adversarial training between a generator and a discriminator, GANs enable the generator to learn realistic facial image distributions, producing more authentic super-resolution images. Early GAN-based methods, like SRGAN, demonstrated GANs’ effectiveness in super-resolution tasks [12]. Subsequent improved GAN models, such as ESRGAN [13] and RankSRGAN [14], enhanced the perceptual quality of super-resolved images. In facial super-resolution, GANs have been extensively applied. For instance, a GAN combined with attention mechanisms was proposed for multi-scale facial image super-resolution [15]. This method used deep residual networks and deep neural networks as the generator and discriminator, respectively, and integrated attention modules into the residual blocks of the deep residual network to reconstruct super-resolution facial images that are highly similar to HR images and hard for the discriminator to distinguish. To address the over-smoothing issue of MSE-oriented SR methods and the potential artifacts from GAN-oriented SR methods, Zhang and Ling introduced Supervised Pixel-level GAN (SPGAN) [16]. Unlike traditional unsupervised discriminators, SPGAN’s supervised pixel-level discriminator focuses on whether each generated SR facial image pixel is as realistic as the corresponding pixel in the ground-truth HR facial image. To boost SPGAN’s facial recognition performance, it uses facial identity priors by feeding the input facial image and its features extracted from a pre-trained facial recognition model into the discriminator. This identity-based discriminator focuses more on the intricate texture details that are essential for accurate facial recognition. Despite GANs’ significant progress in facial super-resolution, challenges remain. GAN training can be unstable and prone to issues like mode collapse. Additionally, GAN-generated images may lack pixel-level precision, sometimes producing unwanted artifacts.
Transformers, which originally revolutionized NLP, have been introduced to computer vision. Their self-attention mechanisms effectively capture long-range dependencies in images. Research has combined Transformers with CNNs, leveraging the former’s global modeling and the latter’s local feature extraction. For example, Shi proposed a two-branch Transformer-CNN structure for face super-resolution [17]. It has a Transformer branch and a CNN branch. The Transformer branch extracts multi-scale features and explores local and non-local self-attention. In contrast, the CNN branch uses locally variant attention blocks to enhance network capability by capturing adjacent pixel variations. These branches are fused via modulation blocks to combine global and local strengths. To further integrate CNN and Transformer advantages, Qi introduced ELSFace, an efficient latent style-guided Transformer-CNN framework for face super-resolution [18]. It includes feature preparation and carving stages. The preparation stage generates basic facial contours and textures, guided by latent styles for better facial detail representation. CNN and Transformer streams recursively restore facial textures and contours in parallel in the carving stage. Considering possible high-frequency feature neglect during long-range dependency learning, a high-frequency enhancement block (HFEB) and a Sharp Loss were designed in the Transformer stream for improved perceptual quality. However, these hybrid models often face challenges. Their global computations may overlook high-frequency details, and the increased computational complexity limits efficiency in high-resolution image generation. Additionally, feature alignment between CNN and Transformer features is challenging, requiring complex fusion strategies.
Some studies have tried to incorporate facial priors into deep learning models. For example, Grm proposed the Cascaded Super-Resolution and Identity Prior model for face hallucination (C-SRIP) [19], which consists of a cascaded super-resolution network and a face recognition model. The super-resolution network progressively upscales low-resolution face images, while the recognition model serves as an identity prior that guides the reconstruction of high-resolution images. Ma proposed a deep face super-resolution model that uses iterative collaboration between attention restoration and landmark estimation [20]. It employs two recurrent networks for face image recovery and landmark estimation. In each iteration, the restoration branch leverages prior landmark knowledge to generate higher-quality images, aiding more accurate landmark estimation, guiding better image recovery, and progressively improving both tasks.
As novel generative models, diffusion models have achieved remarkable success in image generation. Compared to GANs, they offer more stable training, higher-quality samples, and greater diversity. Conditional diffusion models introduce condition information, like low-resolution images, into diffusion and reverse diffusion processes to guide image generation. In super-resolution tasks, these models typically take low-resolution images as conditional inputs to learn the low to high-resolution mapping. Saharia introduced SR3 [21], which achieves super-resolution through repeated refinement. SR3’s success highlights the potential of diffusion models in super-resolution, especially at high magnifications. The iterative nature of diffusion models allows for gradual detail refinement, aiding in the recovery of high-frequency information. Implicit diffusion models, which integrate implicit neural representations with diffusion models for continuous super-resolution, operate in a continuous space rather than traditional models’ pixel or discrete latent space. S. Gao proposed the Implicit Diffusion Model (IDM) for high-fidelity continuous image super-resolution [22]. IDM combines implicit neural representations and denoising diffusion models in an end-to-end framework, using implicit neural representations during decoding to learn continuous resolution representations. The proposed method enables the generation of super-resolution images at any resolution without the need for retraining, which is crucial for practical applications where different resolutions may be needed. Traditional diffusion models require hundreds of sampling steps, resulting in low generation efficiency. To speed up diffusion models’ inference, researchers have developed various accelerated sampling techniques, such as DDIM and PNDM. These methods can reduce the number of sampling steps in diffusion models, thus speeding up the inference process. For instance, Yue proposed ResShift, an efficient diffusion model for image super-resolution that significantly reduces the number of diffusion steps through residual shifting [23]. In addition to accelerated sampling techniques, some research focuses on optimizing the network architecture and training methods of diffusion models to boost their efficiency and performance. For example, Xiao introduced EDiffSR, an efficient diffusion probabilistic model for remote sensing image super-resolution [24]. EDiffSR’s EANet uses simplified channel attention and basic gating operations to achieve good noise prediction with a low computational footprint. These acceleration and optimization methods are crucial for promoting the practical application of diffusion models. However, they fail to effectively integrate facial attribute priors, resulting in uncontrolled generation outcomes (such as the inability to adjust age or expression directionally).
Recent studies have demonstrated the effectiveness of attention mechanisms and semantic priors in enhancing visual perception tasks, particularly in time-sensitive applications. For instance, complementary techniques leveraging spatiotemporal attention and adaptive feature fusion have significantly improved real-time processing performance while maintaining semantic consistency [25]. These insights further motivate our design to integrate facial attribute priors with lightweight attention modules, enabling high efficiency and controllability.
Although existing technologies have advanced FISR in various dimensions, its core contradictions still focus on the following aspects:
  • Trade-offs between Detail and Efficiency: Although the complex architectures of CNNs and Transformers can enhance reconstruction quality, their computational costs are excessively high; diffusion models produce excellent generation quality, yet their inference speed struggles to meet real-time requirements.
  • Insufficient utilization of prior knowledge: Most models do not fully integrate the structural attributes of the face (such as the distribution of facial features and expressions) with semantic information (such as age and gender), resulting in generated images lacking semantic consistency.
  • Limited control ability: Existing methods mainly focus on quality improvement and offer no direct control over the generated results, making it difficult to meet personalized needs in practical scenarios (such as adjusting lighting or repairing specific areas).
In response to the aforementioned challenges, this paper proposes a super-resolution framework (FACDIM) that integrates conditional diffusion implicit models with prior knowledge of facial attributes. The core innovations are as follows:
  • Hierarchical Feature Enhancement Architecture: Pre-super-resolution module (PSRM)—for extremely low-resolution inputs (e.g., 16 × 16), it performs preliminary feature enhancement through a lightweight residual network, alleviating the detail recovery pressure on subsequent modules.
  • Dual-stream feature extraction: A lightweight attribute extraction module (FAEM) and a global feature encoder (FGFVE) are designed to capture local attributes (such as eye shape and the curvature of the mouth corners) and the overall facial contour, respectively. Feature fusion is achieved through adaptive group normalization (AdaGN), injecting strong semantic guidance into the diffusion process. Unlike SR3, which extracts mixed features directly through a CNN and thus couples local attributes with the global structure, the dual-stream design of FACDIM (FAEM + FGFVE) decouples local attributes from global features for the first time and avoids information interference through the dynamic fusion of AdaGN, improving data adaptability, flexibility, and scalability. It also enables targeted experiments on self-annotated custom attributes (such as those not provided by CelebA).
  • Efficient Conditional Diffusion Implicit Model: In traditional diffusion models, the Markov chain is replaced with implicit neural representations, reducing the number of sampling steps to 1/5 of the original by modeling in continuous space, significantly enhancing generation efficiency. Incorporating Property-Aware Residual Blocks (FPARB) into U-Net, combined with self-attention mechanisms to optimize the interaction between attribute vectors and noise features, ensures that the generated images retain high-frequency details (such as skin texture) while strictly adhering to prior constraints.
  • User-controllable generation mechanism: This mechanism supports the targeted manipulation of generated results by manually adjusting attribute vectors (such as increasing the “smile” attribute value and decreasing the “age” attribute value), providing flexible tools for scenarios, such as security repair and digital entertainment.
Figure 1 shows the overall architecture diagram of this model.
The contribution of this paper lies not only in the innovation at the technical level but also in the establishment of a new paradigm where high fidelity and efficient controllability coexist: by deeply integrating attribute priors with diffusion models, it resolves the contradiction between the lack of detail and rigid generation in traditional methods, and breaks through the bottleneck of low efficiency in diffusion models, providing reliable technical support for applications, such as medical image restoration and virtual digital humans. The differences between this paper’s model and other existing methods, as well as the details and novelty of this model, are presented in Table 1.

2. Materials and Methods

2.1. Pre-Super-Resolution Module

In the task of 8× super-resolution, the input low-resolution face image is only 16 × 16 pixels, and its limited pixel information makes it extremely difficult to directly extract prior facial features (such as the contours of facial features and texture details). To tackle this issue, this paper introduces a pre-super-resolution module (PSRM) that achieves progressive feature enhancement through two-stage processing: the input image is first super-resolved by a factor of 4, generating a 64 × 64 intermediate result that provides a semantically rich feature foundation for subsequent modules. This design draws inspiration from the iterative up-and-down sampling proposed by Haris [26]. The core idea is to enhance reconstruction robustness under large scale factors through multi-scale feature interaction. PSRM consists of feature extraction, iterative up-and-down sampling, and image reconstruction.
  • Shallow Feature Extraction: A 3 × 3 convolutional layer performs initial feature mapping on the input image, capturing low-level visual features, such as edges and contours.
  • Iterative up-and-down sampling unit: The main body consists of multiple groups of densely connected (Dense Connection) [27] deconvolution layers and convolutional layers, achieving multi-scale feature fusion by alternately performing 4× upsampling and downsampling operations. The deconvolution layer employs a configuration of 8 × 8 kernel size, stride of 4, and padding of 2, resulting in an upsampled feature map resolution of 64 × 64. The convolutional layer performs downsampling with the same parameters, compressing the feature maps to the original resolution and forming a closed-loop feature optimization path.
  • Image reconstruction: A 1 × 1 convolutional layer reduces the dimensionality of the iterated multi-channel features, reducing computational redundancy. The reconstruction module consists of two deconvolution layers (4× upsampling) and a terminal 3 × 3 convolution layer, ultimately outputting a 64 × 64 pre-super-resolution image.
Figure 2 shows a structural diagram of the pre-super-resolution module.
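To make the data flow concrete, the following is a minimal PyTorch sketch of a PSRM-style module under the configuration stated above (3 × 3 head convolution; 8 × 8 deconvolution/convolution with stride 4 and padding 2; dense reuse of upsampled features; 1 × 1 fusion). The channel width, the number of up–down units, and the exact reconstruction head are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PSRM(nn.Module):
    """Pre-super-resolution module sketch: 16x16 -> 64x64 (4x) via iterative up/down sampling.
    Channel width, number of up/down units, and the fusion head are illustrative assumptions."""
    def __init__(self, channels=64, num_units=3):
        super().__init__()
        # shallow feature extraction: a single 3x3 convolution
        self.head = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        # iterative up/down sampling: 8x8 kernel, stride 4, padding 2 -> exact 4x scale change
        self.ups = nn.ModuleList([
            nn.ConvTranspose2d(channels, channels, kernel_size=8, stride=4, padding=2)
            for _ in range(num_units)])
        self.downs = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=8, stride=4, padding=2)
            for _ in range(num_units)])
        # 1x1 convolution fuses the densely reused upsampled features
        self.fuse = nn.Conv2d(channels * num_units, channels, kernel_size=1)
        self.tail = nn.Conv2d(channels, 3, kernel_size=3, padding=1)

    def forward(self, x):                       # x: (B, 3, 16, 16)
        feat = self.head(x)
        hr_feats = []
        for up, down in zip(self.ups, self.downs):
            hr = up(feat)                       # 16x16 -> 64x64
            hr_feats.append(hr)                 # dense reuse of every upsampled feature map
            feat = feat + down(hr)              # back-projection: refine the low-resolution features
        out = self.fuse(torch.cat(hr_feats, dim=1))
        return self.tail(out)                   # (B, 3, 64, 64) pre-super-resolved image

# Example: PSRM()(torch.randn(1, 3, 16, 16)).shape -> torch.Size([1, 3, 64, 64])
```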
The design of PSRM is optimized in three aspects for the characteristics of ultra-low-resolution inputs:
  • Progressive Feature Enhancement: The iterative up-and-down sampling strategy gradually refines local features and suppresses artifacts, avoiding the detail blurring caused by direct large-scale upsampling.
  • Dense connectivity for information reuse: Cross-layer feature fusion enhances gradient flow, ensuring that deep networks can effectively utilize shallow features.
  • Lightweight computation: 1 × 1 convolutions and parameter sharing mechanisms reduce the module’s complexity, reserving computational resources for subsequent high-magnification super-resolution.

2.2. The Facial Attribute Extraction Module

Compared with ordinary images, face images have their particularities. Utilizing the prior features of faces to guide the network in capturing relevant face information can effectively improve the effect of face super-resolution reconstruction. When performing face image super-resolution tasks, integrating prior knowledge of facial attributes is an effective strategy that can significantly enhance the quality and authenticity of the generated images. By introducing prior knowledge of facial attributes (such as gender, age, expression, posture, etc.), the model can be better guided to generate face images that conform to specific attributes.
The Facial Attribute Extraction Module (FAEM) is a lightweight facial attribute extraction module based on an improved Residual Network (ResNet) architecture. The module takes as input the 64 × 64 × 3 image generated by the pre-super-resolution module (PSRM). The reference ResNet18 backbone contains approximately 11 million parameters, which may lead to over-parameterization for 64 × 64 × 3 inputs and is particularly prone to overfitting under limited training data. When trained solely on the CelebA dataset, the model risks overfitting to the training set, resulting in insufficient generalization capability. To address this, we selectively remove a subset of residual blocks, retaining only Layer1 and Layer2 to construct a lightweight ResNet variant. This streamlined architecture reduces computational complexity while preserving essential feature extraction capabilities, ensuring robust performance in resource-constrained scenarios. Figure 3 shows a structural diagram of the facial attribute extraction module.
First, the stem is adjusted to fit the 64 × 64 input size: the convolution kernel is changed from 7 × 7 to 3 × 3 and the stride from 2 to 1 to prevent premature downsampling, and the first MaxPool layer is removed to avoid an excessively low feature-map resolution, making the network suitable for small-image feature extraction. The reduced parameter count lowers the overfitting risk while keeping key feature extraction ability. The literature [28] indicates that adding BN layers in super-resolution networks can slow training, cause instability, and create artifacts; to mitigate this, a pre-activation module (BN and LeakyReLU) is applied after the first convolution, which facilitates the optimization of deeper networks [29] and helps to minimize overfitting. The literature [30] proposes SENet, which fuses channel-wise information via learned weights to suppress irrelevant features; SE modules are added to the residual blocks to boost responses in key channels, enhancing the model’s focus on attribute-related regions (e.g., the mouth for “smile”, the eyes for “glasses”).
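The following is a minimal PyTorch sketch of such a lightweight, SE-augmented residual extractor. The channel widths, the SE reduction ratio, the number of attributes, and the exact composition of the retained layers are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention (reduction ratio is an assumption)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))              # global average pooling over H, W
        return x * w.view(x.size(0), -1, 1, 1)       # re-weight channels

class SEResBlock(nn.Module):
    """Pre-activation residual block (BN + LeakyReLU before conv) with an SE module."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(in_ch, out_ch, 3, stride, 1),
            nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, 1, 1),
            SEBlock(out_ch))
        self.skip = (nn.Conv2d(in_ch, out_ch, 1, stride)
                     if (stride != 1 or in_ch != out_ch) else nn.Identity())

    def forward(self, x):
        return self.body(x) + self.skip(x)

class FAEM(nn.Module):
    """Lightweight attribute extractor sketch: 3x3 stride-1 stem, no max-pool,
    only two residual stages retained."""
    def __init__(self, num_attrs=8):                 # attribute count is an assumption
        super().__init__()
        self.stem = nn.Conv2d(3, 64, 3, 1, 1)
        self.layer1 = SEResBlock(64, 64)
        self.layer2 = SEResBlock(64, 128, stride=2)  # 64x64 -> 32x32
        self.head = nn.Linear(128, num_attrs)

    def forward(self, x):                            # x: (B, 3, 64, 64) from PSRM
        f = self.layer2(self.layer1(self.stem(x)))
        return torch.sigmoid(self.head(f.mean(dim=(2, 3))))  # attribute vector in [0, 1]
```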

2.3. Facial Global Feature Vector Encoder

This section introduces an enhanced low-resolution image encoder (FGFVE) for super-resolution tasks driven by conditional diffusion models. Built on a multi-stage CNN architecture with spatial attention mechanisms, the encoder extracts robust global and local features from 16 × 16 facial images, focusing on key areas, like the eyes, nose, and mouth, while maintaining lightweight computation and providing high-level semantic conditional information for the subsequent diffusion model. Figure 4 shows a structural diagram of the facial global feature vector encoder.
The encoder consists of three convolutional modules and a spatial attention module. Each convolutional module contains two 3 × 3 convolutions (with strides of 1 and 2), BatchNorm layers and GELU activation. GELU, smoother than ReLU, is chosen for low-resolution feature learning. The first convolutional module outputs an 8 × 8 × 64 feature map, capturing local details. The second produces a 4 × 4 × 128 feature map, extracting global structural information. The third module uses two 3 × 3 convolutional layers (256 channels) to maintain a 4 × 4 × 256 spatial resolution and incorporates a spatial attention module. This module, composed of a 3 × 3 convolutional layer (1 output channel) and a Sigmoid activation function, generates spatial attention weight maps. These weights are multiplied point-wise with the original feature maps and enhanced via residual connections to boost responses in key regions. The spatial attention weights are calculated as follows:
$$A = \sigma(W_a * F)$$
where $F \in \mathbb{R}^{H \times W \times C}$ is the input feature, $W_a$ is the 3 × 3 convolution kernel, and $\sigma$ is the Sigmoid activation function. The weighted features are as follows:
$$F' = F \otimes A + F$$
The module strengthens feature responses in key areas (such as eyes, nose, mouth) via adaptive weights and suppresses background interference. It outputs a 4 × 4 × 256 feature map. Finally, global average pooling compresses this into a 256-dimensional global vector, which is then mapped to the final feature representation through a fully connected layer.
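A minimal PyTorch sketch of the encoder described above (three convolutional stages with BatchNorm and GELU, spatial attention as in the two equations, and global average pooling to a 256-dimensional vector). Details not stated in the text, such as the exact placement of normalization layers, are assumptions.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """One encoder stage: two 3x3 convs (stride 1 then 2) with BatchNorm and GELU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, 1, 1), nn.BatchNorm2d(out_ch), nn.GELU(),
        nn.Conv2d(out_ch, out_ch, 3, 2, 1), nn.BatchNorm2d(out_ch), nn.GELU())

class SpatialAttention(nn.Module):
    """A = sigmoid(W_a * F); F' = F (x) A + F, as in the two equations above."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, f):
        a = torch.sigmoid(self.conv(f))          # (B, 1, H, W) attention weight map
        return f * a + f                         # weighted features with a residual connection

class FGFVE(nn.Module):
    """Global feature encoder sketch for 16x16 LR faces; outputs a 256-d vector."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.stage1 = conv_block(3, 64)          # 16x16 -> 8x8x64 (local details)
        self.stage2 = conv_block(64, 128)        # 8x8  -> 4x4x128 (global structure)
        self.stage3 = nn.Sequential(             # keeps the 4x4x256 spatial resolution
            nn.Conv2d(128, 256, 3, 1, 1), nn.BatchNorm2d(256), nn.GELU(),
            nn.Conv2d(256, 256, 3, 1, 1), nn.BatchNorm2d(256), nn.GELU())
        self.attn = SpatialAttention(256)
        self.fc = nn.Linear(256, out_dim)

    def forward(self, x):                        # x: (B, 3, 16, 16)
        f = self.attn(self.stage3(self.stage2(self.stage1(x))))
        return self.fc(f.mean(dim=(2, 3)))       # global average pooling -> 256-d vector
```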

2.4. Conditional Diffusion Implicit Model Combined with Facial Attributes

As both the forward and reverse processes in DDPM have the Markov property, the number of time steps in the reverse generation process must match that in the diffusion process. This results in a gradual process of sample generation. Denoising Diffusion Implicit Models (DDIM) [31] are introduced to address this. DDIM shares the same training objective as DDPM but is not restricted to a Markov chain. This enables DDIM to speed up the generation process by utilizing fewer sampling steps. Moreover, DDIM’s sample generation from random noise is a deterministic process without added random noise. Figure 5 illustrates the Diffusion Implicit Model.
This study presents FACDIM (Facial Attributes Conditional Diffusion Implicit Model), a conditional diffusion implicit model that integrates facial attributes as prior knowledge for face image super-resolution. The model takes 16 × 16 low-resolution images, facial attribute vector α , and global feature vectors β as inputs, and generates 128 × 128 high-resolution face images. It uses Adaptive Group Normalization (AdaGN) to combine facial attribute and global feature vectors as prior knowledge [32] and employs skip connections to concatenate low-resolution features and attribute vectors at multiple levels. FACDIM uses a modified U-Net with Feature Prior Residual Blocks (FPARB), replacing the original convolutional layers, and injects the facial global feature vector β and facial attribute vector α from FGFVE and FAEM into each FPARB block via AdaGN. The encoder processes 16 × 16 images and outputs a global feature vector. The facial attribute vector from the prior attribute extraction module is mapped to 64 dimensions via a fully connected layer.
$$\alpha' = W_e \alpha \quad (W_e \in \mathbb{R}^{7 \times 64})$$
The low-resolution feature and the attribute embedding are concatenated to generate the 320-dimensional condition:
$$c = [\alpha'; \beta] \in \mathbb{R}^{320}$$
The joint condition $c$ and the time step embedding $t_{emb}$ are concatenated, and an MLP generates the scaling factor $\gamma$ and offset $\delta$:
$$\gamma, \delta = \mathrm{MLP}([c; t_{emb}]), \quad \gamma, \delta \in \mathbb{R}^{C}$$
where $C$ is the number of channels in the current feature map. Finally, feature adjustment is performed as an affine transformation of the normalized features within the residual block:
$$F' = \mathrm{GroupNorm}(F) \odot (1 + \gamma) + \delta$$
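The following sketch illustrates in PyTorch how the joint condition and the time embedding can drive an AdaGN layer as in the equations above. The MLP depth, the activation, the group count, and the attribute count assumed for the embedding layer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdaGN(nn.Module):
    """Adaptive group normalization sketch: scale/shift GroupNorm(F) with (gamma, delta)
    produced from the joint condition c and the time embedding."""
    def __init__(self, channels, cond_dim=320, time_dim=64, groups=32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels)
        self.mlp = nn.Sequential(
            nn.SiLU(),
            nn.Linear(cond_dim + time_dim, 2 * channels))   # -> [gamma; delta]

    def forward(self, f, cond, t_emb):
        gamma, delta = self.mlp(torch.cat([cond, t_emb], dim=1)).chunk(2, dim=1)
        gamma = gamma[:, :, None, None]                     # broadcast over H, W
        delta = delta[:, :, None, None]
        return self.norm(f) * (1.0 + gamma) + delta

# Building the joint condition c = [alpha'; beta] (dimensions follow the text; the
# attribute count of 8 is an assumption):
attr_embed = nn.Linear(8, 64)        # maps the attribute vector to 64 dimensions
alpha = torch.rand(4, 8)             # attribute vector from FAEM
beta = torch.randn(4, 256)           # global feature vector from FGFVE
cond = torch.cat([attr_embed(alpha), beta], dim=1)          # (4, 320)

adagn = AdaGN(channels=128)
feat = torch.randn(4, 128, 32, 32)
t_emb = torch.randn(4, 64)           # sinusoidal time embedding (see next sketch)
out = adagn(feat, cond, t_emb)       # (4, 128, 32, 32)
```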
In principle, conditional diffusion implicit models would require $T$ separate temporal prediction models. Thus, a time embedding component is added to encode positional information into the network, enabling the training of a single shared U-Net model. Following the literature [33], sinusoidal position encoding is adopted for the time embedding, with the formula as follows:
$$p_{k,2i} = \sin\!\left(\frac{k}{10000^{2i/d}}\right), \qquad p_{k,2i+1} = \cos\!\left(\frac{k}{10000^{2i/d}}\right)$$
In the formula, $p_{k,2i}$ and $p_{k,2i+1}$ are the $2i$-th and $(2i+1)$-th components of the encoding vector at position $k$, respectively, and $d$ is the dimension of the vector.
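A short sketch of the sinusoidal time embedding; it follows the common concatenated sin/cos implementation (rather than strict interleaving of even and odd components), which is an implementation assumption.

```python
import math
import torch

def time_embedding(timesteps, dim):
    """Sinusoidal position encoding of the time step:
    sin(k / 10000^(2i/d)) and cos(k / 10000^(2i/d)), concatenated."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = timesteps.float()[:, None] * freqs[None, :]            # (B, dim/2)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=1)   # (B, dim)

t_emb = time_embedding(torch.tensor([10, 500, 999]), dim=64)      # (3, 64)
```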
U-Net is suitable for diffusion models due to its ability to produce outputs with the same dimensionality as inputs. In the FACDIM model’s modified U-Net structure, the original convolutional layers are replaced with Feature Prior Residual Blocks as shown in Figure 6. Each major layer contains several of these sub-modules, and deeper layers are followed by self-attention modules. Unlike the original model, each sub-module’s output in the downsampling part of a major layer is additionally fed into its symmetric upsampling sub-module, reducing information loss during downsampling.
The encoder comprises stages with three FPARB modules and one self-attention module. The FPARB modules incorporate facial attribute and global feature vectors to enhance feature learning and capture global information and long-range dependencies. Each FPARB module consists of two 3 × 3 convolutions and a 2 × 2 average pooling layer, with downsampling modules in each stage to reduce spatial size. Conversely, the decoder progressively restores the compressed features from the encoder. Figure 7 shows a structural diagram of the Improved U-net structure.

2.4.1. Forward Process

In the forward process of DDPM, the optimization objective depends solely on the marginal distribution $q(x_t \mid x_0)$ and does not directly act on the joint distribution $q(x_{1:T} \mid x_0)$. Therefore, any inference distribution that satisfies the characteristics of the diffusion process can be used as the inference distribution for DDPM, and these inference processes do not have to be Markov chains. Obtaining the optimization objective of DDPM also requires the distribution $q(x_{t-1} \mid x_t, x_0)$, which depends on the Markov property of the forward process. If the dependence on the forward process is to be removed, then the distribution $q(x_{t-1} \mid x_t, x_0)$ must be defined directly. Therefore, a new distribution is considered. In image super-resolution reconstruction based on conditional diffusion implicit models, for a given high-resolution image $y$, the forward diffusion process is defined as follows:
$$q(y_{1:T} \mid y) = q(y_T \mid y) \prod_{t=2}^{T} q(y_{t-1} \mid y_t, y)$$
Here, $y_t$ represents the noisy image. The following conditions need to be met simultaneously:
$$q(y_T \mid y) = \mathcal{N}\!\left(y_T;\ \sqrt{\bar{a}_T}\, y,\ (1-\bar{a}_T) I\right)$$
In the formula, $\bar{a}_t$ represents the noise variance and is a hyperparameter related to the training step $t$, with $\bar{a}_t = \prod_{i=1}^{t} a_i$. For all $t \ge 2$, the following condition must be satisfied:
$$q(y_{t-1} \mid y_t, y) = \mathcal{N}\!\left(y_{t-1};\ \sqrt{\bar{a}_{t-1}}\, y + \sqrt{1-\bar{a}_{t-1}-\sigma_t^2}\cdot\frac{y_t - \sqrt{\bar{a}_t}\, y}{\sqrt{1-\bar{a}_t}},\ \sigma_t^2 I\right)$$
The variance $\sigma_t^2$ is a real number, and different values yield different distributions, so $q(y_{1:T} \mid y)$ is actually a family of inference distributions. The mean of the $q(y_{t-1} \mid y_t, y)$ distribution is defined as a combination function of $y$ and $y_t$ because, starting from $q(y_T \mid y)$, it can be proven by mathematical induction that this form satisfies the following condition for all $t$:
$$q(y_t \mid y) = \mathcal{N}\!\left(y_t;\ \sqrt{\bar{a}_t}\, y,\ (1-\bar{a}_t) I\right)$$
That is, $q(y_t \mid y)$ remains normally distributed at any time step. Equation (10) can then be derived using Bayes’ theorem:
$$q(y_t \mid y_{t-1}, y) = \frac{q(y_{t-1} \mid y_t, y)\, q(y_t \mid y)}{q(y_{t-1} \mid y)}$$
Unlike the diffusion process in DDPM, the forward process in FACDIM is not a Markov process, as each $y_t$ depends on both $y_{t-1}$ and $y$ simultaneously.

2.4.2. Objective Function

The denoising model $f_\theta(c, x, y_t, \bar{a}_t)$ takes the low-resolution image $x$ and the noisy image $y_t$ as inputs, with the LR image $x$ and the concatenated condition vector $c$ serving as guiding conditions. The noisy image $y_t$, obtained from Equation (11), is defined as follows:
$$y_t = \sqrt{\bar{a}_t}\, y + \sqrt{1-\bar{a}_t}\, \varepsilon$$
Here, $\varepsilon$ is a noise vector sampled from the standard normal distribution $\mathcal{N}(0, I)$, and the model aims to recover the noise-free target image $y$ from $y_t$. In addition to the low-resolution image $x$ and the noisy image $y_t$, the denoising model $f_\theta$ takes the noise variance $\bar{a}_t$ as input, enabling it to be aware of the noise level. The training objective of the model is to predict the noise vector $\varepsilon$; therefore, the objective function of the denoising model $f_\theta$ is as follows:
$$\mathbb{E}_{(c, x, y)}\, \mathbb{E}_{\varepsilon, \bar{a}_t} \left\| f_\theta\!\left(c, x, \sqrt{\bar{a}_t}\, y + \sqrt{1-\bar{a}_t}\, \varepsilon,\ \bar{a}_t\right) - \varepsilon \right\|_p^p$$
Here, $(c, x, y)$ is sampled from the training set, $p \in \{1, 2\}$ denotes the use of the L1 or L2 norm, and $\bar{a}_t$ is sampled from the distribution $p(\bar{a}_t)$.
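A minimal sketch of one training step for this objective. The argument order of the denoising model and the uniform sampling of the noise level from the schedule are assumptions.

```python
import torch
import torch.nn.functional as F

def diffusion_training_loss(model, c, x_lr, y_hr, alpha_bar, p=1):
    """One training step of the noise-prediction objective above.
    `model` is assumed to take (condition, LR image, noisy HR image, noise level)
    and predict the added noise epsilon; `alpha_bar` is the 1-D schedule of cumulative
    noise variances."""
    b = y_hr.size(0)
    t = torch.randint(0, alpha_bar.size(0), (b,), device=y_hr.device)   # sample a noise level
    a_bar = alpha_bar[t].view(b, 1, 1, 1)
    eps = torch.randn_like(y_hr)
    y_t = a_bar.sqrt() * y_hr + (1.0 - a_bar).sqrt() * eps              # noisy image y_t
    eps_pred = model(c, x_lr, y_t, a_bar)
    if p == 1:
        return F.l1_loss(eps_pred, eps)                                  # L1 variant
    return F.mse_loss(eps_pred, eps)                                     # L2 variant
```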

2.4.3. Generation Process

During the generation process, FACDIM only alters the Markovian nature of the sampling process; the number of steps $T$ in the noise-addition process remains unchanged, but the number of sampling steps can be significantly less than $T$. Based on the results from DDPM, the inference distribution $q(x_{t-1} \mid x_t, x_0)$ is a Gaussian distribution whose mean is a linear function of $x_0$ and $x_t$ and whose variance is a function of the time step. Figure 8 shows the super-resolution U-Net framework (16 × 16 → 128 × 128). Assuming that the inference distribution of FACDIM is also Gaussian, with a mean that is a linear function of the high-resolution image $y$ and the noisy image $y_t$ and a variance that is a function of the time step, we let
$$q(y_{t-1} \mid y_t, y) = \mathcal{N}\!\left(y_{t-1};\ \lambda_t y + k_t y_t,\ \sigma_t^2 I\right)$$
This distribution does not rely on the first-order Markov assumption; $\lambda_t$, $k_t$, and $\sigma_t$ are coefficients to be determined, and in theory there are infinitely many solutions. To avoid retraining the model, the distributions $q(y_{t-1} \mid y_t)$ and $q(y_{t-1} \mid y)$ are left unaltered, so it is only necessary to find a set of solutions for $\lambda_t$, $k_t$, and $\sigma_t$ such that the inference distribution of FACDIM satisfies the aforementioned distribution (15).
In Appendix A, we derive the sampling distribution of FACDIM as follows:
$$q(y_{t-1} \mid y_t, y) = \mathcal{N}\!\left(y_{t-1};\ \sqrt{\bar{a}_{t-1}}\, y + \sqrt{1-\bar{a}_{t-1}-\sigma_t^2}\cdot\frac{y_t - \sqrt{\bar{a}_t}\, y}{\sqrt{1-\bar{a}_t}},\ \sigma_t^2 I\right)$$
It can be seen from Equation (16) that a family of solutions can be found such that Equation (A2) holds and $q(y_t \mid y) = \mathcal{N}\!\left(y_t;\ \sqrt{\bar{a}_t}\, y,\ (1-\bar{a}_t) I\right)$ is satisfied. Different $\sigma_t$ correspond to different generation processes. Since the forward process is unchanged, the noise prediction model trained for DDPM can be used directly. The sampling process is performed as follows:
$$y_{t-1} = \sqrt{\bar{a}_{t-1}}\, y + \sqrt{1-\bar{a}_{t-1}-\sigma_t^2}\cdot\frac{y_t - \sqrt{\bar{a}_t}\, y}{\sqrt{1-\bar{a}_t}} + \sigma_t \varepsilon_t$$
The denoising model $f_\theta(c, x, y_t, \bar{a}_t)$ is trained to estimate the added noise $\varepsilon$. For any noisy image $y_t$, the terms in Equation (13) can be rearranged to estimate $y$:
$$y = \frac{1}{\sqrt{\bar{a}_t}}\left(y_t - \sqrt{1-\bar{a}_t}\, f_\theta(c, x, y_t, \bar{a}_t)\right)$$
Substituting Formula (21) into Formula (20) and rearranging gives the following:
$$y_{t-1} = \sqrt{\bar{a}_{t-1}}\left(\frac{y_t - \sqrt{1-\bar{a}_t}\, f_\theta(c, x, y_t, \bar{a}_t)}{\sqrt{\bar{a}_t}}\right) + \sqrt{1-\bar{a}_{t-1}-\sigma_t^2}\, f_\theta(c, x, y_t, \bar{a}_t) + \sigma_t \varepsilon_t$$
Here, the generation process is divided into three parts: the first is generated from the predicted $y$, the second points towards $y_t$, and the third is random noise (where $\varepsilon_t$ is noise unrelated to $y_t$). Since the literature [31] proves that the value of $\sigma_t$ does not affect the validity of Equation (9), $\sigma_t$ is further defined as follows:
$$\sigma_t = \eta \sqrt{\frac{1-\bar{a}_{t-1}}{1-\bar{a}_t}} \sqrt{1-\frac{\bar{a}_t}{\bar{a}_{t-1}}}$$
When η = 1 , the forward process becomes a Markovian process, and the generation process is the same as DDPM.
When η = 0 , the random noise term added during the sampling process becomes 0, and the generation process is noise-free, making the model in this case FACDIM. Once the initial random noise is determined, the sample generation of FACDIM becomes a deterministic process.
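The deterministic generation step can be sketched as follows; setting eta to zero gives the implicit (noise-free) process described above, while eta equal to one recovers DDPM-style stochastic sampling. The argument order of the denoising model is an assumption.

```python
import torch

@torch.no_grad()
def implicit_sampling_step(model, c, x_lr, y_t, a_bar_t, a_bar_prev, eta=0.0):
    """One reverse step of the sampling equations in Section 2.4.3."""
    eps_pred = model(c, x_lr, y_t, a_bar_t)
    # predicted clean image from the noise estimate
    y_hat = (y_t - (1.0 - a_bar_t).sqrt() * eps_pred) / a_bar_t.sqrt()
    # noise scale sigma_t; eta = 0 makes the step deterministic
    sigma = eta * ((1 - a_bar_prev) / (1 - a_bar_t)).sqrt() * (1 - a_bar_t / a_bar_prev).sqrt()
    # direction pointing towards y_t, plus optional random noise
    dir_yt = (1.0 - a_bar_prev - sigma ** 2).sqrt() * eps_pred
    noise = sigma * torch.randn_like(y_t) if eta > 0 else 0.0
    return a_bar_prev.sqrt() * y_hat + dir_yt + noise
```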

2.4.4. Accelerate Generation

In deriving the sampling distribution $p_\theta(x_{t-1} \mid x_t, x_0)$, DDPM employs the Markov assumption $p_\theta(x_{t-1} \mid x_t, x_0) = p_\theta(x_{t-1} \mid x_t)$, which necessitates sampling sequentially from $t = T$ down to $t = 1$. DDPM often requires a large number of time steps to achieve high-quality results, hence the need for $T$ sampling steps. In FACDIM, the original generation sequence is $L = [T, T-1, \ldots, 1]$ with length $\dim(L) = T$. A sub-sequence $\tau = [\tau_S, \tau_{S-1}, \ldots, \tau_1]$ of length $\dim(\tau) = S$, $S \le T$, can be built from the generation sequence $L$. During generation, sampling is performed according to the constructed sequence $\tau$; the recursive sampling formula becomes the following:
$$y_{\tau_{s-1}} = \sqrt{\bar{\alpha}_{\tau_{s-1}}}\left(\frac{y_{\tau_s} - \sqrt{1-\bar{\alpha}_{\tau_s}}\, f_\theta(x, y_{\tau_s}, \bar{\alpha}_{\tau_s})}{\sqrt{\bar{\alpha}_{\tau_s}}}\right) + \sqrt{1-\bar{\alpha}_{\tau_{s-1}}}\, f_\theta(x, y_{\tau_s}, \bar{\alpha}_{\tau_s})$$
This enables the generation process to directly operate on the sub-sequence. In training, the full sequence is used, while in generation, the sub-sequence is employed. As long as the sub-sequence is small and the performance remains acceptable, this approach can accelerate the process.
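Building on `implicit_sampling_step` from the previous sketch, the sub-sequence sampling loop might look as follows. The sub-sequence length and its uniform spacing are illustrative assumptions.

```python
import torch

@torch.no_grad()
def accelerated_sampling(model, c, x_lr, alpha_bar, shape, num_steps=200, device="cuda"):
    """Generate with a sub-sequence tau of the full T-step schedule (Section 2.4.4),
    reusing implicit_sampling_step from the previous sketch with eta = 0."""
    T = alpha_bar.size(0)
    tau = torch.linspace(T - 1, 0, num_steps).long()          # descending sub-sequence of time steps
    y = torch.randn(shape, device=device)                     # start from pure noise
    for i in range(len(tau) - 1):
        a_bar_t = alpha_bar[tau[i]].view(1, 1, 1, 1)
        a_bar_prev = alpha_bar[tau[i + 1]].view(1, 1, 1, 1)
        y = implicit_sampling_step(model, c, x_lr, y, a_bar_t, a_bar_prev, eta=0.0)
    return y
```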
Each downsampling phase in the U-net structure in FACDIM is shown in Figure 9:
Each upsampling phase in the U-net structure in FACDIM is shown in Figure 10:
The up- and downsampling phases do not adopt the traditional ReLU or Sigmoid activation functions but instead use the Swish activation function:
$$\mathrm{swish}(x) = x \cdot \mathrm{sigmoid}(x)$$
Swish is a smooth non-linear function that helps improve the training process and performance of deep learning models, allowing the network to mine detailed features and improving generalization. It also alleviates the vanishing-gradient problem that is common when training diffusion models, whose regression loss functions (L1 or L2 mean squared error) tend to produce smaller loss values and gradients than the cross-entropy losses used for classification.

3. Discussion and Results

3.1. Experimental Configuration

3.1.1. Experimental Dataset

This study selected three datasets for systematic experimental validation: the FFHQ (Flickr-Faces-High-Quality) dataset [34], the CelebA (CelebFaces Attributes) dataset [35], and the MAFA (MAsked FAces) dataset [36]. The experimental design is divided into two parts: conventional scenario experiments and challenging scenario experiments, each employing adapted datasets tailored to distinct experimental objectives.
In the conventional scenario experiments, two widely recognized high-quality face datasets, FFHQ and CelebA, were utilized to validate the model’s fine-grained semantic controllability and detailed feature learning capability. Among these, the FFHQ dataset is a commonly used high-quality dataset in computer vision and machine learning for tasks, such as face recognition, image generation, and image synthesis. The CelebA dataset contains diverse celebrity faces with varying attributes, expressions, poses, and backgrounds, making it suitable for training models that require generalization across diverse facial features.
In the challenging scenario experiments, additional experiments were conducted on the MAFA dataset to verify the robustness of our model under demanding real-world conditions. Compared to CelebA and FFHQ, the MAFA dataset comprises 33,811 real-world facial images with significant occlusion features (e.g., masks, sunglasses, hand occlusions) and extreme illumination variations. Its data characteristics better simulate partial occlusions and lighting changes commonly encountered in practical facial image scenarios.
For the conventional scenario experiments, the CelebA dataset was employed as the training set, while the FFHQ dataset served as the validation set. During the training phase, 160,000 images from CelebA were selected. In the challenging scenario experiments, 25,000 occluded images from the MAFA dataset were used for training, with the remaining 5811 images reserved for testing. The low-resolution (LR) images were obtained by center cropping the original images to 178 × 178 pixels and downsampling to 16 × 16 pixels, while the high-resolution (HR) images were cropped from the corresponding regions and downsampled to 128 × 128 pixels. Online random rotation (rotation angles of 90°, 180°, and 270°) and online random horizontal flipping were used as data augmentation methods. The low-resolution images were subjected to bicubic interpolation degradation and then concatenated with the noisy high-resolution images before being fed into our diffusion model. The attribute values in the dataset were converted to binary (1/0), where attribute vector values from 0 to 1 represent increasing enhancement of the facial attribute; the eight facial attributes used in this paper are as follows: Bags_Under_Eyes, Young, Smiling, Chubby, Big_Lips, Blurry, Attractive, and Male, chosen based on their semantic relevance to facial structure and perceptual quality in super-resolution tasks. Attributes like Smiling, Big_Lips, and Bags_Under_Eyes directly influence facial geometry and texture, which are critical for recovering high-frequency details; Blurry, Attractive, Young, and Male act as global semantic constraints, guiding the model to prioritize clarity and aesthetic consistency. These attributes are widely annotated in the CelebA dataset and align with prior studies on face super-resolution (e.g., [16,19,35]), ensuring reproducibility and fair comparison with existing methods. Additionally, during preliminary experiments, we observed that these attributes provided sufficient fine-grained control over key facial regions while maintaining computational efficiency. Due to constraints on computational resources and time, the study selects key attributes to verify the method’s effectiveness, avoiding the increased model complexity that too many attributes would bring. Expanding the attribute set further would require balancing model complexity and training stability, which we prioritized in this initial framework.

3.1.2. Experimental Parameters

Diffusion Process: A linear noise schedule with 1000 steps is employed (from $\beta_{\min}$ to $\beta_{\max}$), using the AdamW optimizer, a variant of the Adam optimizer. The primary distinction lies in the weight decay management: Adam combines the benefits of momentum and adaptive learning rates, whereas AdamW handles weight decay more effectively to prevent overfitting. The first moment decay rate is $\beta_1 = 0.9$ and the second moment decay rate is $\beta_2 = 0.999$, with an initial learning rate of 0.0002. A cosine annealing schedule gradually reduces the learning rate following the shape of a cosine function throughout training; this strategy maintains a larger learning rate initially to accelerate convergence and decreases it later for fine-tuning, helping the model escape local optima and find better global solutions. The experimental setup includes the Windows 11 operating system, Python 3.7.10, and PyTorch 1.11.0 + CUDA 11.7 as the framework; experiments are conducted on a single NVIDIA A100 GPU with a batch size of 64, mixed precision training (FP16), 500 training epochs, and a gradient clipping threshold of 1.0.

3.1.3. Loss Function

Considering that facial structures contain rich texture information while also requiring continuity between local areas, this experiment introduces an edge smoothness loss on top of the original reverse-diffusion objective. By constraining the gradient changes between pixels and exploiting the inherent smoothness of facial images, the edge smoothness loss effectively reduces the noise and artifacts that may arise during generation, enhancing the structural consistency and the authenticity of texture details in the generated images and thus making the results clearer and more realistic. The loss function combines the mean squared error (MSE) of noise prediction with the edge smoothness loss, where the MSE loss is expressed as follows:
$$L_{MSE} = \mathbb{E}_{t, x_0, \varepsilon}\left\|\varepsilon - \varepsilon_\theta(x_t, t, c)\right\|_2^2$$
The edge smoothness loss constrains the gradient consistency between the generated image $\hat{y}_0$ and the real high-resolution image $y_{HR}$ in edge regions. Its mathematical expression is as follows:
$$L_{smooth} = \sum_{i,j}\left\|\nabla \hat{y}_0(i,j) - \nabla y_{HR}(i,j)\right\|_1$$
where the gradient operator $\nabla$ can be implemented using the Sobel operator:
$$\nabla_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \qquad \nabla_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}$$
In the actual computation, the gradient is calculated separately for each of the RGB channels and then averaged. During training of the diffusion model, the current estimated generated image $\hat{y}_0$ is recovered from the noisy image $y_t$ to compute the gradient loss:
$$\hat{y}_0 = \frac{1}{\sqrt{\bar{a}_t}}\left(y_t - \sqrt{1-\bar{a}_t}\, \varepsilon_\theta(y_t, t, c)\right)$$
The boundary smoothing loss can then be expressed as follows:
$$L_{smooth} = \mathbb{E}_t\left\|\nabla \hat{y}_0 - \nabla y_{HR}\right\|_1$$
Combining the noise-prediction MSE loss and the boundary smoothing loss, the total loss is as follows:
$$L_{total} = L_{MSE} + \lambda L_{smooth}$$
where $\lambda$ is the smoothing loss weight, with an initial value of 0.3.
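A sketch of the combined loss with Sobel-based gradients, assuming the estimated clean image has already been recovered from the noisy image as in the equation above; the per-channel gradient handling follows the description in the text.

```python
import torch
import torch.nn.functional as F

# Sobel kernels for horizontal and vertical gradients (applied per RGB channel)
SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = torch.tensor([[-1., -2., -1.], [0., 0., 0.], [1., 2., 1.]]).view(1, 1, 3, 3)

def sobel_gradients(img):
    """Per-channel Sobel gradient magnitudes, averaged over the RGB channels."""
    c = img.size(1)
    gx = F.conv2d(img, SOBEL_X.to(img).repeat(c, 1, 1, 1), padding=1, groups=c)
    gy = F.conv2d(img, SOBEL_Y.to(img).repeat(c, 1, 1, 1), padding=1, groups=c)
    return (gx.abs() + gy.abs()).mean(dim=1, keepdim=True)

def total_loss(eps_pred, eps, y0_est, y_hr, lam=0.3):
    """L_total = L_MSE + lambda * L_smooth with the initial weight lambda = 0.3."""
    l_mse = F.mse_loss(eps_pred, eps)                         # noise prediction loss
    l_smooth = F.l1_loss(sobel_gradients(y0_est), sobel_gradients(y_hr))
    return l_mse + lam * l_smooth
```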

3.1.4. Evaluation Indicators

Peak Signal-to-Noise Ratio (PSNR) is an objective standard for measuring image quality. It evaluates image quality by calculating the mean squared error between the original image and the processed image, expressed in decibels (dB). A higher PSNR value indicates better quality of the reconstructed image and a higher degree of similarity to the original image. The calculation formula is as follows:
$$PSNR = 10 \times \log_{10}\frac{Max_I^2}{MSE}$$
Here, $Max_I$ represents the maximum pixel value in the image, and $MSE$ denotes the mean squared difference between corresponding pixels of the two images. For multi-channel images, $MSE$ is calculated as follows:
$$MSE = \frac{1}{N}\sum_{i=1}^{N}\left(I(i) - \hat{I}(i)\right)^2$$
In this context, $I(i)$ represents the $i$-th pixel value of the real image, $\hat{I}(i)$ denotes the $i$-th pixel value of the generated image, and $N$ indicates the total number of pixels in the image.
The Structural Similarity Index (SSIM) is an image quality assessment metric based on structural information. It takes into account not only the luminance and contrast of the image but also the structural similarity between images. The SSIM value ranges within [−1, 1], with values closer to one indicating higher similarity between the generated image and the reference image. The calculation formula for SSIM is as follows:
$$SSIM(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$
In the formula, $\mu_x$ and $\mu_y$ represent the mean values of images $x$ and $y$, respectively; $\sigma_x$ and $\sigma_y$ are their standard deviations; $\sigma_{xy}$ is the covariance between $x$ and $y$; and $C_1$ and $C_2$ are constants used to stabilize the formula.
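For reference, simple PyTorch implementations of the two metrics. The SSIM here is a simplified global-statistics version of the formula above, whereas standard evaluation typically uses a windowed variant; the constants follow the common choices $C_1 = (0.01)^2$ and $C_2 = (0.03)^2$ for images in $[0, 1]$.

```python
import torch

def psnr(img1, img2, max_val=1.0):
    """PSNR in dB between two images with pixel values in [0, max_val]."""
    mse = torch.mean((img1 - img2) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM from global image statistics (illustrative, not windowed)."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(unbiased=False), y.var(unbiased=False)
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```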

3.2. Experimental Results Comparative Analysis

We conducted super-resolution experiments from 16 × 16 to 128 × 128 on the FFHQ validation set, with the experimental results shown in Figure 11.
In this paper’s experiments, we compare our method with other advanced face super-resolution methods (SR3, FSRGAN [37], SRFLOW [38], SRCNN, SRGAN [12], PULSE [39]). Analysis of Figure 12 shows that our conditional diffusion implicit model, which integrates facial attributes as prior knowledge, better restores facial details and textures, offering superior visual quality and realism in super-resolution reconstruction. SR3, as a pure diffusion model, lacks explicit semantic guidance, leading to blurry facial features (e.g., eyes, lips) and reliance on random noise for high-frequency detail recovery, which can cause jagged edges or artifacts. It is also computationally expensive, requiring 2000 iterative denoising steps, which slows down inference. FSRGAN, based on GANs, struggles with limited image diversity, often producing similar hairstyles or skin tones. This is due to GANs’ tendency to fall into local optima and their sensitivity to hyperparameters, causing unstable training. SRFLOW, based on normalizing flows, has slow generation due to the need for complete forward and backward propagation. Flow models also tend to over-smooth textures, resulting in detail loss.
In terms of objective evaluation metrics, the experimental results show that our proposed conditional diffusion implicit model, which integrates facial attributes as prior knowledge, achieves the best scores in both PSNR and SSIM compared to other advanced models. Specifically, our model’s PSNR is 2.16 dB higher than SR3, 3.36 dB higher than SRFLOW, and 3.67 dB higher than FSRGAN. For SSIM, our model is 0.08 higher than SR3, 0.12 higher than SRFLOW, and 0.14 higher than FSRGAN. The detailed data are shown in Table 2.

3.3. Robustness Analysis on MAFA Dataset

We retained the same model architecture and hyperparameters as in previous experiments (Section 3.1.2) but fine-tuned the pre-trained FACDIM model on MAFA for 50 epochs to adapt to occlusion patterns. To ensure fairness, all baseline methods (SR3, FSRGAN, SRFLOW) were also fine-tuned under identical settings. We evaluated both objective metrics (PSNR, SSIM) and perceptual quality by measuring FID scores between generated images and ground-truth HR images. Additionally, we introduced Occlusion Robustness Score (ORS), defined as the average SSIM difference between occluded and non-occluded regions, to quantify the model’s ability to recover masked areas. Figure 13 shows the experimental results of each model on the MAFA dataset and the enlarged comparison figure.
Table 3 compares the performance of FACDIM and the baseline methods on MAFA. Our model achieves a PSNR of 24.31 dB and an SSIM of 0.69, outperforming SR3 by 1.52 dB and 0.07, respectively. Figure 13 visualizes the super-resolution results for a heavily occluded face: SR3 generates blurred features, while FSRGAN introduces artifacts around the occluded regions.

3.4. Ablation Experiment

To validate the effectiveness of the proposed AdaGN and self-attention fusion strategy, we conducted an ablation study comparing it with three alternative fusion approaches:
  • Concatenation + MLP: Concatenate the low-resolution features and attribute vectors, followed by an MLP for dimensionality reduction.
  • Cross-Attention: Use a cross-attention layer where attribute vectors serve as keys/values and image features as queries.
  • Channel Attention: Apply SENet-style channel attention to weight feature channels based on attribute vectors.
The experiments were performed on the FFHQ validation set under 8× super-resolution settings. All variants shared identical backbone architectures (PSRM, FAEM, FGFVE) and training configurations. The detailed data are shown in Table 4.
The experimental results show the following: AdaGN achieves the highest fidelity (PSNR/SSIM), demonstrating its superiority in preserving facial details and semantic consistency. AdaGN is 43% faster than cross-attention and requires fewer parameters due to lightweight affine transformations. Concatenation + MLP struggles with feature misalignment, while cross-attention introduces computational overhead from query-key interactions. AdaGN dynamically scales and shifts feature maps using condition vectors, enabling efficient and implicit fusion without disrupting spatial relationships. This aligns with diffusion models’ need for stable gradient flow during iterative denoising.
To verify the effectiveness of our module design in the task of facial image super-resolution, we set up four control experiments:
  • Without using the facial attribute vector extraction module;
  • Without employing the global facial feature vector encoder;
  • Removing the implicit diffusion model setup;
  • Removing AdaGN fusion of facial attribute vectors as prior knowledge.
The complete model is also tested. Results, evaluated by PSNR and SSIM, are shown in Table 5.
In addition, to verify the control of facial attribute vectors on facial image super-resolution, this paper conducted an attribute-control sensitivity analysis, testing and analyzing eight attributes from the CelebA dataset. In the experiment, the output values of the facial attribute vector extraction module were modified while keeping other attribute vector values unchanged. The differences in super-resolution results were observed by changing single attribute vector values. The experimental results are shown in Figure 14.
Experimental analysis shows that setting the eight attribute vectors to a maximum value of one creates significant feature differences and visual semantic consistency in the super-resolved images. To verify the model’s ability to precisely control local features via facial attribute vectors, experiments with varying vector magnitudes were conducted. The results, shown in Figure 15, demonstrate that the model can adjust the expression of specific attributes by altering corresponding vector values. This confirms that the model achieves fine-grained semantic editing through attribute-conditioned manipulation, ensuring reliable semantic controllability in face super-resolution tasks.
To fully evaluate the model’s strengths in terms of inference speed, resource usage, and complexity, this experiment compares the following key metrics: inference time, VRAM usage during training and inference, and parameter count.
Inference time is the duration taken to generate a 128 × 128 image on an NVIDIA A100 with a batch size of 1. It is the average of 100 trials, excluding I/O time. Training VRAM is the peak usage during training, while inference VRAM is the peak usage for a single-image inference. The parameter count is the total number of trainable parameters in the model. Experimental results are presented in Table 6.
Using implicit diffusion, our model reduces the number of sampling steps and thus shortens inference time compared to SR3. The single-step inference complexity of diffusion models is dominated by the U-Net structure. The improved U-Net employs FPARB (attribute-aware residual blocks) and self-attention mechanisms, reducing the single-step complexity to $O(T \cdot L \cdot H \cdot W \cdot C^2)$, where $T$ is the time-step embedding dimension and $L$ is the network depth. Compared to traditional diffusion models such as SR3, FACDIM halves the number of sampling steps through implicit representation (from 2000 steps in our experiments to 1000), optimizing the overall time complexity from $O(T N^2)$ to $O(T_{reduced} \cdot N \log N)$ thanks to parameter-shared features. Diffusion models nevertheless remain slower than FSRGAN because of multi-step iteration: the complexity of single-step GAN-based generation is $O(N^2)$, and FSRGAN's inference is fast (0.7 s), but its generation quality and controllability are insufficient, whereas our experimental results show much higher fidelity. Thanks to the lightweight feature vector encoder and the shared conditional-mapping MLP, our model's training VRAM usage is lower than SRFLOW's, and it needs only 50% of SRFLOW's inference VRAM since intermediate variables do not have to be stored for backpropagation. In terms of parameter count, our model has 13% fewer parameters than SR3 owing to the shared AdaGN parameters in conditional fusion. SRFLOW requires complete forward and backward propagation with a complexity as high as $O(N^3)$, resulting in an inference latency of 10.3 s and a parameter count (207M) significantly higher than FACDIM's; compared to SRFLOW, our model reduces parameters by 38%, as flow models contain many dense invertible units. In the security surveillance scenario (640 × 480 input, 1080p target output), the single-frame inference time of FACDIM is around 2.3 s on an A100 GPU; with parallel optimization, it achieves an end-to-end latency of about 4.6 s, meeting some near-real-time requirements (such as a response within 5 s). If deployed on edge devices (NVIDIA Jetson AGX Xavier), the inference time increases to 8.2 s, so further lightweighting is still needed. The current FACDIM achieves 0.43 FPS on the A100, better than SR3 (0.18 FPS) and SRFLOW (0.10 FPS).

4. Conclusions

This paper presents FACDIM, a conditional diffusion implicit model for face image super-resolution that combines conditional guidance with implicit diffusion sampling. The model strengthens facial detail reconstruction by integrating facial attributes extracted from the low-resolution input and a global feature vector as prior knowledge, which improves both image fidelity and perceptual quality. Experiments show that FACDIM outperforms existing methods in edge clarity and attribute consistency while offering faster inference than other diffusion- and flow-based approaches. In addition, attribute-conditioned manipulation enables fine-grained semantic control, such as precise adjustments of smiles and eye bags, offering a new solution for controllable face super-resolution.
Practical applications include enhancing low-quality surveillance footage, restoring old films, optimizing social media and personal images, and enabling personalized attribute editing. However, the model has limitations: its adaptation to cross-domain data still needs improvement, and mild interference between attributes occurs when complex attribute combinations are edited. Future work could explore integrating more facial priors, such as key points and parsing maps, to further advance face super-resolution technology and broaden its real-world applications.

Author Contributions

Conceptualization, J.R. and Y.G.; methodology, J.R., Y.G., and Q.L.; software, Y.G.; validation, J.R., Y.G., and Q.L.; formal analysis, Y.G. and J.R.; investigation, Y.G. and J.R.; resources, Y.G. and J.R.; data curation, J.R. and Y.G.; writing—original draft preparation, Y.G.; writing—review and editing, Y.G. and J.R.; visualization, Y.G. and J.R.; supervision, J.R. and Q.L.; project administration, Y.G. and J.R.; funding acquisition, Q.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by (1) Liaoning Provincial Department of Education Scientific Research Project (no. JYTMS20230819); (2) Liaoning Technical University Doctoral Scientific Research Startup Fund (no. 21-1043).

Institutional Review Board Statement

Ethics approval is not required for this type of study. This research utilizes three publicly available, pre-existing datasets (CelebA, FFHQ and MAFA) that are fully anonymized and legally accessible for non-commercial academic purposes. Our work involves no interaction with human subjects, no new data collection, and no disclosure of private information.

Data Availability Statement

The original data presented in the study are openly available in CelebA dataset at http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html (accessed on 8 May 2025), FFHQ dataset at https://github.com/NVlabs/ffhq-dataset (accessed on 8 May 2025), and MAFA dataset at http://www.escience.cn/people/geshiming/mafa.html (accessed on 8 May 2025).

Acknowledgments

The authors would like to thank the reviewers for their valuable comments during the preparation of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
$x_0$: original high-resolution image
$x_T$: pure noise image
$x_t$: intermediate image after adding noise at step $t$
$y$: generated high-resolution image
$y_t$: noisy image at step $t$
$\alpha_t$: noise variance (hyperparameter)
$\bar{\alpha}_t$: product of the noise variances over the first $t$ steps
$I$: identity matrix
$\lambda_t, k_t, \sigma_t$: undetermined coefficients
$f_\theta$: denoising model

Appendix A

When t = 1, FACDIM satisfies the following formula:
q(y_1 \mid y) = \mathcal{N}\big(y_1;\ \sqrt{\bar{\alpha}_1}\, y,\ (1 - \bar{\alpha}_1) I\big) \qquad (A1)
Suppose the relation also holds at step $t$, i.e., $q(y_t \mid y) = \mathcal{N}\big(y_t;\ \sqrt{\bar{\alpha}_t}\, y,\ (1 - \bar{\alpha}_t) I\big)$. By mathematical induction, as long as $q(y_{t-1} \mid y) = \mathcal{N}\big(y_{t-1};\ \sqrt{\bar{\alpha}_{t-1}}\, y,\ (1 - \bar{\alpha}_{t-1}) I\big)$ follows from this assumption, the above marginal distribution holds at every time step.
It therefore suffices to satisfy the marginal distribution condition:
\int q(y_{t-1} \mid y_t, y)\, q(y_t \mid y)\, \mathrm{d}y_t = q(y_{t-1} \mid y) \qquad (A2)
Thus, the problem becomes finding a set of solutions $\lambda_t, k_t, \sigma_t$ to Equation (15), given the known Equation (A1), such that
q(y_{t-1} \mid y) = \mathcal{N}\big(y_{t-1};\ \sqrt{\bar{\alpha}_{t-1}}\, y,\ (1 - \bar{\alpha}_{t-1}) I\big) \qquad (A3)
The sampling of $y_{t-1}$ can then be performed according to Formula (15):
y_{t-1} = \lambda_t y + k_t y_t + \sigma_t \varepsilon_{t-1} \qquad (A4)
where $\varepsilon_{t-1} \sim \mathcal{N}(0, I)$. According to $q(y_t \mid y) = \mathcal{N}\big(y_t;\ \sqrt{\bar{\alpha}_t}\, y,\ (1 - \bar{\alpha}_t) I\big)$, $y_t$ can be sampled as:
y_t = \sqrt{\bar{\alpha}_t}\, y + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon_t \qquad (A5)
where $\varepsilon_t \sim \mathcal{N}(0, I)$.
By the additivity of the normal distribution, substituting Equation (A5) into Equation (A4) gives:
y_{t-1} = \big(\lambda_t + k_t \sqrt{\bar{\alpha}_t}\big)\, y + k_t \sqrt{1 - \bar{\alpha}_t}\, \varepsilon_t + \sigma_t \varepsilon_{t-1} \qquad (A6)
Alternatively, $y_{t-1}$ can be sampled directly from (A3):
y_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, y + \sqrt{1 - \bar{\alpha}_{t-1}}\, \bar{\varepsilon}_{t-1} \qquad (A7)
Matching the coefficients of (A6) and (A7) so that Equation (A2) is satisfied, we obtain:
\begin{cases} \lambda_t + k_t \sqrt{\bar{\alpha}_t} = \sqrt{\bar{\alpha}_{t-1}} \\ k_t^2 (1 - \bar{\alpha}_t) + \sigma_t^2 = 1 - \bar{\alpha}_{t-1} \end{cases} \;\Longrightarrow\; \begin{cases} \lambda_t = \sqrt{\bar{\alpha}_{t-1}} - \sqrt{\dfrac{1 - \bar{\alpha}_{t-1} - \sigma_t^2}{1 - \bar{\alpha}_t}}\, \sqrt{\bar{\alpha}_t} \\[4pt] k_t = \sqrt{\dfrac{1 - \bar{\alpha}_{t-1} - \sigma_t^2}{1 - \bar{\alpha}_t}} \end{cases} \qquad (A8)
Equation (A8) contains three unknowns but only two constraints, so there are infinitely many solutions satisfying Equation (A2). FACDIM treats $\sigma_t$ as a free variable whose magnitude determines the degree of randomness in the sampling process.
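As a quick consistency check (our addition, not part of the original derivation), substituting this choice of $k_t$ and $\lambda_t$ back into the two conditions of (A8) confirms that both are met for any admissible $\sigma_t$:

```latex
% Verification that the parameterized solution satisfies both matching conditions in (A8).
k_t^2\,(1-\bar{\alpha}_t) + \sigma_t^2
  = \bigl(1-\bar{\alpha}_{t-1}-\sigma_t^2\bigr) + \sigma_t^2
  = 1-\bar{\alpha}_{t-1},
\qquad
\lambda_t + k_t\sqrt{\bar{\alpha}_t}
  = \Bigl(\sqrt{\bar{\alpha}_{t-1}} - k_t\sqrt{\bar{\alpha}_t}\Bigr) + k_t\sqrt{\bar{\alpha}_t}
  = \sqrt{\bar{\alpha}_{t-1}}.
```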
Therefore, the sampling distribution of FACDIM is as follows:
q(y_{t-1} \mid y_t, y) = \mathcal{N}\!\left(y_{t-1};\ \sqrt{\bar{\alpha}_{t-1}}\, y + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2}\cdot \frac{y_t - \sqrt{\bar{\alpha}_t}\, y}{\sqrt{1 - \bar{\alpha}_t}},\ \sigma_t^2 I\right) \qquad (A9)
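For clarity, one reverse step implied by Equation (A9) can be written as the sketch below, assuming the denoising network's output has already been converted into a clean-image estimate; the function name and interface are illustrative, not the paper's implementation.

```python
import torch

def implicit_reverse_step(y_t: torch.Tensor, y0_hat: torch.Tensor,
                          alpha_bar_t: float, alpha_bar_prev: float,
                          sigma_t: float) -> torch.Tensor:
    """One reverse sampling step following Equation (A9).

    y0_hat is the clean-image estimate obtained from the denoising model f_theta;
    setting sigma_t = 0 gives a fully deterministic (implicit) update.
    """
    # Recover the noise direction implied by y_t and the clean estimate (cf. Equation (A5)).
    eps = (y_t - alpha_bar_t ** 0.5 * y0_hat) / (1.0 - alpha_bar_t) ** 0.5
    # Mean of q(y_{t-1} | y_t, y) from Equation (A9).
    mean = (alpha_bar_prev ** 0.5 * y0_hat
            + (1.0 - alpha_bar_prev - sigma_t ** 2) ** 0.5 * eps)
    noise = torch.randn_like(y_t) if sigma_t > 0 else torch.zeros_like(y_t)
    return mean + sigma_t * noise
```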

References

  1. Liu, A.; Liu, Y.; Gu, J. Blind Image Super-Resolution: A Survey and Beyond. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 38, 295–307. [Google Scholar] [CrossRef] [PubMed]
  2. Taigman, Y.; Yang, M.; Ranzato, M.A. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1701–1708. [Google Scholar]
  3. Yang, J.; Luo, L.; Qian, J. Nuclear norm based matrix regression with applications to face recognition with occlusion and illumination changes. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 156–171. [Google Scholar] [CrossRef] [PubMed]
  4. Jourabloo, A.; Mao, Y.; Liu, X. Pose-Invariant Face Alignment with a Single CNN. In Proceedings of the IEEE International Conference on Computer Vision, San Juan, PR, USA, 17–19 June 2017; pp. 3200–3209. [Google Scholar]
  5. Sun, K.; Li, Q.M.; Li, D.Q. Face detection algorithm based on cascaded convolutional neural network. J. Nanjing Univ. Sci. Technol. 2018, 42, 40–47. [Google Scholar]
  6. Kim, J.; Lee, J.K.; Lee, K.M. Accurate Image Super-Resolution Using Very Deep Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
  7. Shermeyer, J.; Van Etten, A. The Effects of Super-Resolution on Object Detection Performance in Satellite Imagery. In Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019; pp. 1432–1441. [Google Scholar]
  8. Kim, J.; Lee, J.K.; Lee, K.M. Deeply-Recursive Convolutional Network for Image Super-Resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1637–1645. [Google Scholar]
  9. Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced Deep Residual Networks for Single Image Super-Resolution. In Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 325–337. [Google Scholar]
  10. Chen, C.; Gong, D.; Wang, H.; Li, Z.; Wong, K.-Y.K. Learning spatial attention for face super-resolution. IEEE Trans. Image Process. 2021, 30, 1219–1231. [Google Scholar] [CrossRef] [PubMed]
  11. Gao, G.; Xu, Z.; Li, J. CTCNet: A CNN-transformer cooperation network for face image super-resolution. IEEE Trans. Image Process 2023, 32, 1978–1991. [Google Scholar] [CrossRef] [PubMed]
  12. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In Proceedings of the IEEE Conference on Computing Research Repository, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  13. Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; Change Loy, C. ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks. In Proceedings of the Computer Vision-ECCV, Munich, Germany, 8–14 September 2018; Volume 11133, pp. 63–79. [Google Scholar]
  14. Zhang, W.; Liu, Y.; Chao, D.; Yu, Q. RankSRGAN: Super Resolution Generative Adversarial Networks with Learning to Rank. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7149–7166. [Google Scholar] [CrossRef] [PubMed]
  15. Chen, X.F.; Shen, H.J.; Bian, Q. Face super-resolution reconstruction combined with attention mechanism. J. Xidian Univ. 2019, 3, 148–153. [Google Scholar]
  16. Zhang, M.; Ling, Q. Supervised pixel-wise GAN for face super-resolution. IEEE Trans. Multimed. 2021, 23, 1938–1950. [Google Scholar] [CrossRef]
  17. Shi, J.; Wang, Y.; Yu, Z. Exploiting multi-scale parallel self-attention and local variation via dual-branch transformer-CNN structure for face super-resolution. IEEE Trans. Multimed. 2024, 26, 2608–2620. [Google Scholar] [CrossRef]
  18. Qi, H.; Qiu, Y.; Luo, X. An efficient latent style guided transformer-CNN framework for face super-resolution. IEEE Trans. Multimed. 2024, 26, 1589–1599. [Google Scholar] [CrossRef]
  19. Grm, K.; Scheirer, W.J. Face hallucination using cascaded super-resolution and identity priors. IEEE Trans. Image Process. 2020, 29, 2150–2165. [Google Scholar] [CrossRef] [PubMed]
  20. Ma, C.; Jiang, Z.; Rao, Y. Deep face super-resolution with iterative collaboration between attentive recovery and landmark estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5568–5577. [Google Scholar]
  21. Saharia, C.; Ho, J.; Chan, W.; Salimans, T. Image super-resolution via iterative refinement. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4713–4726. [Google Scholar] [CrossRef] [PubMed]
  22. Gao, S.; Liu, X.; Zeng, B. Implicit diffusion models for continuous super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1–9. [Google Scholar]
  23. Yue, Z.; Wang, J.; Loy, C.C. ResShift: Efficient diffusion model for image super-resolution by residual shifting. In Proceedings of the Computing Research Repository (CoRR), New Orleans, LA, USA, 9–19 December 2023; pp. 1–12. [Google Scholar]
  24. Xiao, Y.; Yuan, Q.; Jiang, K. EDiffSR: An efficient diffusion probabilistic model for remote sensing image super-resolution. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14. [Google Scholar] [CrossRef]
  25. Haodong, C.; Niloofar, Z.; Ming, C.L. Fine-grained Activity Classification in Assembly Based on Multi-Visual Modalities. J. Intell. Manuf. 2024, 35, 2215–2233. [Google Scholar]
  26. Haris, M.; Shakhnarovich, G.; Ukita, N. Deep back-projection networks for superresolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1664–1673. [Google Scholar]
  27. Huang, G.; Liu, Z.; Van Der Maaten, L. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  28. Li, J.; Fang, F.; Mei, K.; Zhang, G. Multi-scale residual network for image super-resolution. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 527–542. [Google Scholar]
  29. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity mappings in deep residual networks. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer International Publishing: Berlin, Germany, 2016; pp. 630–645. [Google Scholar]
  30. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE Press: Washington, DC, USA, 2018; pp. 7132–7141. [Google Scholar]
  31. Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. In Proceedings of the 9th International Conference on Learning Representations, Virtual Event, Austria, 3–7 May 2021; pp. 1–22. [Google Scholar]
  32. Dhariwal, P.; Nichol, A. Diffusion Model Beat GANs on Image Synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 8780–8794. [Google Scholar]
  33. Vaswani, A.; Shazeer, N.; Parmar, N. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; Volume 11, pp. 6000–6010. [Google Scholar]
  34. Karras, T.; Laine, S.; Aila, T. A Style-Based Generator Architecture for Generative Adversarial Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4396–4405. [Google Scholar]
  35. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep Learning Face Attributes in the Wild. In Proceedings of the IEEE International Conference on Computer Vision, Columbus, OH, USA, 23–28 June 2014; pp. 3730–3738. [Google Scholar]
  36. Shiming, G.; Jia, L.; Qiting, Y. Detecting Masked Faces in the Wild with LLE-CNNs. In Proceedings of the Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 426–434. [Google Scholar]
  37. Zhou, W.; Hong, C.; Wang, X.; Zeng, Z. FSRGAN-DB: Super-resolution Reconstruction Based on Facial Prior Knowledge. In Proceedings of the IEEE International Conference on Big Data, Atlanta, GA, USA, 10–13 December 2020; pp. 3380–3386. [Google Scholar]
  38. Lugmayr, A.; Danelljan, M.; Van Gool, L.; Timofte, R. SRFlow: Learning the Super-Resolution Space with Normalizing Flow. In Proceedings of the Computing Research Repository (CoRR), Glasgow, UK, 23–28 August 2020; pp. 715–732. [Google Scholar]
  39. Menon, S.; Damian, A.; Hu, S.; Ravi, N.; Rudin, C. PULSE: Self-Supervised Photo Upsampling Via Latent Space Exploration of Generative Models. In Proceedings of the Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2434–2442. [Google Scholar]
Figure 1. The overall architecture diagram of the model.
Figure 2. Structure diagram of the pre-super-resolution module.
Figure 3. Structure diagram of the facial attribute extraction module.
Figure 4. Structure diagram of the facial global feature vector encoder.
Figure 5. Diffusion implicit model.
Figure 6. Structure diagram of the fusion of prior attribute residual block.
Figure 7. Improved U-net structure.
Figure 8. FACDIM at 8× magnification under the Super-Resolution U-net framework (16 × 16→128 × 128).
Figure 9. Downsampling.
Figure 10. Upsampling.
Figure 11. Results of the diffusion super-resolution experiment.
Figure 12. Comparison of various methods for 8× super-resolution local magnification on the CelebA dataset.
Figure 13. Comparison of various methods for 8× super-resolution local magnification on the MAFA dataset.
Figure 14. Single face attribute vector value change controls super-resolution rate result effect.
Figure 15. Facial attribute vector value changes control feature variations.
Table 1. Conceptual and technical distinctions between FACDIM and existing methods.
Aspect | FACDIM | SPGAN [16] | C-SRIP [19] | ELSFace [18]
Architecture | Conditional diffusion implicit model (DDIM) with AdaGN-based fusion of attribute priors and global features | Pixel-wise GAN with identity-aware discriminators and supervised pixel-level adversarial loss | Cascaded CNN + identity priors (pre-trained face recognition models) | Transformer-CNN hybrid with latent style guidance and high-frequency enhancement
Supervision Strategy | Joint supervision: noise prediction (L1/L2) + edge smoothness loss + attribute priors via AdaGN | Pixel-level adversarial loss + identity preservation via pre-trained recognition models | Multi-scale SSIM loss + cross-entropy identity loss | Perceptual loss + multi-scale Transformer-CNN feature alignment
Training Objective | Optimizes for high-fidelity detail recovery and attribute-conditional generation | Maximizes pixel-level realism and identity similarity | Focuses on identity preservation through cascaded refinement | Balances global structure (Transformer) and local textures (CNN)
Controllability | Dynamic attribute control via manual adjustment of semantic vectors (e.g., young, smile) | No user-controllable attributes; focuses on fixed identity preservation | No direct control; relies on identity priors for implicit consistency | Limited to latent style interpolation; no explicit attribute manipulation
Key Innovation | Attribute-aware diffusion with lightweight fusion (AdaGN) + efficient sampling | Supervised pixel-level discriminator for artifact suppression | Iterative cascaded refinement guided by identity priors | Efficient latent style-guided fusion of global and local features
Table 2. Evaluation index values of each model at 8× super-resolution on the FFHQ dataset.
Model | PSNR (dB) | SSIM | LPIPS | FID
Bicubic | 20.11 | 0.668 | 0.286 | 102.872
PULSE | 19.66 | 0.526 | 0.385 | 63.949
SRGAN | 21.49 | 0.560 | 0.209 | 33.817
SRCNN | 22.87 | 0.592 | 0.247 | 71.316
SRFLOW | 23.43 | 0.63 | 0.222 | 39.789
FSRGAN | 23.12 | 0.61 | 0.195 | 55.648
SR3 | 24.63 | 0.67 | 0.170 | 22.777
Ours | 26.79 | 0.75 | 0.122 | 16.785
Table 3. Evaluation index values of each model at 8× super-resolution on the MAFA dataset.
Model | PSNR (dB) | SSIM | FID
SRFLOW | 23.12 | 0.65 | 35.89
FSRGAN | 21.95 | 0.58 | 41.67
SR3 | 22.79 | 0.62 | 28.41
Ours | 24.31 | 0.69 | 19.83
Table 4. Model performance results under different fusion strategies.
Fusion Method | PSNR (dB) | SSIM | Inference Time (s) | Parameter Quantity (M) | VRAM (GB)
Concatenation + MLP | 25.32 | 0.70 | 2.8 | 128 | 6.2
Cross-Attention | 25.98 | 0.73 | 4.1 | 134 | 7.5
Channel Attention | 25.67 | 0.71 | 3.2 | 127 | 6.5
AdaGN (Ours) | 26.79 | 0.75 | 2.3 | 124 | 6.0
Table 5. Module design effectiveness experiment.
Experiment Number | Model Situation | PSNR (dB) | SSIM
1 | Remove the PSRM and FAEM | 23.21 | 0.63
2 | Remove the FGFVE | 21.03 | 0.58
3 | Remove the implicit diffusion model | 24.12 | 0.66
4 | Remove AdaGN fusion | 22.76 | 0.61
5 | Complete experimental model | 26.79 | 0.72
Table 6. Experimental results of computational efficiency for each model.
Model | Inference Time (s) | Training VRAM (GB) | Inference VRAM (GB) | Parameter Quantity (M)
SRFLOW | 10.3 | 31 | 12 | 207
FSRGAN | 0.7 | 18 | 4 | 98
SR3 | 5.6 | 24 | 8 | 145
Ours | 2.3 | 20 | 6 | 124
