Article

PDGrad: Guiding Diffusion Model for Reference-Based Blind Face Restoration with Pivot Direction Gradient Guidance

1 Department of Artificial Intelligence, Ajou University, Suwon 16499, Republic of Korea
2 Department of Electrical and Computer Engineering, Ajou University, Suwon 16499, Republic of Korea
* Author to whom correspondence should be addressed.
Sensors 2024, 24(22), 7112; https://doi.org/10.3390/s24227112
Submission received: 30 September 2024 / Revised: 27 October 2024 / Accepted: 2 November 2024 / Published: 5 November 2024
(This article belongs to the Section Sensing and Imaging)

Abstract

Reference-based blind face restoration (RefBFR) has gained considerable attention because it utilizes additional reference images to restore facial images degraded by unknown factors, making it particularly useful in real-world applications. Recently, guided diffusion models have demonstrated exceptional performance in this task without requiring training. They achieve this by integrating the gradients of multiple losses, where each loss reflects a desired property of the additional external images. However, these approaches fail to consider potential conflicts between the gradients of multiple losses, which can lead to sub-optimal results. To address this issue, we introduce Pivot Direction Gradient guidance (PDGrad), a novel gradient adjustment method for RefBFR within a guided diffusion framework. To this end, we first define the loss function based on both low-level and high-level features. For the loss at each feature level, both the coarsely restored image and the reference image are fully integrated. In cases of conflicting gradients, a pivot gradient is established for each level and the other gradients are aligned to it, ensuring that the strengths of both images are maximized. Additionally, if the magnitude of the adjusted gradient exceeds that of the pivot gradient, it is adaptively scaled according to the ratio between the two, placing greater emphasis on the pivot. Extensive experimental results on the CelebRef-HQ dataset show that the proposed PDGrad significantly outperforms competitive approaches both quantitatively and qualitatively.

1. Introduction

Blind face restoration (BFR) aims to restore a high-quality (HQ) face image from a low-quality (LQ) image that has been degraded by unknown and complex factors, such as downsampling, blur, noise, and compression artifacts. BFR is a highly ill-posed problem because the unknown degradation makes it difficult to determine a single solution for a given LQ image, leading to multiple possible outcomes. Since facial images are sensitive to even subtle differences, detailed information is essential for accurate restoration. By utilizing HQ reference images of the same individual, it becomes possible to achieve a level of image quality that is difficult to attain with BFR methods that do not use reference images. In this context, reference-based blind face restoration (RefBFR) has gained significant attention for its ability to leverage additional reference images to improve restoration accuracy in practical scenarios. As a result, it benefits various applications, including face recognition [1,2], face detection [3,4,5] and age estimation [6,7].
Recently, several RefBFR studies [8,9,10,11,12,13] have been proposed based on deep learning [14]. Among these methods, PGDiff [13] has demonstrated outstanding performance in RefBFR using a training-free guided diffusion model [15]. It guides an unconditional diffusion model pre-trained for face image generation by incorporating the gradients of the losses during the reverse diffusion process. Its loss function is structured as a combination of multiple distances, each representing a specific desired attribute of the additional images. These include the coarsely restored image, generated using an external restorer such as CodeFormer [16], and the reference image, processed through the ArcFace network [1]. However, the guidance technique of PGDiff [13] may not be the optimal solution for RefBFR. This limitation arises because its gradients use low-level information solely from the coarsely restored image, while relying on high-level information exclusively from the reference image. As a result, this approach fails to capture the crucial low-level details from the reference image and the high-level features from the coarsely restored image, leading to sub-optimal results. Moreover, guidance derived from merely summing the gradients of multiple loss functions is often sub-optimal, as these gradients may be incompatible and cause conflicts.
To address this problem, we propose a novel gradient adjustment method for RefBFR called Pivot Direction Gradient guidance (PDGrad) within a guided diffusion framework. Inspired by PCGrad [17], the essence of our method is to reduce gradient interference by directly modifying the conflicting gradients of the loss. To this end, we first define the loss function based on both low-level and high-level features. Similar to PGDiff [13], we utilize external information, namely the coarsely restored image y_c, obtained using a pre-trained restoration method such as CodeFormer [16], and the reference image y_r. However, unlike PGDiff, we utilize both y_c and y_r to compute the loss at each level, because these two images capture complementary characteristics of face images. Figure 1 illustrates the complementary properties of y_c and y_r. Generally, y_c is well aligned with the LQ input, making it easy to compare with the prediction in terms of low-level information such as edges, color and shape. However, certain areas of y_c are not restored effectively. In contrast, y_r provides more reliable high-level information, such as identity, and is only partially aligned with the input, helping to compensate for the low-level details in regions where y_c suffers from significant degradation. Based on this observation, our approach efficiently and comprehensively leverages both images, enabling the effective integration of detailed and contextual information from both y_c and y_r.
In this situation, simply summing the gradients of the losses at each level can lead to conflicting gradients. To address this issue, we establish a proper pivot gradient for the loss at each feature level and align the other gradients to this pivot when conflicts arise. This approach allows us to fully harness the distinct advantages of both y_c and y_r. Specifically, for the loss using low-level features, the gradient of the loss defined with y_c is prioritized, and the gradient of the loss defined with y_r is modified, when a conflict arises, by projecting it onto the plane orthogonal to the gradient associated with y_c. Conversely, for the loss using high-level features, the gradient of the loss defined with y_r is emphasized, and the gradient of the loss defined with y_c is projected onto the plane orthogonal to that of y_r to avoid conflict and fully utilize the information in y_r. Additionally, if the magnitude of the adjusted gradient exceeds that of the pivot gradient, it is adaptively scaled according to the ratio between the two, placing greater emphasis on the pivot. As exemplified in Figure 1, the proposed PDGrad outperforms previous methods by preserving the properties of the prioritized image at each feature level while selectively extracting, from the other image, only those components that align with the prioritized one.
In summary, the proposed method provides the following key contributions:
  • We propose a novel gradient adjustment method called PDGrad for RefBFR within a training-free guided diffusion framework.
  • The loss function of the proposed method consists of two components: low-level and high-level losses, where both the coarsely restored image and the reference image are fully incorporated.
  • Our proposed PDGrad establishes a proper pivot gradient for the loss at each level and adjusts other gradients to align with this pivot by modifying their direction and magnitude, thereby mitigating gradient interference.
  • Extensive comparisons show the superiority of our method against previous state-of-the-art RefBFR methods.
The remainder of this paper is organized as follows: Section 2 discusses previous works on blind face restoration. Section 3 provides a detailed explanation of the proposed PDGrad. In Section 4, we compare and analyze the experimental outcomes of several methods, including our proposed approach. Finally, Section 5 concludes the paper.

2. Related Works

Most recent BFR studies have focused on utilizing face-specific prior information, such as geometric facial priors, reference priors and generative facial priors. Note that our proposed method can be viewed as a study that exploits both reference priors and generative facial priors.
  • Geometric Facial Priors.
Unlike natural images, faces share a common structural shape and components (e.g., eyes, nose, mouth and hair). Inspired by this, several approaches have been proposed to utilize geometric priors, including facial landmarks [18,19], semantic segmentation maps [20,21,22,23] and 3D shapes [24,25,26]. However, as pointed out by [16,27], such priors have limitations in guiding the fine details and texture information of the face (e.g., wrinkles and eye pupils). Furthermore, estimating geometric face priors from severely degraded inputs rarely yields reliable results, which can degrade performance.
  • Reference Priors.
Various methods [8,9,11] have been developed to utilize high-quality facial images of the same individual as references for restoration, aiming to leverage the distinct facial features of each person. However, these methods heavily rely on reference images of the same individual, which are not always easily accessible. To mitigate this issue, DFDNet [10] utilizes a facial component dictionary as reference information. However, this approach may be sub-optimal for face restoration tasks since the dictionary is extracted from a pre-trained face recognition model. Inspired by [10], DMDNet [12] introduces dual dictionaries that extend beyond a single general dictionary, allowing for more flexible handling of degraded inputs, regardless of whether reference images are present. While [10,12] utilize a facial component dictionary extracted from a face recognition model, ENTED [28] introduces a vector-quantized dictionary along with a latent space refinement technique. In contrast to the above methods that leverage a single reference image, ASFFNet [11] utilizes multiple reference images to select the most suitable guidance image and learns landmark weights to improve reconstruction quality. ENTED [28] is a blind face restoration framework that uses a high-quality reference image to restore a single degraded input image; it substitutes corrupted semantic features with high-quality codes, inspired by vector quantization, and generates style codes containing high-quality texture information. PFStorer [29] utilizes a diffusion model for face restoration and super-resolution, using several images of the individual's face to customize the restoration process while preserving fine details.
  • Generative Facial Priors.
Recently, numerous studies have leveraged the capabilities of generative models such as the Generative Adversarial Network (GAN) [30], Vector Quantized-Variational AutoEncoder (VQVAE) [31], Vector Quantized-Generative Adversarial Network (VQGAN) [32] and Denoising Diffusion Probabilistic Models (DDPMs) [33,34,35]. GAN inversion-based methods [36,37] attempt to find the latent vector in the GAN latent space that is closest to a given input image. GFP-GAN [27] and GPEN [38] design their encoder networks to effectively find the latent vector for an input image and then utilize a pre-trained GAN model as the decoder. VQFR [39] uses a vector-quantized (VQ) codebook as a dictionary to enhance high-quality facial details; by employing a parallel decoder to fuse input features with texture features from the VQ codebook, this approach preserves fidelity while achieving detailed facial restoration. CodeFormer [16] is a transformer-based architecture for code prediction that captures the global structure of low-quality facial images, enabling the generation of natural faces even from severely degraded inputs. To adapt to varying levels of degradation, a controllable feature transformation (CFT) module is included, offering a versatile balance between fidelity and quality. RestoreFormer++ [40] enhances facial image restoration by utilizing fully spatial and multi-head cross-attention to merge contextual, semantic and structural information from degraded face features with high-quality priors. PMRF [41] presents an algorithm that predicts the posterior mean and then uses a rectified flow model to transport it to a high-quality image.
Building on the powerful generative capabilities of diffusion models [33,34,35], several studies [13,42,43,44,45,46,47] have explored their application to BFR. DR2 [46] proposes a two-stage framework, DR2E, for blind face restoration that uses a pre-trained diffusion model to remove various types of degradation together with a module for detail enhancement and upsampling, thereby eliminating the need for synthetically degraded data during training. IPC [47] proposes a conditional diffusion-based BFR framework, similar to SR3, to restore severely degraded face images; it employs a region-adaptive strategy that enhances restoration quality while preserving identity information. DifFace [44] establishes a posterior distribution for mapping LQ images to HQ counterparts via a pre-trained diffusion model. To achieve this, the approach estimates a transition distribution from the LQ input image to an intermediate noisy image using a diffused estimator within the diffusion model to enhance robustness to severe degradations. Additionally, it incorporates a Markov chain that transitions the intermediate image to the HQ target image by repeatedly applying a pre-trained diffusion model, which further improves face restoration performance. Lu et al. [48] propose a diffusion-based architecture that incorporates 3D facial priors, which are derived from a 3D face reconstructed from an initially restored image and are integrated into the reverse diffusion process to provide structural and identity information.
However, these studies have not been designed to effectively leverage reference images for further enhancement. Meanwhile, PGDiff [13] proposed a partial guidance approach that can be extended to utilize a reference image; by incorporating identity loss into the diffusion-based restoration method, it outperforms existing diffusion-prior-based methods. Inspired by this, we also propose a method that incorporates both the diffusion prior and the reference prior. However, unlike PGDiff [13], which does not consider the conflicts between gradients arising from multiple losses, our approach effectively addresses these conflicts through the proposed PDGrad, leading to more consistent and high-quality results for RefBFR.

3. Proposed Method

In this section, we provide a preliminary overview of the guided diffusion models to aid understanding of our proposed method in Section 3.1. We then detail the overall process of the proposed method in Section 3.2. Section 3.3 describes the proposed loss function, designed to fully leverage both the coarsely restored image and the reference image at each feature level. Lastly, in Section 3.4, our proposed PDGrad is explained, which is developed to mitigate the conflicting gradient problem.

3.1. Preliminary

3.1.1. Denoising Diffusion Probabilistic Models

Diffusion models [33,34,35] are probabilistic generative models that have recently achieved remarkable success in the field of image generation. A diffusion model consists of a forward process and a reverse process. The forward process gradually adds Gaussian noise to an input image, while the reverse process removes the noise and reconstructs the image from the noisy state.
For an unconditional diffusion model [33] with T discrete steps, there exists a forward transition distribution q(x_t \mid x_{t-1}) at each step t \in \{1, 2, \dots, T\} with a corresponding variance schedule \beta_t:

q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t \mathbf{I}),   (1)

where x_{t-1} and x_t are samples at timesteps t-1 and t, respectively, and x_t is sampled using the reparameterization trick. Moreover, x_t can be sampled directly from x_0:

x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,   (2)

where \bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i, \alpha_t = 1 - \beta_t and \epsilon \sim \mathcal{N}(\epsilon; \mathbf{0}, \mathbf{I}). The sampling process begins with pure Gaussian noise x_T \sim \mathcal{N}(x_T; \mathbf{0}, \mathbf{I}) and gradually performs denoising steps. In practice, the ideal denoising step is approximated by p_\theta(x_{t-1} \mid x_t) [15] as follows:

p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(\mu_\theta(x_t, t), \Sigma_\theta(x_t, t)),   (3)

where \mu_\theta(x_t, t) represents the mean, obtained as a linear combination of x_t and the estimated noise \epsilon_\theta(x_t, t), while \Sigma_\theta(x_t, t) denotes the variance, a constant that depends on the pre-defined \beta_t. From Equation (2), \hat{x}_{0|t} can be computed directly from \epsilon_\theta as:

\hat{x}_{0|t} = \frac{1}{\sqrt{\bar{\alpha}_t}} x_t - \frac{\sqrt{1-\bar{\alpha}_t}}{\sqrt{\bar{\alpha}_t}} \epsilon_\theta(x_t, t).   (4)

ADM [15] introduces guided diffusion to control the sample generation of the diffusion model by leveraging an external classifier p_\phi(c \mid x) that predicts conditioning information c, such as a class label. Using this classifier, the conditional denoising distribution in Equation (3) is approximated as a Gaussian distribution and formulated as:

p_{\theta,\phi}(x_{t-1} \mid x_t, c) \approx \mathcal{N}\big(\mu_\theta(x_t, t) + s\, \Sigma_\theta(x_t, t)\, g,\; \Sigma_\theta(x_t, t)\big),   (5)

where s denotes the strength of the classifier guidance. Here, the unconditional sampling distribution is guided toward the conditional target c by the gradient g, which can be written as:

g = \nabla_x \log p_\phi(c \mid x)\big|_{x = \mu_\theta(x_t, t)}.   (6)
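To make Equations (2) and (4) concrete, the following minimal PyTorch sketch shows how a noisy sample x_t is drawn from x_0 and how \hat{x}_{0|t} is recovered from the predicted noise. The tensor layout and the alpha_bar schedule object are assumptions for illustration, not part of the original implementation.

```python
import torch

def q_sample(x0, t, alpha_bar):
    """Forward diffusion of Eq. (2): sample x_t directly from x_0.
    alpha_bar is a 1-D tensor holding the cumulative products \bar{alpha}_t."""
    eps = torch.randn_like(x0)
    a = alpha_bar[t]
    return a.sqrt() * x0 + (1.0 - a).sqrt() * eps, eps

def predict_x0(xt, eps_pred, t, alpha_bar):
    """Prediction of \hat{x}_{0|t} from the estimated noise, Eq. (4)."""
    a = alpha_bar[t]
    return (xt - (1.0 - a).sqrt() * eps_pred) / a.sqrt()
```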

3.1.2. Partial Guidance

Recently, PGDiff [13] has introduced a training-free method and utilized classifier guidance on an unconditional diffusion model for face restoration by leveraging a pre-trained network through a technique called partial guidance. Specifically, PGDiff [13] decomposes a high-quality face image into smooth semantics and high-frequency details. The smooth semantics of the face are provided by the pre-trained face restoration model, such as CodeFormer [16]. For the high-frequency details, PGDiff relies on the diffusion prior. In addition, by leveraging a reference image and incorporating identity loss into the partial guidance, PGDiff enhances the preservation of personal identity. This identity information is guided using a pre-trained face recognition network, such as ArcFace [1].

3.2. Overview of Our Method

Figure 2 illustrates an overview of the proposed process. Let y \in \mathbb{R}^{H \times W \times C} be the given LQ image and y_r \in \mathbb{R}^{H \times W \times C} the reference HQ image. Our goal is to predict an HQ image x_0 \in \mathbb{R}^{H \times W \times C} by adjusting the conflicting gradients of the loss within a guided diffusion framework [44].
To this end, following PGDiff [13], a coarsely restored image y_c \in \mathbb{R}^{H \times W \times C} is first obtained by applying a pre-trained face restoration model f(\cdot):

y_c = f(y).   (7)

However, unlike PGDiff [13], which begins the reverse process from pure Gaussian noise, our method starts from x_\tau, sampled from y_c, to improve the initialization and reduce the number of sampling steps [49]. As in Equation (2), x_\tau is defined as:

x_\tau = \sqrt{\bar{\alpha}_\tau}\, y_c + \sqrt{1-\bar{\alpha}_\tau}\, \epsilon,   (8)

where \tau \in [0, T] is a hyperparameter that determines the starting point of the reverse process, \bar{\alpha}_\tau = \prod_{i=1}^{\tau} \alpha_i and \epsilon \sim \mathcal{N}(\epsilon; \mathbf{0}, \mathbf{I}). The reverse diffusion process is then performed iteratively using the guided diffusion model [33] as follows:

x_{t-1} \sim \mathcal{N}\big(\mu_\theta(x_t, t) - s\, \Sigma_\theta(x_t, t)\, \nabla_{\hat{x}_{0|t}} L_{total},\; \Sigma_\theta(x_t, t)\big),   (9)

where s represents the guidance strength and \Sigma_\theta(x_t, t) is the time-dependent constant defined in Equation (5). \hat{x}_{0|t} is the image predicted at timestep t (Equation (4)), and \nabla_{\hat{x}_{0|t}} L_{total} denotes the gradient of the total loss with respect to \hat{x}_{0|t}. Details of L_{total} and the computation of this gradient are given below.

3.3. Loss Function

One key goal of our method is to decompose the gradients from both the coarsely restored image and the reference image into their respective low-level and high-level components and then use these as guidance. This enables our approach to effectively leverage both low-level and high-level features from the coarsely restored image and the reference image.
Our total loss L_{total} in Equation (9) at an arbitrary diffusion timestep t is formulated as:

L_{total} = L_{low} + L_{high},   (10)

where L_{low} and L_{high} represent the losses for low-level and high-level features, respectively. For ease of notation, we omit the denoising timestep t. The former focuses on preserving low-level information such as face shape, edges and color, while the latter promotes high-level information such as face identity.
The proposed loss is computed using the coarsely restored image y_c, the reference image y_r, and the predicted image \hat{x}_{0|t} (Equation (4)) at an arbitrary diffusion timestep t. To explicitly incorporate face information from y_r and y_c into the diffusion process, intermediate features at various levels are extracted from the pre-trained ArcFace [1] and VGG16 [50] networks. As discussed in [51], it is well established that the intermediate feature maps of the early layers of a well-trained network capture the low-level information of the input image, while the later layers capture higher-level features.
Specifically, let \{u_i(z)\}_{i=1}^{4} denote the set of features extracted using ArcFace [1], a face recognition network that determines whether two images belong to the same identity. Here, u_i(z) denotes the feature extracted from the i-th intermediate layer of the ArcFace network for an input image z \in \mathbb{R}^{H \times W \times C}. For the low-level loss, we additionally use the VGG16 [50] network trained on a dataset that reflects human perceptual similarity, to better match human preferences. Let \{v_i(z)\}_{i=1}^{5} denote the set of features extracted using VGG16 [50], where v_i(z) is the feature extracted from the i-th intermediate layer of VGG16 for an input image z \in \mathbb{R}^{H \times W \times C}. The specific layers used for u and v are given in the implementation details in Section 4.1. Each loss is explained in detail in the following subsections.
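As a rough illustration of how the feature sets u_i(z) and v_i(z) can be collected, the sketch below registers forward hooks on named layers of a pre-trained network. The helper and the layer names passed to it are hypothetical; the actual module names depend on the ArcFace and VGG16 implementations used (see Section 4.1).

```python
import torch

def collect_features(model, layer_names, image):
    """Gather intermediate feature maps (u_i or v_i) via forward hooks."""
    feats, handles = {}, []
    modules = dict(model.named_modules())
    for name in layer_names:
        def hook(_module, _inp, out, key=name):
            feats[key] = out
        handles.append(modules[name].register_forward_hook(hook))
    model(image)          # one forward pass populates `feats`
    for h in handles:
        h.remove()
    return [feats[n] for n in layer_names]
```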

3.3.1. Low-Level Loss

The proposed low-level loss L_{low} is defined as the sum of two losses:

L_{low} = L_{low}^{c} + L_{low}^{r},   (11)

where L_{low}^{c} measures the similarity between the low-level features of \hat{x}_{0|t} and y_c, while L_{low}^{r} enforces the alignment of low-level features between \hat{x}_{0|t} and y_r. Both losses are defined using pre-trained networks, namely ArcFace [1] and VGG16 [50]. Concretely, L_{low}^{c} is defined as:

L_{low}^{c} = \sum_{i=1}^{3} d_{arc}(u_i(\hat{x}_{0|t}), u_i(y_c)) + \sum_{i=1}^{5} d_{vgg}(v_i(\hat{x}_{0|t}), v_i(y_c)).   (12)

Here, d_{arc}(\cdot,\cdot) [1] is defined as:

d_{arc}(j_1, j_2) = 1 - \frac{j_1 \cdot j_2}{\|j_1\|\,\|j_2\|},   (13)

where j_1 and j_2 are the input vectors whose distance is measured. The distance function d_{vgg}(\cdot,\cdot) [50] is defined as:

d_{vgg}(z_1, z_2) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \big\| w_l^{hw} \odot (z_{1,hw}^{l} - z_{2,hw}^{l}) \big\|_2^2,   (14)

where z_1 and z_2 are the input images being compared, and z_{1,hw}^{l} and z_{2,hw}^{l} represent the feature values of the l-th layer feature maps at spatial location (h, w) for z_1 and z_2, respectively. H_l and W_l denote the height and width of the feature map at the l-th layer, and w_l^{hw} is a weighting factor for the feature difference at spatial location (h, w). The symbol \odot denotes element-wise multiplication.
Similarly, to enforce the alignment of low-level features between \hat{x}_{0|t} and y_r, L_{low}^{r} is formulated as:

L_{low}^{r} = \sum_{i=1}^{3} d_{arc}(u_i(\hat{x}_{0|t}), u_i(y_r)) + \sum_{i=1}^{5} d_{vgg}(v_i(\hat{x}_{0|t}), v_i(y_r)).   (15)
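A minimal sketch of the low-level loss is given below. It follows Equations (11), (12) and (15) but drops the learned weights w_l^{hw} of Equation (14) (set to 1) and assumes helper functions arc_feats and vgg_feats that return the feature lists {u_i} and {v_i}; these simplifications are ours, not part of the original implementation.

```python
import torch
import torch.nn.functional as F

def d_arc(f1, f2):
    """Cosine distance of Eq. (13) on flattened feature tensors."""
    v1, v2 = f1.flatten(1), f2.flatten(1)
    return (1.0 - F.cosine_similarity(v1, v2, dim=1)).mean()

def d_vgg(feats1, feats2):
    """Simplified perceptual distance of Eq. (14) with unit weights."""
    total = 0.0
    for z1, z2 in zip(feats1, feats2):
        z1 = F.normalize(z1, dim=1)   # unit-normalize along channels, as in LPIPS-style metrics
        z2 = F.normalize(z2, dim=1)
        total = total + (z1 - z2).pow(2).sum(dim=1).mean()
    return total

def low_level_loss(x0_hat, y_c, y_r, arc_feats, vgg_feats):
    """L_low = L_low^c + L_low^r, Eqs. (11), (12) and (15)."""
    def one_side(target):
        arc = sum(d_arc(a, b) for a, b in zip(arc_feats(x0_hat), arc_feats(target)))
        return arc + d_vgg(vgg_feats(x0_hat), vgg_feats(target))
    return one_side(y_c) + one_side(y_r)
```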

3.3.2. High-Level Loss

The proposed high-level loss L_{high} is designed to measure the identity similarity between \hat{x}_{0|t} and y_r as well as between \hat{x}_{0|t} and y_c. Accordingly, L_{high} comprises two loss terms:

L_{high} = L_{high}^{c} + L_{high}^{r},   (16)

where L_{high}^{c} and L_{high}^{r} are defined as follows:

L_{high}^{c} = d_{arc}(u_4(\hat{x}_{0|t}), u_4(y_c)),   (17)

L_{high}^{r} = d_{arc}(u_4(\hat{x}_{0|t}), u_4(y_r)).   (18)

Here, d_{arc}(\cdot,\cdot) refers to the cosine distance metric defined in Equation (13).
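Using the d_arc helper from the previous sketch, the high-level loss of Equations (16)-(18) reduces to a few lines; arc_embed is an assumed helper returning the final ArcFace embedding u_4.

```python
def high_level_loss(x0_hat, y_c, y_r, arc_embed):
    """L_high = L_high^c + L_high^r, Eqs. (16)-(18)."""
    e = arc_embed(x0_hat)
    return d_arc(e, arc_embed(y_c)) + d_arc(e, arc_embed(y_r))
```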

3.4. The Proposed PDGrad

Inspired by PCGrad [17], we propose a gradient adjustment method for guiding diffusion models in RefBFR. Similar to PGDiff [13], the unconditional diffusion model is guided using classifier guidance. In this context, the gradient of each loss in Equation (10) acts as a specific guidance, defined by the following equation:

g_{total} = g_{low} + g_{high},   (19)

where g_{total}, g_{low}, and g_{high} denote the gradients \nabla_{\hat{x}_{0|t}} L_{total}, \nabla_{\hat{x}_{0|t}} L_{low}, and \nabla_{\hat{x}_{0|t}} L_{high}, respectively. Unlike PCGrad [17], we select a pivot gradient for each loss and, when a conflict occurs, adjust the other gradient by projecting it onto the normal plane of the pivot gradient. This prevents the interfering component from being applied along the pivot direction.
From Equation (11), g_{low} consists of g_{low}^{c} = \nabla_{\hat{x}_{0|t}} L_{low}^{c} and g_{low}^{r} = \nabla_{\hat{x}_{0|t}} L_{low}^{r}, where the former is defined using y_c and the latter using y_r. When the angle between g_{low}^{c} and g_{low}^{r} is larger than 90°, i.e., the cosine similarity between them is negative, the two gradients conflict [17]. In this case, the resultant gradient g_{low} would be sub-optimal as guidance for the guided diffusion. Thus, the proposed gradient g_{low}^{pd}, which replaces g_{low}, is defined as follows:

g_{low}^{pd} = \begin{cases} g_{low}^{c} + g_{low}^{r}, & \text{if } g_{low}^{c} \cdot g_{low}^{r} \geq 0, \\ g_{low}^{c} + k_l \cdot \hat{g}_{low}^{r}, & \text{otherwise.} \end{cases}   (20)

When the two gradients g_{low}^{c} and g_{low}^{r} do not conflict, g_{low}^{pd} is identical to the original g_{low}. When they conflict, we hypothesize that y_c contains the more reliable information for low-level features. Thus, as shown in Figure 3, we set the pivot direction to g_{low}^{c} and define \hat{g}_{low}^{r} by projecting g_{low}^{r} onto the normal plane of g_{low}^{c}, which is formulated as:

\hat{g}_{low}^{r} = g_{low}^{r} - \frac{g_{low}^{r} \cdot g_{low}^{c}}{\|g_{low}^{c}\|^2}\, g_{low}^{c}.   (21)

The weighting factor k_l in Equation (20) is defined by:

k_l = \begin{cases} 1, & \text{if } \|g_{low}^{c}\| \geq \|\hat{g}_{low}^{r}\|, \\ \dfrac{\|g_{low}^{c}\|}{\|\hat{g}_{low}^{r}\|}, & \text{otherwise.} \end{cases}   (22)

Note that when the norm of the projected gradient \hat{g}_{low}^{r} is larger than that of g_{low}^{c}, we clip the norm of \hat{g}_{low}^{r} by controlling the value of k_l. This weighting factor helps our model focus on the low-level features of y_c.
Similarly, g_{high} consists of g_{high}^{c} = \nabla_{\hat{x}_{0|t}} L_{high}^{c} and g_{high}^{r} = \nabla_{\hat{x}_{0|t}} L_{high}^{r}. For high-level features, y_r contains more suitable information than y_c. In this case, we define the modified gradient g_{high}^{pd}, which replaces g_{high}, as:

g_{high}^{pd} = \begin{cases} g_{high}^{c} + g_{high}^{r}, & \text{if } g_{high}^{c} \cdot g_{high}^{r} \geq 0, \\ k_h \cdot \hat{g}_{high}^{c} + g_{high}^{r}, & \text{otherwise.} \end{cases}   (23)

As shown in Figure 3, we set the pivot direction for g_{high} to g_{high}^{r} and define \hat{g}_{high}^{c} by projecting g_{high}^{c} onto the normal plane of g_{high}^{r}, which is formulated as:

\hat{g}_{high}^{c} = g_{high}^{c} - \frac{g_{high}^{c} \cdot g_{high}^{r}}{\|g_{high}^{r}\|^2}\, g_{high}^{r}.   (24)

The weighting factor k_h in Equation (23) is defined by:

k_h = \begin{cases} 1, & \text{if } \|g_{high}^{r}\| \geq \|\hat{g}_{high}^{c}\|, \\ \dfrac{\|g_{high}^{r}\|}{\|\hat{g}_{high}^{c}\|}, & \text{otherwise.} \end{cases}   (25)

When the norm of the projected gradient \hat{g}_{high}^{c} is larger than that of g_{high}^{r}, we clip the norm of \hat{g}_{high}^{c} by controlling the value of k_h.
Finally, the total gradient g_{total}^{pd} for guiding the diffusion model is obtained by summing g_{low}^{pd} in Equation (20) and g_{high}^{pd} in Equation (23):

g_{total}^{pd} = g_{low}^{pd} + g_{high}^{pd}.   (26)
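The pivot-direction rule of Equations (20)-(26) can be summarized in a small PyTorch helper; the epsilon terms are added for numerical safety and are not part of the paper's formulation.

```python
import torch

def pdgrad_pair(g_pivot, g_other, eps=1e-12):
    """Combine a pivot gradient and another gradient following Eqs. (20)-(25)."""
    dot = torch.sum(g_pivot * g_other)
    if dot >= 0:                                   # no conflict: plain sum
        return g_pivot + g_other
    # project g_other onto the normal plane of the pivot, Eqs. (21)/(24)
    g_proj = g_other - dot / (g_pivot.norm() ** 2 + eps) * g_pivot
    # adaptive scaling k, Eqs. (22)/(25): clip the projected norm to the pivot's norm
    k = min(1.0, (g_pivot.norm() / (g_proj.norm() + eps)).item())
    return g_pivot + k * g_proj

# Total guidance of Eq. (26): the pivot is g_low^c for the low-level pair
# and g_high^r for the high-level pair.
# g_total_pd = pdgrad_pair(g_low_c, g_low_r) + pdgrad_pair(g_high_r, g_high_c)
```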
The overall pipeline of the proposed PDGrad is described in Algorithm 1.
Algorithm 1 Restoration process of PDGrad
1: Input: a low-quality image y, reference image y_r, a diffusion model (\mu_\theta(x_t, t), \Sigma_\theta(x_t, t)), face restorer f(\cdot), gradient scale s and the initial timestep \tau
2: Output: restored image x_0
3: y_c \leftarrow f(y)
4: Sample x_\tau from q(x_\tau \mid y_c) according to Equation (8)
5: for t = \tau to 1 do
6:     \mu, \Sigma \leftarrow \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)
7:     \hat{x}_{0|t} \leftarrow \frac{1}{\sqrt{\bar{\alpha}_t}} x_t - \frac{\sqrt{1-\bar{\alpha}_t}}{\sqrt{\bar{\alpha}_t}} \epsilon_\theta(x_t, t)
8:     Compute the gradient g_{low}^{pd} according to Equation (20)
9:     Compute the gradient g_{high}^{pd} according to Equation (23)
10:    g_{total}^{pd} \leftarrow g_{low}^{pd} + g_{high}^{pd}
11:    x_{t-1} \leftarrow sample from \mathcal{N}(\mu - s\, \Sigma\, g_{total}^{pd}, \Sigma)
12: end for
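Algorithm 1 can be sketched in PyTorch as follows. The diffusion interface (eps_theta, posterior, alpha_bar) and the helpers low_grads / high_grads, which return the gradient pairs (g_low^c, g_low^r) and (g_high^c, g_high^r) with respect to \hat{x}_{0|t}, are assumptions for illustration; pdgrad_pair is the helper sketched after Equation (26).

```python
import torch

def restore(y, y_r, diffusion, restorer, low_grads, high_grads, s=0.1, tau=700):
    """Sketch of Algorithm 1 under the assumed interfaces described above."""
    y_c = restorer(y)                                                      # line 3: y_c = f(y)
    a_tau = diffusion.alpha_bar[tau]
    x = a_tau.sqrt() * y_c + (1 - a_tau).sqrt() * torch.randn_like(y_c)    # line 4, Eq. (8)

    for t in range(tau, 0, -1):
        with torch.no_grad():
            eps = diffusion.eps_theta(x, t)
            mu, sigma = diffusion.posterior(x, eps, t)                     # line 6
        a_t = diffusion.alpha_bar[t]
        x_hat = ((x - (1 - a_t).sqrt() * eps) / a_t.sqrt()).requires_grad_(True)  # line 7, Eq. (4)
        g_low = pdgrad_pair(*low_grads(x_hat, y_c, y_r))                   # line 8, Eq. (20)
        g_high_c, g_high_r = high_grads(x_hat, y_c, y_r)
        g_high = pdgrad_pair(g_high_r, g_high_c)                           # line 9, Eq. (23)
        g_total = g_low + g_high                                           # line 10, Eq. (26)
        x = mu - s * sigma * g_total + sigma.sqrt() * torch.randn_like(x)  # line 11, Eq. (9)
    return x
```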

4. Experiments

As mentioned in Li et al. [10], RefBFR methods generally outperform single-image BFR methods, since the reference image contains rich textures and the fine details lost in the given LQ image [52,53]. Hence, in this paper, we mainly compare our proposed method with recent reference-based BFR methods such as ASFFNet [11], DMDNet [12], and PGDiff [13]. Additionally, we report comparisons with single-image BFR methods, including VQFR [39], CodeFormer [16], RestoreFormer++ [40] and DifFace [44]. All experiments in this paper are conducted using the official models with pre-trained weights provided by the authors.
In this section, we provide the details of the experimental settings in Section 4.1. In Section 4.2 and Section 4.3, we compare our proposed method with state-of-the-art BFR methods through quantitative and qualitative analyses, respectively. Section 4.4 presents the ablation study that assesses the effect of each component of our proposed approach.

4.1. Experimental Setting

4.1.1. Implementation Details

Following PGDiff [13], we utilize the pre-trained diffusion model provided by Yue et al. [44] for a fair comparison. This model is an unconditional diffusion network trained on the FFHQ dataset [54] at an image resolution of 512 × 512. In our RefBFR process, we leverage CodeFormer [16] as the pre-trained face restoration model, denoted as f(·) in Equation (7), to obtain a coarsely restored image from a given LQ image. It is noteworthy that the proposed method employs off-the-shelf pre-trained networks that are readily accessible online, without the need for additional training. The proposed framework is implemented using PyTorch [55] and inference is executed on a single NVIDIA GeForce RTX 3090 GPU. Empirically, we set the initial guidance step τ to 700 and the gradient scale s to 0.1. The intermediate features {u_i(z)}_{i=1}^{3} are extracted from the conv1_1, conv2_2, and conv3_2 layers of ArcFace [1], and u_4(z) is the final output feature of ArcFace [1]. The intermediate features {v_i(z)}_{i=1}^{5} are extracted from the conv1_2, conv2_2, conv3_3, conv4_3 and conv5_2 layers of VGG16 [50].

4.1.2. Datasets

To evaluate our method, we use the CelebRef-HQ dataset [12], which comprises a total of 10,555 HQ face images covering 1005 distinct identities, with between 2 and 21 images per individual. For our evaluation, we randomly select two images from each of the 1005 identities: one is designated as the ground-truth HQ image, while the other serves as the reference HQ image. Following the degradation model specified in recent BFR studies [16,27,39,56], the LQ images are synthesized as follows:

y = \big\{ \big[ (x_{GT} \circledast k_\sigma) \downarrow_r + n_\delta \big]_{\mathrm{JPEG}_q} \big\} \uparrow_r,   (27)

where the ground-truth HQ image x_{GT} is first blurred with a Gaussian kernel k_\sigma and then downsampled by a scale factor r. Next, Gaussian noise n_\delta is added, followed by JPEG compression with quality factor q. Lastly, the LQ image y is resized back to 512 × 512. In this paper, we randomly sample σ, r, δ, and q from [0.1, 15], [24, 40], [0, 20], and [30, 100], respectively.
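A possible implementation of the degradation pipeline in Equation (27) is sketched below using OpenCV and NumPy. The kernel size, resampling filters and clipping details are assumptions; the original pipeline may differ.

```python
import cv2
import numpy as np

def synthesize_lq(x_gt, sigma, r, delta, q, size=512):
    """Synthesize an LQ image from an HQ image x_gt (HxWx3, uint8), Eq. (27)."""
    x = x_gt.astype(np.float32)
    x = cv2.GaussianBlur(x, ksize=(0, 0), sigmaX=sigma)                 # blur with k_sigma
    x = cv2.resize(x, (int(size // r), int(size // r)),
                   interpolation=cv2.INTER_LINEAR)                      # downsample by factor r
    x = np.clip(x + np.random.randn(*x.shape) * delta, 0, 255)          # add Gaussian noise n_delta
    ok, enc = cv2.imencode(".jpg", x.astype(np.uint8),
                           [cv2.IMWRITE_JPEG_QUALITY, int(q)])          # JPEG compression, quality q
    x = cv2.imdecode(enc, cv2.IMREAD_COLOR)
    return cv2.resize(x, (size, size), interpolation=cv2.INTER_LINEAR)  # resize back to 512x512
```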

4.1.3. Evaluation Metrics

For a quantitative evaluation, we employ PSNR, SSIM [57] and NIQE [58], which are commonly used metrics in the image restoration field. Additionally, we measure LPIPS [50] to assess the perceptual similarity between ground-truth images and restored images. Furthermore, FID [59] is used to quantify the distance between the feature distributions of an HQ face dataset and the restored images; we employ the CelebRef-HQ dataset [12] to measure the feature distribution of the HQ face dataset. To measure the similarity in facial identity between the ground-truth images and the restored images, we compute the angle between their embedding vectors using ArcFace [1], denoted as Deg [39]. We also compare the landmark distance (LMD) [39], which is calculated as the average L2 distance of 98 facial landmarks, predicted using Awing [60], between the ground-truth images and the restored images.
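For reference, the identity metric Deg can be computed as the angle, in degrees, between the ArcFace embeddings of the ground-truth and restored images; the sketch below assumes the embeddings are already extracted as 1-D arrays.

```python
import numpy as np

def identity_degree(emb_gt, emb_restored):
    """Deg: angle (degrees) between two ArcFace embedding vectors."""
    cos = np.dot(emb_gt, emb_restored) / (
        np.linalg.norm(emb_gt) * np.linalg.norm(emb_restored))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```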

4.2. Quantitative Comparison

The quantitative comparisons of various RefBFR methods are shown in Table 1. Here, ASFFNet [11] and DMDNet [12] are face restoration methods that incorporate landmark estimation procedures. However, for certain LQ input images, these methods struggle with accurate landmark detection, preventing them from generating results and rendering testing infeasible. Therefore, to ensure a fair comparison between methods, we conduct experiments on 662 LQ input images, a subset of the CelebRef-HQ dataset [12] on which testing with ASFFNet [11] and DMDNet [12] is feasible, as shown in Table 1. The results demonstrate that the proposed PDGrad achieves better performance than the other methods in terms of LPIPS, Deg, LMD and NIQE. PDGrad achieves a (0.4508 − 0.4437)/0.4508 = 1.57% better result in terms of LPIPS compared to the second-best competitive method, indicating that it consistently restores faces with perceptual quality closest to the ground truth. For fidelity, the proposed PDGrad achieves the highest performance in terms of Deg and LMD, with improvements over the second-best method of (55.53 − 53.1)/55.53 = 4.38% and (6.25 − 6.01)/6.25 = 3.84%, respectively. This demonstrates that our method can accurately recover facial identity and details. Additionally, in terms of image quality metrics such as NIQE, the proposed PDGrad outperforms DMDNet [12] by (3.85 − 3.38)/3.85 = 12.21% and produces more realistic details. This can be attributed to the incorporation of (1) the proposed loss function that considers perceptual quality and identity preservation and (2) the proposed gradient adjustment procedure that effectively handles conflicting gradients.
In Table 2, we further compare the proposed PDGrad with PGDiff [13] on the full CelebRef-HQ dataset [12], which consists of 1005 LQ input images. Compared to PGDiff, PDGrad performs better on the LPIPS, Deg, LMD and NIQE metrics. Notably, we achieve improvements of (56.68 − 53.90)/56.68 = 4.9% and (4.32 − 3.39)/4.32 = 21.53% in Deg and NIQE, respectively. This indicates that PDGrad generates face images that are more faithful to the ground-truth identity while maintaining high image quality during the guided diffusion process.
Table 3 provides a quantitative comparison between the proposed PDGrad and single-image BFR methods on the full CelebRef-HQ dataset [12]. The results demonstrate that our method achieves better or at least comparable performance in terms of LPIPS, Deg, LMD, NIQE and FID. While CodeFormer [16] achieves higher perceptual quality than the other methods according to LPIPS, its identity similarity is significantly compromised according to Deg. Although our proposed PDGrad is slightly worse in LPIPS compared to CodeFormer [16], it excels at preserving identity similarity, as measured by Deg. Notably, our method demonstrates a significant improvement of (68.3 − 53.9)/68.3 = 21.08% in Deg compared to the second-best model.

4.3. Qualitative Comparison

Visual comparisons of RefBFR and single-image BFR methods are presented in Figure 4 and Figure 5, respectively. In each figure, the even-numbered rows provide close-up views of the areas indicated by red rectangles in the LQ inputs of the corresponding images in the odd-numbered rows. Figure 4 demonstrates that ASFFNet [11] and DMDNet [12] fail to preserve the identity and to produce a proper facial shape. Specifically, components such as the eyes are mostly restored, while most other components, such as the nose and mouth, remain poorly reconstructed. PGDiff [13] is able to produce high-quality images, but it lacks the facial details needed to preserve identity. Unlike the other methods, our PDGrad generates high-quality images with high fidelity in skin texture, wrinkles and eye shape.
In Figure 5, VQFR [39] and RestoreFormer++ [40] fail to produce satisfactory restoration results under severe degradations; their results contain artifacts and lack facial details. CodeFormer [16], DifFace [44] and PMRF [41] produce high-quality images, but they also lack the facial details important for preserving identity. In contrast, the proposed PDGrad exhibits superior performance over all other methods in restoring sharp and fine details of the face (e.g., the eyes, nose and mouth). Moreover, the proposed method generates identity-preserving results consistent with the GT while also improving the perceptual quality of the image.

4.4. Ablation Study

We conducted ablation studies to investigate the impact of each component in the proposed PDGrad. First, Table 4 presents the effects of the gradient adjustment components in PDGrad, summarizing the configurations and results for each experiment. All the methods in Table 4 use the same network architectures and a loss function defined as the sum of Equations (11) and (16), with the only difference being the gradient used to guide the diffusion model. A1 in Table 4 represents a baseline model, where the total gradient in Equation (19) is obtained by simply summing the multiple gradients without any adjustment; the model is then guided by this gradient during the diffusion sampling process. A2 is a method that resolves gradient conflicts by adjusting only the gradient direction, achieved by projecting the conflicting gradient onto the normal plane of the pivot gradient. In A2, the values of both k_l and k_h in Equations (20) and (23) are fixed to 1 for all cases. Compared to A1, A2 shows improvements of (0.4513 − 0.4499)/0.4513 = 0.31% in LPIPS, (59.24 − 53.90)/59.24 = 9.01% in Deg, (6.49 − 6.36)/6.49 = 2% in LMD and (3.44 − 3.39)/3.44 = 1.45% in NIQE. These improvements highlight that adjusting the gradient direction in PDGrad effectively mitigates the conflicts between gradients arising from multiple losses; consequently, the diffusion process is guided more efficiently, enhancing the quality of the generated images. PDGrad is our proposed method, built upon A2 by additionally applying adaptive scaling of k_l and k_h to ensure that the magnitude of the projected gradient does not exceed that of the pivot gradient. This adaptive scaling is designed to enhance the influence of the pivot gradient during gradient adjustment. As a result, PDGrad not only adjusts the gradient direction towards the pivot, but also preserves the influence of the pivot gradient by adjusting the magnitude of the other gradients, leading to restored images with improved perceptual quality and enhanced fidelity. Consequently, compared to A2, PDGrad shows further improvements of (0.4506 − 0.4499)/0.4506 = 0.16% in LPIPS and (54.02 − 53.90)/54.02 = 0.22% in Deg.
To effectively guide detailed facial information, PDGrad defines the low-level loss by combining two components, d_arc and d_vgg, as shown in Equations (12) and (15). As shown in Table 5, to evaluate the impact of each component, we performed an additional ablation study using either d_arc or d_vgg alone in Equations (12) and (15). The experiment varies only the low-level loss component, while gradient adjustment, including gradient projection and adaptive scaling, is applied throughout. A3 shows the results of using d_arc exclusively in both Equations (12) and (15), while A4 shows the results of using d_vgg exclusively. Compared to A3 and A4, PDGrad improves LPIPS by (0.4616 − 0.4499)/0.4616 = 2.53% and (0.4632 − 0.4499)/0.4632 = 2.87%, respectively, and improves the Deg score by (60.60 − 53.90)/60.60 = 11.06% and (61.87 − 53.90)/61.87 = 12.88%, respectively. These improvements indicate that using both d_arc and d_vgg together is more effective in guiding perceptual and fidelity information than using either component alone.
As shown in Table 6, we conducted further experiments to explore the effect of the input images by using either the coarsely restored image y_c from CodeFormer [16] or the reference image y_r individually for gradient computation in the diffusion sampling process. This setup represents an extreme scenario in which the two input gradients derived from y_c and y_r are aligned in the same direction, which occurs when y_c and y_r are identical and yields the same result as using either image alone. In Table 6, A5 represents the case of using only y_c, where g_low^pd and g_high^pd in Equations (20) and (23) are set to g_low^c and g_high^c, respectively. Similarly, A6 represents the case of using only y_r, where g_low^pd and g_high^pd are set to g_low^r and g_high^r, respectively. The results confirm that PDGrad, which utilizes both y_c and y_r, outperforms the other models across most metrics.

5. Conclusions

In this paper, we presented Pivot Direction Gradient guidance (PDGrad), a novel gradient adjustment method designed to enhance reference-based blind face restoration within the guided diffusion framework. By focusing on the issue of conflicting gradients in multi-loss-based guidance, the proposed method aligns gradients across different feature levels, ensuring that both low-level and high-level facial characteristics are accurately restored. Through comprehensive experiments, we have demonstrated that the proposed method consistently outperforms existing methods, offering a robust solution for reference-based blind face restoration. This advancement highlights the potential of gradient adjustment techniques for guided diffusion models and the broader image restoration field.

Author Contributions

Conceptualization, G.M., T.B.L. and Y.S.H.; software, G.M. and T.B.L.; validation, Y.S.H.; investigation, G.M.; writing—original draft preparation, G.M., T.B.L. and Y.S.H.; writing—review and editing, Y.S.H.; supervision, Y.S.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education under Grant 2022R1F1A1065702 and in part by the Institute of Information & Communications Technology Planning & Evaluation (IITP) under the Artificial Intelligence Convergence Innovation Human Resources Development (IITP-2024-RS-2023-00255968) grant funded by the Korean government (MSIT).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive comments and recommendations.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 4690–4699. [Google Scholar]
  2. Sun, Y.; Cheng, C.; Zhang, Y.; Zhang, C.; Zheng, L.; Wang, Z.; Wei, Y. Circle loss: A unified perspective of pair similarity optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6398–6407. [Google Scholar]
  3. Li, J.; Zhang, B.; Wang, Y.; Tai, Y.; Zhang, Z.; Wang, C.; Li, J.; Huang, X.; Xia, Y. ASFD: Automatic and scalable face detector. In Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 2139–2147. [Google Scholar]
  4. Deng, J.; Guo, J.; Zhou, Y.; Yu, J.; Kotsia, I.; Zafeiriou, S. Retinaface: Single-stage dense face localisation in the wild. arXiv 2019, arXiv:1905.00641. [Google Scholar]
  5. Qi, D.; Tan, W.; Yao, Q.; Liu, J. YOLO5Face: Why reinventing a face detector. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 228–244. [Google Scholar]
  6. Kuprashevich, M.; Tolstykh, I. Mivolo: Multi-input transformer for age and gender estimation. In Proceedings of the International Conference on Analysis of Images, Social Networks and Texts, Yerevan, Armenia, 28–30 September 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 212–226. [Google Scholar]
  7. Shin, N.H.; Lee, S.H.; Kim, C.S. Moving window regression: A novel approach to ordinal regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 18760–18769. [Google Scholar]
  8. Dogan, B.; Gu, S.; Timofte, R. Exemplar guided face image super-resolution without facial landmarks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
  9. Li, X.; Liu, M.; Ye, Y.; Zuo, W.; Lin, L.; Yang, R. Learning warped guidance for blind face restoration. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 272–289. [Google Scholar]
  10. Li, X.; Chen, C.; Zhou, S.; Lin, X.; Zuo, W.; Zhang, L. Blind face restoration via deep multi-scale component dictionaries. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 399–415. [Google Scholar]
  11. Li, X.; Li, W.; Ren, D.; Zhang, H.; Wang, M.; Zuo, W. Enhanced blind face restoration with multi-exemplar images and adaptive spatial feature fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 2706–2715. [Google Scholar]
  12. Li, X.; Zhang, S.; Zhou, S.; Zhang, L.; Zuo, W. Learning dual memory dictionaries for blind face restoration. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5904–5917. [Google Scholar] [CrossRef] [PubMed]
  13. Yang, P.; Zhou, S.; Tao, Q.; Loy, C.C. PGDiff: Guiding Diffusion Models for Versatile Face Restoration via Partial Guidance. Adv. Neural Inf. Process. Syst. 2024, 36, 1–21. [Google Scholar]
  14. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  15. Dhariwal, P.; Nichol, A. Diffusion models beat gans on image synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 8780–8794. [Google Scholar]
  16. Zhou, S.; Chan, K.; Li, C.; Loy, C.C. Towards robust blind face restoration with codebook lookup transformer. Adv. Neural Inf. Process. Syst. 2022, 35, 30599–30611. [Google Scholar]
  17. Yu, T.; Kumar, S.; Gupta, A.; Levine, S.; Hausman, K.; Finn, C. Gradient surgery for multi-task learning. Adv. Neural Inf. Process. Syst. 2020, 33, 5824–5836. [Google Scholar]
  18. Chen, Y.; Tai, Y.; Liu, X.; Shen, C.; Yang, J. Fsrnet: End-to-end learning face super-resolution with facial priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2492–2501. [Google Scholar]
  19. Kim, D.; Kim, M.; Kwon, G.; Kim, D.S. Progressive face super-resolution via attention to facial landmark. arXiv 2019, arXiv:1908.08239. [Google Scholar]
  20. Shen, Z.; Lai, W.S.; Xu, T.; Kautz, J.; Yang, M.H. Deep semantic face deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8260–8269. [Google Scholar]
  21. Chen, C.; Li, X.; Yang, L.; Lin, X.; Zhang, L.; Wong, K.Y.K. Progressive semantic-aware style transformation for blind face restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 11896–11905. [Google Scholar]
  22. Lee, T.B.; Jung, S.H.; Heo, Y.S. Progressive semantic face deblurring. IEEE Access 2020, 8, 223548–223561. [Google Scholar] [CrossRef]
  23. Han, S.; Lee, T.B.; Heo, Y.S. Semantic-Aware Face Deblurring with Pixel-Wise Projection Discriminator. IEEE Access 2023, 11, 11587–11600. [Google Scholar] [CrossRef]
  24. Ren, W.; Yang, J.; Deng, S.; Wipf, D.; Cao, X.; Tong, X. Face video deblurring using 3D facial priors. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9388–9397. [Google Scholar]
  25. Hu, X.; Ren, W.; LaMaster, J.; Cao, X.; Li, X.; Li, Z.; Menze, B.; Liu, W. Face super-resolution guided by 3d facial priors. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part IV 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 763–780. [Google Scholar]
  26. Zhu, F.; Zhu, J.; Chu, W.; Zhang, X.; Ji, X.; Wang, C.; Tai, Y. Blind face restoration via integrating face shape and generative priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 7662–7671. [Google Scholar]
  27. Wang, X.; Li, Y.; Zhang, H.; Shan, Y. Towards real-world blind face restoration with generative facial prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 9168–9178. [Google Scholar]
  28. Lau, Y.F.; Zhang, T.; Rao, Z.; Chen, Q. ENTED: Enhanced Neural Texture Extraction and Distribution for Reference-based Blind Face Restoration. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2024; pp. 5162–5171. [Google Scholar]
  29. Varanka, T.; Toivonen, T.; Tripathy, S.; Zhao, G.; Acar, E. PFStorer: Personalized Face Restoration and Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 2372–2381. [Google Scholar]
  30. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 139–144. [Google Scholar]
  31. Van Den Oord, A.; Vinyals, O. Neural discrete representation learning. Adv. Neural Inf. Process. Syst. 2017, 30, 6309–6318. [Google Scholar]
  32. Esser, P.; Rombach, R.; Ommer, B. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 12873–12883. [Google Scholar]
  33. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  34. Song, J.; Meng, C.; Ermon, S. Denoising Diffusion Implicit Models. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  35. Nichol, A.Q.; Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning (PMLR), Virtual, 18–24 July 2021; pp. 8162–8171. [Google Scholar]
  36. Menon, S.; Damian, A.; Hu, S.; Ravi, N.; Rudin, C. Pulse: Self-supervised photo upsampling via latent space exploration of generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 2437–2445. [Google Scholar]
  37. Gu, J.; Shen, Y.; Zhou, B. Image processing using multi-code gan prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 3012–3021. [Google Scholar]
  38. Yang, T.; Ren, P.; Xie, X.; Zhang, L. Gan prior embedded network for blind face restoration in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 672–681. [Google Scholar]
  39. Gu, Y.; Wang, X.; Xie, L.; Dong, C.; Li, G.; Shan, Y.; Cheng, M.M. Vqfr: Blind face restoration with vector-quantized dictionary and parallel decoder. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 126–143. [Google Scholar]
  40. Wang, Z.; Zhang, J.; Chen, T.; Wang, W.; Luo, P. RestoreFormer++: Towards real-world blind face restoration from undegraded key-value pairs. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 15462–15476. [Google Scholar] [CrossRef]
  41. Ohayon, G.; Michaeli, T.; Elad, M. Posterior-Mean Rectified Flow: Towards Minimum MSE Photo-Realistic Image Restoration. arXiv 2024, arXiv:2410.00418. [Google Scholar]
  42. Kawar, B.; Elad, M.; Ermon, S.; Song, J. Denoising diffusion restoration models. Adv. Neural Inf. Process. Syst. 2022, 35, 23593–23606. [Google Scholar]
  43. Wang, Y.; Yu, J.; Zhang, J. Zero-shot image restoration using denoising diffusion null-space model. arXiv 2022, arXiv:2212.00490. [Google Scholar]
  44. Yue, Z.; Loy, C.C. Difface: Blind face restoration with diffused error contraction. arXiv 2022, arXiv:2212.06512. [Google Scholar] [CrossRef]
  45. Fei, B.; Lyu, Z.; Pan, L.; Zhang, J.; Yang, W.; Luo, T.; Zhang, B.; Dai, B. Generative diffusion prior for unified image restoration and enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 9935–9946. [Google Scholar]
  46. Wang, Z.; Zhang, Z.; Zhang, X.; Zheng, H.; Zhou, M.; Zhang, Y.; Wang, Y. Dr2: Diffusion-based robust degradation remover for blind face restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 1704–1713. [Google Scholar]
  47. Suin, M.; Nair, N.G.; Lau, C.P.; Patel, V.M.; Chellappa, R. Diffuse and Restore: A Region-Adaptive Diffusion Model for Identity-Preserving Blind Face Restoration. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2024; pp. 6343–6352. [Google Scholar]
  48. Lu, X.; Hu, X.; Luo, J.; Ren, W. 3D Priors-Guided Diffusion for Blind Face Restoration. In Proceedings of the ACM Multimedia, Melbourne, Australia, 28 October–1 November 2024. [Google Scholar]
  49. Chung, H.; Sim, B.; Ye, J.C. Come-closer-diffuse-faster: Accelerating conditional diffusion models for inverse problems through stochastic contraction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 12413–12422. [Google Scholar]
  50. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 586–595. [Google Scholar]
  51. Peng, X.; Zhang, X.; Li, Y.; Liu, B. Research on image feature extraction and retrieval algorithms based on convolutional neural network. J. Vis. Commun. Image Represent. 2020, 69, 102705. [Google Scholar] [CrossRef]
52. Zheng, H.; Ji, M.; Wang, H.; Liu, Y.; Fang, L. CrossNet: An end-to-end reference-based super resolution network using cross-scale warping. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 88–104. [Google Scholar]
  53. Zhang, Z.; Wang, Z.; Lin, Z.; Qi, H. Image super-resolution by neural texture transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 7982–7991. [Google Scholar]
  54. Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 4401–4410. [Google Scholar]
55. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 8026–8037. [Google Scholar]
56. Wang, Z.; Zhang, J.; Chen, R.; Wang, W.; Luo, P. RestoreFormer: High-quality blind face restoration from undegraded key-value pairs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 17512–17521. [Google Scholar]
  57. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  58. Mittal, A.; Soundararajan, R.; Bovik, A.C. Making a “completely blind” image quality analyzer. IEEE Signal Process. Lett. 2012, 20, 209–212. [Google Scholar] [CrossRef]
59. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Adv. Neural Inf. Process. Syst. 2017, 30, 6629–6640. [Google Scholar]
  60. Wang, X.; Bo, L.; Fuxin, L. Adaptive wing loss for robust face alignment via heatmap regression. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6971–6981. [Google Scholar]
Figure 1. Example of restoration results for RefBFR. To obtain a coarsely restored image y_c from the LQ input image, CodeFormer [16] is used as a restorer. Unlike PGDiff [13], whose results are significantly affected by the quality of y_c, the proposed PDGrad mitigates this dependence.
Figure 2. Overview of the proposed method. During the sampling process, the gradients are carefully adjusted by our PDGrad technique to prevent conflicts between gradients. This ensures that the diffusion process is efficiently guided, optimizing the quality and stability of the generated images.
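To make the guidance step in Figure 2 concrete, the sketch below shows one way a single reverse-diffusion step can be nudged by loss gradients. It is a minimal illustration under assumed names: ddpm_step stands for an arbitrary unguided reverse step of a pretrained face diffusion model, losses for the per-property loss terms (e.g., fidelity to the coarse restoration and to the reference), and guidance_scale for a hand-tuned step size; none of these names come from the authors' implementation.

```python
import torch

def guided_sampling_step(x_t, t, ddpm_step, losses, combine=None, guidance_scale=0.1):
    """One gradient-guided reverse-diffusion step (illustrative sketch only).

    losses:  list of callables, each mapping (x_t, t) to a scalar loss encoding one
             desired property of the external images.
    combine: rule for merging per-loss gradients; if None, gradients are simply summed.
             A PDGrad-style combination rule is sketched after the caption of Figure 3.
    """
    x_t = x_t.detach().requires_grad_(True)

    # Compute one gradient per loss term.
    grads = [torch.autograd.grad(loss_fn(x_t, t), x_t)[0] for loss_fn in losses]

    # Merge the gradients, resolving conflicts when a combine rule is supplied.
    merged = grads[0]
    for g in grads[1:]:
        merged = combine(merged, g) if combine is not None else merged + g

    # Take the unguided reverse step, then move against the merged loss gradient.
    with torch.no_grad():
        return ddpm_step(x_t, t) - guidance_scale * merged
```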
Figure 3. The proposed PDGrad. We illustrate an example of calculating the proposed gradient g_pd from two input gradients g_1 and g_2, where g_1 (the red arrow) is taken as the pivot gradient without loss of generality. In (a), when g_1 and g_2 do not conflict, the resultant gradient is the simple sum of the two, g_pd = g_1 + g_2. In (b), g_1 and g_2 exhibit conflicting directions; g_2 is projected onto the normal plane of the pivot gradient, yielding ĝ_2, whose magnitude is smaller than that of g_1, and g_pd = g_1 + ĝ_2. In (c), if the magnitude of ĝ_2 is larger than that of the pivot gradient g_1, it is adjusted by a scaling factor k so that it does not exceed that of g_1, resulting in g_pd = g_1 + k·ĝ_2.
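The case analysis in the caption above maps directly onto a short gradient-combination routine. The following PyTorch-style function is a sketch reconstructed from the caption alone; the conflict test via the inner product and the exact form of the scaling factor k are our reading of cases (a)–(c), and names such as pivot and other are illustrative rather than taken from the authors' code.

```python
import torch

def pdgrad_combine(pivot: torch.Tensor, other: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Combine a pivot gradient g_1 with another gradient g_2 as illustrated in Figure 3."""
    g1, g2 = pivot.flatten(), other.flatten()

    # Case (a): no conflict (non-negative inner product) -> simple sum g_pd = g_1 + g_2.
    if torch.dot(g1, g2) >= 0:
        return pivot + other

    # Case (b): conflicting directions -> project g_2 onto the normal plane of g_1.
    g2_hat = g2 - (torch.dot(g1, g2) / (g1.norm().pow(2) + eps)) * g1

    # Case (c): if the projected gradient exceeds the pivot in magnitude,
    # rescale it by k so that it does not exceed g_1.
    k = min(1.0, (g1.norm() / (g2_hat.norm() + eps)).item())

    return pivot + k * g2_hat.view_as(other)
```

In a multi-loss setting, such a routine would be applied per feature level with the corresponding pivot gradient before the merged gradient is fed back into a guided sampling step like the one sketched after Figure 2.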
Figure 4. Qualitative comparison of RefBFR methods on the CelebRef-HQ dataset [12]. For a better comparison of visual quality, zooming-in is recommended. From left to right, LQ input image, reference image, ASFFNet [11], DMDNet [12], PGDiff [13], the proposed PDGrad and ground truth (GT).
Figure 5. Qualitative comparison of single-image BFR methods on the CelebRef-HQ dataset [12]. For a better comparison of visual quality, zooming-in is recommended. From left to right, LQ input image, reference image, VQFR [39], CodeFormer [16], RestoreFormer++ [40], DifFace [44], PMRF [41], the proposed PDGrad and ground truth (GT).
Table 1. Quantitative comparison of reference-based BFR methods on a subset of the CelebRef-HQ dataset [12]. The symbols ↑ and ↓ indicate that higher and lower values are better, respectively. The best and second-best results are highlighted in bold and underline.

Methods | LPIPS ↓ | Deg ↓ | LMD ↓ | NIQE ↓ | FID ↓ | PSNR ↑ | SSIM ↑
ASFFNet [11] | 0.4850 | 76.43 | 27.82 | 4.67 | 56.84 | 19.00 | 0.5759
DMDNet [12] | 0.5174 | 77.92 | 24.23 | 3.85 | 80.31 | 19.12 | 0.5374
PGDiff [13] | 0.4508 | 55.53 | 6.25 | 4.29 | 26.58 | 19.15 | 0.5917
PDGrad | 0.4437 | 53.10 | 6.01 | 3.38 | 27.28 | 18.33 | 0.5332
Table 2. Quantitative comparison with PGDiff on the CelebRef-HQ dataset [12]. The symbols ↑ and ↓ indicate that higher and lower values are better, respectively. The best and second-best results are highlighted in bold and underline.

Methods | LPIPS ↓ | Deg ↓ | LMD ↓ | NIQE ↓ | FID ↓ | PSNR ↑ | SSIM ↑
PGDiff [13] | 0.4584 | 56.68 | 6.75 | 4.32 | 26.58 | 18.99 | 0.5936
PDGrad | 0.4499 | 53.90 | 6.36 | 3.39 | 27.28 | 18.24 | 0.5368
Table 3. Quantitative comparison of single-image BFR methods on the CelebRef-HQ dataset [12]. The symbols ↑ and ↓ indicate that higher and lower values are better, respectively. The best and second-best results are highlighted in bold and underline.

Methods | LPIPS ↓ | Deg ↓ | LMD ↓ | NIQE ↓ | FID ↓ | PSNR ↑ | SSIM ↑
VQFR [39] | 0.5332 | 79.19 | 13.93 | 3.48 | 99.85 | 18.32 | 0.4881
CodeFormer [16] | 0.4426 | 70.07 | 7.81 | 5.02 | 34.57 | 18.90 | 0.5828
RestoreFormer++ [40] | 0.5401 | 78.94 | 14.03 | 3.97 | 112.30 | 18.62 | 0.4870
DifFace [44] | 0.4510 | 68.30 | 6.25 | 4.89 | 28.12 | 20.93 | 0.6334
PMRF [41] | 0.4552 | 69.12 | 6.79 | 4.35 | 25.43 | 20.30 | 0.6116
PDGrad | 0.4499 | 53.90 | 6.36 | 3.39 | 27.28 | 18.24 | 0.5368
Table 4. Ablation study on the gradient-adjustment components of the proposed PDGrad using the CelebRef-HQ dataset [12]. The symbols ↑ and ↓ indicate that higher and lower values are better, respectively. The best and second-best results are highlighted in bold and underline.

Ablation | Gradient Projection | Adaptive Scaling | LPIPS ↓ | Deg ↓ | LMD ↓ | NIQE ↓ | FID ↓ | PSNR ↑ | SSIM ↑
A1 | | | 0.4513 | 59.24 | 6.49 | 3.44 | 26.70 | 18.24 | 0.5434
A2 | | | 0.4506 | 54.02 | 6.36 | 3.36 | 27.43 | 18.23 | 0.5348
PDGrad | ✓ | ✓ | 0.4499 | 53.90 | 6.36 | 3.39 | 27.28 | 18.24 | 0.5368
Table 5. Ablation study on the components of the loss function in the proposed PDGrad using the CelebRef-HQ dataset [12]. The symbols ↑ and ↓ indicate that higher and lower values are better, respectively. The best and second-best results are highlighted in bold and underline.

Ablation | d_arc | d_vgg | LPIPS ↓ | Deg ↓ | LMD ↓ | NIQE ↓ | FID ↓ | PSNR ↑ | SSIM ↑
A3 | | | 0.4616 | 60.60 | 6.53 | 3.76 | 26.42 | 18.32 | 0.5720
A4 | | | 0.4632 | 61.87 | 6.92 | 3.24 | 31.19 | 17.59 | 0.5165
PDGrad | ✓ | ✓ | 0.4499 | 53.90 | 6.36 | 3.39 | 27.28 | 18.24 | 0.5368
Table 6. Ablation study on the input images used in the proposed PDGrad with the CelebRef-HQ dataset [12]. The symbols ↑ and ↓ indicate that higher and lower values are better, respectively. The best and second-best results are highlighted in bold and underline.

Ablation | CodeFormer (y_c) | Reference (y_r) | LPIPS ↓ | Deg ↓ | LMD ↓ | NIQE ↓ | FID ↓ | PSNR ↑ | SSIM ↑
A5 | | | 0.4563 | 70.76 | 6.95 | 3.59 | 25.41 | 18.10 | 0.5474
A6 | | | 0.5126 | 56.03 | 7.40 | 3.27 | 34.15 | 17.41 | 0.4988
PDGrad | ✓ | ✓ | 0.4499 | 53.90 | 6.36 | 3.39 | 27.28 | 18.24 | 0.5368
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
