DPO-ESRGAN: Perceptually Enhanced Super-Resolution Using Direct Preference Optimization

Yun, Wonwoo; Park, Hanhoon

doi:10.3390/electronics14173357

by Wonwoo Yun ¹ and Hanhoon Park ¹,²,*

¹ Division of Electronics and Communications Engineering, Pukyong National University, 45 Yongso-ro, Nam-gu, Busan 48513, Republic of Korea

² Department of Artificial Intelligence Convergence, Graduate School, Pukyong National University, 45 Yongso-ro, Nam-gu, Busan 48513, Republic of Korea

* Author to whom correspondence should be addressed.

Electronics 2025, 14(17), 3357; https://doi.org/10.3390/electronics14173357

Submission received: 25 July 2025 / Revised: 19 August 2025 / Accepted: 22 August 2025 / Published: 23 August 2025

(This article belongs to the Special Issue Recent Advances and Applications of Machine Learning in Pattern Recognition)

Abstract

Super-resolution (SR) is a long-standing task in the field of computer vision that aims to improve the quality and resolution of an image. ESRGAN is a representative generative adversarial network specialized to produce perceptually convincing SR images. However, it often fails to recover local details and still produces blurry or unnatural visual artifacts, yielding SR images that people do not prefer. To address this problem, we propose to adopt Direct Preference Optimization (DPO), which was originally devised to fine-tune large language models based on human preferences. To this end, we develop a method for applying DPO to ESRGAN and add a DPO loss for training the ESRGAN generator. ×4 SR experiments on benchmark datasets demonstrate that the proposed method produces SR images with a significantly higher perceptual quality and higher human preference than ESRGAN and other ESRGAN variants that have modified the loss or network structure of ESRGAN. Specifically, when compared to ESRGAN, the proposed method achieved, on average, 0.32 lower PieAPP values, 0.79 lower NIQE values, and 0.05 higher PSNR values on the BSD100 dataset, as well as 0.32 lower PieAPP values, 0.32 lower NIQE values, and 0.17 higher PSNR values on the Set14 dataset.

Keywords: image super-resolution; ESRGAN; direct preference optimization; PieAPP; LPIPS

1. Introduction

Super-resolution (SR) aims to improve the quality and resolution of an image and is used as a pre-processing step in various vision tasks such as remote sensing, surveillance, microscopy, and medical imaging [1]. Recent approaches are based on deep learning and are implemented using various network architectures such as convolutional neural networks (CNNs), generative adversarial networks (GANs), Transformers, and diffusion models. Compared to the others, GAN-based approaches have shown more promising results on many benchmark datasets [1,2], drawing more attention in the computer vision and graphics communities.

ESRGAN [3], which is the most well-known GAN-based approach for SR, is capable of generating SR images with significantly improved perceptual quality compared to existing SR approaches. However, while the adversarial and perceptual losses are useful for understanding local or global contexts, they hinder the recovery of pixel-level details. This causes unnatural visual artifacts, resulting in SR images that people do not prefer. To reduce these undesirable artifacts, ESRGAN performed network interpolation as follows: $\pi_{\theta} = (1 - \alpha)\,\pi_{\theta}^{\mathrm{PSNR}} + \alpha\,\pi_{\theta}^{\mathrm{GAN}}$, where $\pi_{\theta}^{\mathrm{PSNR}}$ is a pre-trained model obtained by minimizing L1 loss and $\pi_{\theta}^{\mathrm{GAN}}$ is a GAN model obtained by fine-tuning $\pi_{\theta}^{\mathrm{PSNR}}$ using the adversarial and perceptual losses. However, this approach is not an effective solution and causes the SR images to blur. Various attempts have been made to improve the performance of ESRGAN by modifying loss functions or network architectures [4,5,6]. They performed better on some perceptual metrics than ESRGAN, but did not directly consider human preferences in the training process. Therefore, they still suffer from the problem of generating low-preference SR images.

In this study, our goal is to develop an effective method for training SR models to generate images that people do prefer (thus, with significantly improved perceptual quality). Inspired by Direct Preference Optimization (DPO), a learning method for fine-tuning language models based on human preferences [7], we propose a DPO-based loss function to reflect human preferences in the training process of SR models. The problem is that SR models including ESRGAN have different frameworks from language models, which means that applying DPO to SR models is not straightforward. To this end, we review the DPO formula in detail and describe how to modify it to apply it to SR models.

Our study is the first attempt to directly apply DPO to SR models. We adopt ESRGAN as the baseline model, but our method can be applied to other SR models with different network architectures without modification.

2. Related Works

2.1. Image Super-Resolution

SR is the process of enhancing a low-resolution (LR) image by increasing its resolution to reveal the underlying high-resolution (HR) image. SR is inherently ill-posed because the same LR image can be generated from different HR images. Nevertheless, deep learning has made this problem far more tractable and revolutionized the SR process. Early deep learning-based SR methods used L1 or L2 loss, achieving higher PSNR, a common quantitative image quality metric. However, PSNR is known to be poorly correlated with human visual perception [8], and PSNR-oriented SR produces blurry results because it does not encourage the creation of high-frequency details.

Focusing more on the perceptual quality of generated images, GANs, consisting of a generator network and a discriminator network that are trained in an adversarial manner, have proven to be particularly effective for SR. SRGAN [9] introduced GAN into SR for the first time, generating more visually convincing SR images at the expense of PSNR. ESRGAN [3], an improvement over SRGAN, further increased the perceptual quality of SR images by modifying the SRGAN generator, discriminator, and loss function. Since then, ESRGAN has evolved in two directions. First, more advanced network architectures and more sophisticated loss functions have been introduced [4,5,6,10,11]. Second, to address the problem of poor performance on real-world images, more accurate and realistic image degradation models have been explored [12,13]. New attempts in both directions are still being reported.

Since SwinIR [14], the first attempt to introduce the Transformer to SR, was reported, Transformers have also been applied to SR. Owing to their effectiveness in learning the global context of images, Transformers have helped generate SR images with more plausible and natural textures. However, due to their high computational complexity, Transformers are mainly used in combination with CNNs, which shows much better performance [15].

Recently, diffusion models, which are generative models devised for image generation, have shown a high potential in SR by generating SR images that are even more visually pleasing and realistic [16]. However, diffusion models face challenges such as a high computational complexity and color shift, which require continuous follow-up studies.

In this way, SR has been implemented using various network architectures. However, this study focuses on GAN-based approaches due to their maturity and computational efficiency, as well as focusing on how to further improve their ability to generate perceptually enhanced SR images. In particular, we adopt ESRGAN as the baseline model because it is the most representative and widely used of the GAN-based SR approaches.

2.2. Direct Preference Optimization

DPO is a method that helps large, unsupervised language models better match human preferences using a simple classification approach [7]. DPO was originally proposed for language model alignment and has spread quickly as an effective fine-tuning method in various natural language processing tasks such as instruction tuning, summarization, and dialog generation. Rafailov et al. [7] showed that DPO can achieve an excellent performance in summarization tasks, and various extended studies have been reported since then. Examples include a method of calibrating the preference optimization loss to prevent over-confidence [17], a multi-objective optimization framework that addresses the diversity of human preferences [18], a method that uses self-retrospection to enhance preference optimization [19], a method of performing fine-grained preference optimization at the token level rather than on full responses [20], and a method of adding a regularization term to the preference optimization loss to disentangle response quality from length, thereby controlling the length bias in preference [21].

Recently, DPO has also been expanding into the field of image processing and computer vision, such as image generation and text-to-image alignment. Wallace et al. [22] applied DPO to a diffusion-based text-to-image generation model and made it better aligned with human preferences. Lee et al. [23] proposed a modified DPO known as direct consistency optimization, which controls the deviation between the fine-tuned and reference models, allowing the personalized text-to-image diffusion model to generate images consistent in both subject and style. Their method is not based on human preferences, but shows that the alignment ability of DPO is also effective for training personalized image generation models. Croitoru et al. [24] proposed Curriculum DPO, which combines curriculum learning and DPO to gradually learn from a series of easy to difficult examples by dividing text–image pairs based on difficulty. Through this, both stability and performance were improved in the alignment of the diffusion model. These studies show that DPO can be useful in the field of computer vision, but most of them are still limited to text-to-image diffusion models. They do not deviate from the original DPO framework, indicating that they cannot be applied to other models or network architectures. Most recently, as the only study to apply DPO to SR, direct semantic preference optimization (DSPO) [25], which is a method for suppressing visual artifacts by introducing semantic-level preference alignment in real-world SR, has been proposed. However, DSPO is also based on diffusion models; thus, it is not applicable to other network architectures and requires instance-level preference datasets.

Unlike the existing vision-domain DPO-based methods, our method provides a breakthrough to extend DPO to models or network architectures other than diffusion models, without the need for human-labeled preference datasets.

3. Proposed Method

3.1. Preliminaries

DPO is at the heart of the proposed method, so this section briefly reviews it.

Reinforcement learning methods such as reinforcement learning from human feedback (RLHF) [26], which fine-tune generative models to generate the output that people prefer, have recently emerged. They have played an important role in improving the performance of large-scale models (LMs) such as language models and diffusion models. RLHF trains a reward model ($r_{\phi}$) using the human preference dataset ($D$) by minimizing the negative log-likelihood loss, $-\mathbb{E}_{(x, y_{w}, y_{l}) \sim D}\left[\log \sigma\left(r_{\phi}(x, y_{w}) - r_{\phi}(x, y_{l})\right)\right]$, where $\sigma$ is the sigmoid function, while $y_{w}$ and $y_{l}$ represent the preferred and non-preferred outputs for the input $x$. Then, RLHF optimizes LMs ($\pi_{\theta}$) to produce outputs with high rewards while constraining them not to stray too far from the pre-trained LMs’ outputs, as follows:

$$\mathrm{RLHF}: \max_{\pi_{\theta}} \mathbb{E}_{x \sim D,\, y \sim \pi_{\theta}(y \mid x)}\left[r_{\phi}(x, y) - \beta D_{\mathrm{KL}}\left(\pi_{\theta}(y \mid x), \pi_{\mathrm{ref}}(y \mid x)\right)\right]. \tag{1}$$

Here, $\beta$ is a parameter controlling the deviation from the pre-trained reference model ($\pi_{\mathrm{ref}}$), while $D_{\mathrm{KL}}$ represents the Kullback–Leibler (KL) divergence. Since Equation (1) is not differentiable, optimization is performed through proximal policy optimization [27], which is one of the reinforcement learning methods. That is, with Equation (1) as the final reward for reinforcement learning, the policy model is trained so that the reward increases.
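As a small illustration of the reward-model objective described above, the following sketch evaluates the pairwise negative log-likelihood for a handful of reward pairs. It is only a toy under stated assumptions: the function names (`sigmoid`, `reward_nll`) and the reward values are hypothetical and do not come from the paper or from any RLHF library.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def reward_nll(pairs):
    """Mean of -log sigma(r_w - r_l) over (r_w, r_l) reward pairs.

    r_w is the reward assigned to the preferred output y_w and
    r_l the reward assigned to the non-preferred output y_l.
    """
    return sum(-math.log(sigmoid(r_w - r_l)) for r_w, r_l in pairs) / len(pairs)


# Hypothetical reward-model scores (r_w, r_l) for three comparisons.
well_ranked = [(2.0, 0.5), (1.0, -1.0), (0.3, 0.0)]
# The same scores with the preferred/non-preferred roles swapped.
mis_ranked = [(r_l, r_w) for r_w, r_l in well_ranked]

print(reward_nll(well_ranked))  # lower loss: preferred outputs score higher
print(reward_nll(mis_ranked))   # higher loss: preferred outputs score lower
```

The loss shrinks as the reward gap between preferred and non-preferred outputs grows, which is the signal used to fit the reward model.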

DPO sticks to the KL-constrained reward maximization, which is the strategy of RLHF for preference learning. However, DPO eliminated the process of training a reward model in advance, as well as the process of reinforcement learning for policy optimization, by reparameterizing the reward model with some algebra and applying the reparameterization to the Bradley–Terry preference model [28], enabling us to solve the RLHF problem using a simple differentiable loss, as follows:

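$$L_{\mathrm{DPO}}(\pi_{\theta}; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_{w}, y_{l}) \sim D}\left[\log \sigma\left(\beta \log \frac{\pi_{\theta}(y_{w} \mid x)}{\pi_{\mathrm{ref}}(y_{w} \mid x)} - \beta \log \frac{\pi_{\theta}(y_{l} \mid x)}{\pi_{\mathrm{ref}}(y_{l} \mid x)}\right)\right]. \tag{2}$$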

Here, $\beta \log \frac{\pi_{\theta}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ is an implicit reward model defined by the log probability of outputs. Therefore, the reward, i.e., the relative log probability of preferred to non-preferred outputs, increases during the training of the policy model ($\pi_{\theta}$). The common parameter of the two reward terms, $\beta$, plays the same role as in Equation (1). That is, a large $\beta$ increases the reward, resulting in a reduced DPO loss and the suppressed learning of $\pi_{\theta}$. This constrains $\pi_{\theta}$ from deviating from the reference model. The detailed process for deriving Equation (2) from Equation (1) can be found in [7]. The language model fine-tuned using the DPO loss (Equation (2)) showed a better performance in summarization and single-turn dialogue tasks compared to the RLHF fine-tuned model [7].
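To make Equation (2) concrete, the following minimal sketch evaluates the DPO loss for a single preference pair from log-probabilities. The function names and the numeric log-probabilities are assumptions chosen for illustration; a real implementation would obtain them from the policy and reference models.

```python
import math


def log_sigmoid(z: float) -> float:
    """log(sigma(z)) computed in a numerically stable way."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss of Equation (2) for one (y_w, y_l) pair.

    logp_*     : log pi_theta(y | x) under the policy being trained
    ref_logp_* : log pi_ref(y | x) under the frozen reference model
    """
    reward_w = beta * (logp_w - ref_logp_w)  # implicit reward of y_w
    reward_l = beta * (logp_l - ref_logp_l)  # implicit reward of y_l
    return -log_sigmoid(reward_w - reward_l)


# Hypothetical log-probabilities for one preference pair.
ref_w, ref_l = -5.0, -6.0

# Policy identical to the reference: both implicit rewards are zero.
print(dpo_loss(ref_w, ref_l, ref_w, ref_l))

# Policy that raised log p(y_w) and lowered log p(y_l): smaller loss.
print(dpo_loss(-4.0, -7.0, ref_w, ref_l, beta=0.1))

# Same policy with a larger beta: the reward margin is scaled up.
print(dpo_loss(-4.0, -7.0, ref_w, ref_l, beta=0.5))
```

When the policy equals the reference model the loss is $\log 2$, and for a positive reward margin a larger $\beta$ gives a smaller loss, which matches the description of $\beta$ above.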

We adopt ESRGAN as the baseline model, so we briefly review it here. ESRGAN is a GAN-based SR model [3]. It consists of a generator and a discriminator, which are trained in an adversarial manner. In other words, the generator is trained to generate more natural SR images, while the discriminator is trained to better distinguish between the generated SR images and the real HR images. This allows the generator to generate high-perceptual-quality SR images. The generator is trained with the following combination loss:

$$L_{\mathrm{ESRGAN}} = \alpha_{1} L_{\mathrm{adv}} + \alpha_{2} L_{\mathrm{percep}} + \alpha_{3} L_{1}. \tag{3}$$

Here, $L_{\mathrm{adv}}$, $L_{\mathrm{percep}}$, and $L_{1}$ are adversarial, perceptual, and L1 losses, respectively. The adversarial loss is computed from the discriminator output, indicating the probability that the generated image is discriminated as unrealistic. The perceptual loss represents the difference between the VGG [29] features of the SR and HR images. In summary, to train the generator, ESRGAN uses losses that reflect the perceptual quality and authenticity of images, but that are not directly related to human preferences.

3.2. Introducing DPO into the SR Process

We propose an SR-DPO loss in the training process to improve ESRGAN’s performance via preference learning. Here, we describe how to derive the SR-DPO loss.

In order to calculate the DPO loss in Equation (2), the preferred output ($y_{w}$) and the non-preferred output ($y_{l}$) for the input $x$ are required. However, unlike language models or diffusion models that produce probabilistically different outputs from the inputs, SR models always produce the same outputs from the inputs. Therefore, the DPO loss of Equation (2) cannot be directly applied to SR models such as ESRGAN. To resolve this inherent problem, we obtain preferred and non-preferred outputs from the outputs of $N_{B}$ images included in the input batch (denoted as $x_{B}$). That is, the DPO loss is modified to be calculated using the input $x_{w}$, which generates the output $y_{w}$ with the highest preference among the input images, and the input $x_{l}$, which generates the output $y_{l}$ with the lowest preference. The preference calculation is performed using the pre-trained PieAPP model [30], and the output with the lowest PieAPP value is determined as the preferred output, while the output with the highest PieAPP value is determined as the non-preferred output. The PieAPP model learned human preferences for images, i.e., a deep CNN model was trained to predict the probability that humans will prefer an image over the other using large-scale human-labeled preference datasets. It outputs image quality values correlated with real human preference responses; therefore, it fits our purpose well. Since we use the pre-trained and frozen PieAPP model to calculate human preferences, we can omit the explicit or implicit reward model training process required by DPO or RLHF. Furthermore, we no longer need to prepare the human preference datasets. Based on this, Equation (2) is modified as follows:

$$L_{\mathrm{SR\_DPO}}(\pi_{\theta}; \pi_{\mathrm{ref}}) = -\mathbb{E}_{x_{B} \sim I,\, (y_{w}, y_{l}) \sim \pi_{\theta}(y_{B} \mid x_{B})}\left[\log \sigma\left(\beta \log \frac{\pi_{\theta}(y_{w} \mid x_{B})}{\pi_{\mathrm{ref}}(y_{w} \mid x_{B})} - \beta \log \frac{\pi_{\theta}(y_{l} \mid x_{B})}{\pi_{\mathrm{ref}}(y_{l} \mid x_{B})}\right)\right]. \tag{4}$$

Here, $I$ is an image dataset that does not include human preferences and $\pi$ represents a model in which the SR model and the frozen PieAPP model are connected. The output of $\pi$ is a PieAPP value, not a probability, unlike the original DPO. However, this does not matter because the reward value is relatively computed from the ratio between preferred and non-preferred $\pi$ outputs. However, considering that the smaller the PieAPP value is, the more preferred the image is, as opposed to the reward value of DPO, we need to change the order of two reward terms as follows:

$$L_{\mathrm{SR\_DPO}}(\pi_{\theta}; \pi_{\mathrm{ref}}) = -\mathbb{E}_{x_{B} \sim I,\, (y_{w}, y_{l}) \sim \pi_{\theta}(y_{B} \mid x_{B})}\left[\log \sigma\left(\beta \log \frac{\pi_{\theta}(y_{l} \mid x_{B})}{\pi_{\mathrm{ref}}(y_{l} \mid x_{B})} - \beta \log \frac{\pi_{\theta}(y_{w} \mid x_{B})}{\pi_{\mathrm{ref}}(y_{w} \mid x_{B})}\right)\right]. \tag{5}$$

This loss allows SR models to be trained to make the non-preferred results less preferred and the preferred results more preferred. However, considering that the PieAPP value is always above 0, it is relatively easier for SR models to be trained in the direction of making the PieAPP values larger than in the direction of making the PieAPP values smaller; therefore, it is highly likely that only the training that makes SR images that are not preferred more non-preferred will proceed. Above all, the performance of SR models can be degraded if SR images that are not preferred are made to be more non-preferred. For this reason, we can prevent backpropagation through the first reward term (which is simply implemented by using the PyTorch detach function), enabling only the training that makes the SR images that are preferred more preferred, as follows:

$$L_{\mathrm{SR\_DPO}}(\pi_{\theta}; \pi_{\mathrm{ref}}) = -\mathbb{E}_{x_{B} \sim I,\, (y_{w}, y_{l}) \sim \pi_{\theta}(y_{B} \mid x_{B})}\left[\log \sigma\left(\beta \log \frac{\pi_{\theta_{\mathrm{no\_grad}}}(y_{l} \mid x_{B})}{\pi_{\mathrm{ref}}(y_{l} \mid x_{B})} - \beta \log \frac{\pi_{\theta}(y_{w} \mid x_{B})}{\pi_{\mathrm{ref}}(y_{w} \mid x_{B})}\right)\right]. \tag{6}$$

However, this method is not very helpful in improving the performance of SR models. This will be shown in the experimental results later.

Unlike language models, SR models can perform better by training in a way that increases the preference of relatively non-preferred SR images. Therefore, we return to Equation (4). However, we prevent backpropagation through the first reward term, thus disabling the training that makes the preferred SR images less preferred. As a result, our final SR-DPO loss is defined as follows:

$$L_{\mathrm{SR\_DPO}}(\pi_{\theta}; \pi_{\mathrm{ref}}) = -\mathbb{E}_{x_{B} \sim I,\, (y_{w}, y_{l}) \sim \pi_{\theta}(y_{B} \mid x_{B})}\left[\log \sigma\left(\beta \log \frac{\pi_{\theta_{\mathrm{no\_grad}}}(y_{w} \mid x_{B})}{\pi_{\mathrm{ref}}(y_{w} \mid x_{B})} - \beta \log \frac{\pi_{\theta}(y_{l} \mid x_{B})}{\pi_{\mathrm{ref}}(y_{l} \mid x_{B})}\right)\right]. \tag{7}$$

This loss serves to increase the preference for low-preference SR images while maintaining the preference of high-preference SR images by comparing the preferences of SR images in the batch.
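The batch-wise pair selection and the final loss of Equation (7) can be sketched in plain Python as follows. This is a hedged illustration rather than the authors' implementation: the PieAPP-like scores are made-up positive numbers, the function names are hypothetical, and the sketch assumes that the reference scores are the PieAPP values of the reference model's outputs for the same inputs. In place of automatic differentiation, the stop-gradient on the preferred term is emulated by returning only the hand-derived derivative with respect to the non-preferred policy score; in PyTorch the same effect would come from the detach call mentioned in the text.

```python
import math


def select_pair(scores):
    """Indices of the preferred and non-preferred outputs in a batch.

    Scores are PieAPP-like values, so lower means more preferred.
    """
    idx = range(len(scores))
    i_w = min(idx, key=scores.__getitem__)  # lowest score: preferred
    i_l = max(idx, key=scores.__getitem__)  # highest score: non-preferred
    return i_w, i_l


def sr_dpo_loss_and_grad(policy_scores, ref_scores, beta=0.1):
    """Equation (7)-style loss for one batch and its gradient.

    Returns (loss, i_w, i_l, dloss/d policy_scores[i_l]). The preferred
    policy score is treated as a constant (stop-gradient), so no
    derivative is returned for it. All scores are assumed positive.
    """
    i_w, i_l = select_pair(policy_scores)
    z = beta * (math.log(policy_scores[i_w] / ref_scores[i_w])
                - math.log(policy_scores[i_l] / ref_scores[i_l]))
    sig = 1.0 / (1.0 + math.exp(-z))
    loss = -math.log(sig)
    # d(-log sigma(z))/dz = -(1 - sigma(z)) and dz/ds_l = -beta / s_l
    grad_l = beta * (1.0 - sig) / policy_scores[i_l]
    return loss, i_w, i_l, grad_l


# Hypothetical PieAPP-like scores for a batch of N_B = 4 SR outputs.
policy = [1.2, 0.8, 2.5, 1.6]
reference = [1.3, 0.9, 2.4, 1.7]

loss, i_w, i_l, grad_l = sr_dpo_loss_and_grad(policy, reference)
print(i_w, i_l)  # indices of the preferred and non-preferred outputs
print(loss, grad_l)
```

Because the returned derivative is positive, a gradient-descent step lowers the PieAPP-like score of the non-preferred output, i.e., it makes that output more preferred, while the preferred output receives no update from this loss.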

3.3. Determining the Reference Model

DPO is an approach to fine-tune a given reference model through preference learning. Therefore, we need to determine the reference model for our study. At first, we considered using a fully trained ESRGAN generator with the L1, perceptual, and adversarial losses as a reference model. However, this forces ESRGAN to be tweaked while relying too heavily on PieAPP values: the PieAPP values decrease compared to the reference model, but the performance on other evaluation metrics becomes rather poor. This will be shown in the experimental results later.

As a result, we use an ESRGAN generator trained only with L1 loss as a reference model. However, in the subsequent fine-tuning process, the SR-DPO loss and the perceptual and adversarial losses used in ESRGAN are used together. This allows us to further improve the perceptual quality of SR images through preference learning without degrading the original performance of ESRGAN.

Additionally, we also consider not using the reference model. That is, unlike DPO, where the relative reward to the reference model is maximized, we attempt to maximize the absolute reward of the policy model. This is the same as not using the KL divergence constraint in Equation (1). The SR-DPO loss without the reference model is defined as follows:

$$L_{\mathrm{SR\_DPO\_NoRef}}(\pi_{\theta}) = -\mathbb{E}_{x_{B} \sim I,\, (y_{w}, y_{l}) \sim \pi_{\theta}(y_{B} \mid x_{B})}\left[\log \sigma\left(\beta \log \pi_{\theta_{\mathrm{no\_grad}}}(y_{w} \mid x_{B}) - \beta \log \pi_{\theta}(y_{l} \mid x_{B})\right)\right]. \tag{8}$$

3.4. Training ESRGAN Using the SR-DPO Loss

The proposed method can be summarized as fine-tuning the ESRGAN generator using an additional loss that is related to user preferences. Figure 1 shows the process flow of the proposed method. The SR-DPO loss is shared among the images in the input batch because it is computed once per batch, while the other losses are computed independently for each image. The generator is fine-tuned to make $y_{l}^{(i)}$ more preferred, while being constrained to make it similar to $y_{l}^{(0)}$. That is, regardless of whether the reference model is used or not, the ESRGAN generator is pre-trained first with only the L1 loss and fine-tuned with the discriminator using a combination loss, as follows:

$$L_{+\mathrm{DPO}} = L_{\mathrm{ESRGAN}} + \alpha_{4} L_{\mathrm{SR\_DPO}}. \tag{9}$$

Here, $L_{\mathrm{ESRGAN}}$ represents the total loss of the original ESRGAN in Equation (3). In $L_{\mathrm{SR\_DPO}}$, $\beta$ is set to 0.1, because too small a value cannot preserve the performance of the reference model, while a value greater than 0.2 degrades the performance of preference learning [7].

4. Experimental Results and Discussion

4.1. Setup

For the experiment, we trained ESRGAN and its variants using the DIV2K [31] training dataset. To increase the amount of training data, data augmentation was performed through random horizontal or vertical flips. To test the trained models, we used the DIV2K validation dataset, Urban100 [32], BSD100 [33], Set5 [34], and Set14 [35].

For training, HR images were randomly cropped to $128 \times 128$ and downsampled using bicubic interpolation with a scaling factor of 4 to obtain LR images. All the training parameters of the original ESRGAN method were maintained, except that the mini-batch size ($N_{B}$) was set to 4. The number of epochs ($N$) was 500. Only the generator was trained with the L1 loss during the first 200 epochs; during the remaining epochs, the generator and discriminator were adversarially trained using all the losses in Equations (3) and (9), where $\alpha_{1} = 5 \times 10^{-3}$, $\alpha_{2} = 1.0$, $\alpha_{3} = 1 \times 10^{-2}$, and $\alpha_{4} = 1.0$. The learning rate was set to $1 \times 10^{-4}$ and was halved at $[0.125, 0.250, 0.500, 0.750] \times N$ epochs. The model was optimized using Adam with $\beta_{1} = 0.9$ and $\beta_{2} = 0.999$. The implementation was carried out using PyTorch 2.1.0, and training was performed on a PC with an Intel i7 2.1 GHz CPU and an NVIDIA RTX 3090 GPU.
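The training schedule described above can be sketched as a pair of small helper functions. Several details are assumptions rather than statements from the paper: fractional milestones such as $0.125 \times 500 = 62.5$ are truncated to integers, the schedule is counted over all 500 epochs without restarting at the adversarial phase, and the function and phase names are hypothetical.

```python
def learning_rate(epoch, total_epochs=500, base_lr=1e-4,
                  fractions=(0.125, 0.250, 0.500, 0.750)):
    """Step schedule: the rate is halved at each milestone epoch.

    Milestones are fractions of the total epoch count, truncated to
    integers (an assumption; the rounding rule is not specified).
    """
    milestones = [int(f * total_epochs) for f in fractions]
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * (0.5 ** passed)


def training_phase(epoch, pretrain_epochs=200):
    """L1-only generator pre-training first, adversarial fine-tuning after."""
    return "l1_pretrain" if epoch < pretrain_epochs else "gan_finetune"


# Print the rate and phase at a few representative epochs.
for epoch in (0, 62, 125, 200, 250, 375, 499):
    print(epoch, training_phase(epoch), learning_rate(epoch))
```

Under these assumptions the rate has already been halved twice when adversarial fine-tuning begins at epoch 200; whether the original training resets the schedule at that point is not stated in the text.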

To evaluate the quantitative and qualitative quality of SR images, we computed the values of PSNR, SSIM, LPIPS, PieAPP, and NIQE. NIQE is a no-reference image quality metric, and the others are full-reference ones. PSNR has been most commonly used to measure the quality of reconstructed images, comparing the maximum value of pixels ($Peak$) with the mean-squared error ($MSE$) between the SR and HR images [36,37,38]. This is defined as follows: $PSNR = 20 \log_{10} Peak - 10 \log_{10} MSE$. SSIM estimates the similarity between SR and HR images and is defined as follows: $SSIM = l(y, \bar{y}) \cdot c(y, \bar{y}) \cdot s(y, \bar{y})$. Here, $y$ and $\bar{y}$ represent SR and HR images, respectively. $l$, $c$, and $s$ represent the luminance, contrast, and structural similarities between two images, respectively, and can be computed from the average, standard deviation, and covariance of the pixel values [36,39,40]. Therefore, higher PSNR and SSIM values indicate better image quality.

LPIPS extracts features from images using a CNN model pre-trained on an image classification task and computes the L2 distance between the features, as follows: $LPIPS = \sum_{l} \frac{1}{H_{l} W_{l}} \sum_{h, w} \left[w_{l} \left(f_{h, w}^{l} - \bar{f}_{h, w}^{l}\right)\right]^{2}$ [41]. Here, $f^{l}$ and $\bar{f}^{l}$ are the SR and HR feature maps obtained from the $l$-th layer of the pre-trained CNN model. $w_{l}$, $H_{l}$, and $W_{l}$ are the scale factor, the height, and the width of the feature maps for layer $l$, respectively.

PieAPP trains a CNN model using a pairwise learning framework to predict the probability that humans will prefer one image over the other on human-labeled preference datasets, and the trained model is used to measure the perceptual difference between images [30]. The PieAPP value is more correlated with human preferences.

NIQE measures deviations from statistical regularities observed in natural images, without reference images [42]. It extracts statistical features based on a natural scene statistic (NSS) model from images and computes a multivariate Gaussian (MVG) fit of the features. The quality of an image is expressed as the distance between its MVG fit and the MVG fit of the NSS features extracted from natural images, as follows: $NIQE = \sqrt{(\nu_{1} - \nu_{2})^{T} \left(\frac{\Sigma_{1} + \Sigma_{2}}{2}\right)^{-1} (\nu_{1} - \nu_{2})}$. Here, $\nu$ and $\Sigma$ are the mean vectors and covariance matrices of the MVG fits. Therefore, lower PieAPP, LPIPS, and NIQE values indicate better image quality.
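As a worked instance of the PSNR definition above, the following sketch computes PSNR for two short pixel sequences. The function name, the flat-list image representation, and the 8-bit peak value of 255 are assumptions made for illustration only.

```python
import math


def psnr(sr, hr, peak=255.0):
    """PSNR = 20*log10(peak) - 10*log10(MSE) for equally sized pixel lists."""
    if len(sr) != len(hr) or not sr:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(sr, hr)) / len(sr)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 20.0 * math.log10(peak) - 10.0 * math.log10(mse)


# Hypothetical 8-bit pixel values of an HR image and two SR estimates.
hr = [0, 64, 128, 255]
sr_close = [1, 63, 129, 254]  # every pixel off by 1, so MSE = 1
sr_far = [10, 54, 138, 245]   # every pixel off by 10, so MSE = 100

print(psnr(sr_close, hr))  # 20*log10(255), roughly 48.13 dB
print(psnr(sr_far, hr))    # 20 dB lower, since MSE is 100 times larger
```

The second estimate scores exactly 20 dB lower because a hundredfold increase in MSE subtracts $10 \log_{10} 100 = 20$ from the PSNR.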

4.2. Effectiveness of Using the SR-DPO Loss

To show the effectiveness of using the SR-DPO loss, we compared ESRGANs trained with different losses: $L_{\mathrm{ESRGAN}}$, $L_{+\mathrm{PieAPP}}$, and $L_{+\mathrm{DPO}}$. $L_{+\mathrm{PieAPP}}$ is the addition of the PieAPP values of generated SR images to $L_{\mathrm{ESRGAN}}$, which is proposed in [5] and has proven to outperform other loss modifications. Table 1 shows the results of the quality metrics. Compared to $L_{\mathrm{ESRGAN}}$, $L_{+\mathrm{DPO}}$ contributed to a significant reduction in PieAPP values indicating user preferences, and it can be seen that its effectiveness is higher than $L_{+\mathrm{PieAPP}}$. In addition, it can be seen that $L_{+\mathrm{PieAPP}}$ has the problem of increasing the LPIPS values, but $L_{+\mathrm{DPO}}$ has similar LPIPS values to $L_{\mathrm{ESRGAN}}$. Quantitative quality metrics such as PSNR or SSIM are also degraded using $L_{+\mathrm{PieAPP}}$, but are maintained or improved using $L_{+\mathrm{DPO}}$. Both $L_{+\mathrm{PieAPP}}$ and $L_{+\mathrm{DPO}}$ contributed to reducing the NIQE values, and the performance difference between the two was not significant. Consequently, it can be said that $L_{+\mathrm{DPO}}$ effectively improves the quantitative and perceptual image quality of SR images without compromising the performance of ESRGAN, and $L_{+\mathrm{DPO}}$ outperforms $L_{+\mathrm{PieAPP}}$.

Figure 2 shows some SR images generated using $L_{\mathrm{ESRGAN}}$, $L_{+\mathrm{PieAPP}}$, and $L_{+\mathrm{DPO}}$, respectively. $L_{\mathrm{ESRGAN}}$ struggled to recover local details and caused unnatural visual artifacts (e.g., line artifacts in the crosswalk or severely distorted lines in the wood fence). $L_{+\mathrm{PieAPP}}$ reduced the visual artifacts in some images, but tended to lose structural contexts more than $L_{\mathrm{ESRGAN}}$, which may be the reason why its SSIM and LPIPS values were worse than those of $L_{\mathrm{ESRGAN}}$. As expected, $L_{+\mathrm{DPO}}$ recovered local details much better (e.g., the boundaries of the white dots on the butterfly and the stripes on the tiger are clear, and their shapes are more similar to the ground truth), while minimizing unnatural visual artifacts (the line artifacts at the crosswalk are not strong).

In this study, we do not replace or eliminate the losses originally used for ESRGAN, as the goal is to fine-tune ESRGAN by adding the SR-DPO loss to the existing losses. In other words, the SR-DPO loss cannot replace the original losses. Referring to [5], it is predictable that any replacement or elimination will significantly degrade the performance of ESRGAN because the original losses play an important role in achieving a high perceptual quality of the SR images.

4.3. Performance Comparison Based on How to Design the SR-DPO Loss

Section 3.2 described how we designed the SR-DPO loss. In this section, we show how much the performance degrades when the SR-DPO loss is designed differently. In Table 2, we compare the results of using Equations (4), (5), (6), and (7), respectively. As noted in Section 3.2, it was not good for SR to make the non-preferred results more non-preferred and the preferred results more preferred (Equation (5)). The image quality became severely worse on all metrics. Rather, making preferred SR images less preferred and non-preferred images more preferred produced better results (Equation (4)). It seems that the performance improvement from making non-preferred images preferred is relatively large. However, making preferred SR images less preferred hindered performance improvement, as expected. Performing only the training that makes the preferred SR images more preferred also did not result in performance improvement (Equation (6)). Only the proposed scheme, which only makes non-preferred images preferred, resulted in high performance improvements (Equation (7)).

4.4. Influence of the Reference Model

As mentioned in Section 3.3, different reference models were considered. First, a fully trained ESRGAN with $L_{\mathrm{ESRGAN}}$ was used as a reference model and was fine-tuned with $L_{\mathrm{SR\_DPO}}$. Second, an ESRGAN trained with $L_{1}$ was used as a reference model and was fine-tuned with $L_{+\mathrm{DPO}}$. Third, no reference model was used and the L1-trained ESRGAN was fine-tuned with $L_{+\mathrm{DPO\_NoRef}}$, which replaced $L_{\mathrm{SR\_DPO}}$ of $L_{+\mathrm{DPO}}$ with $L_{\mathrm{SR\_DPO\_NoRef}}$ in Equation (8). Table 3 shows the results. In the first case, the PieAPP value decreased compared to the reference ESRGAN, but the performance was significantly reduced in other evaluation metrics by focusing too much on lowering the PieAPP value. In principle, our DPO method should maintain the original performance owing to the KL constraint. However, a small $\beta$ could not guarantee that the performance was maintained. This also implies that the DPO loss is not redundant with the perceptual and adversarial losses of ESRGAN. When the L1-trained ESRGAN was used as a reference model, preference learning was performed effectively while maintaining the performance of the reference model, and PieAPP and LPIPS values were greatly improved. When no reference model was used, PieAPP values could be lowered much further by not caring about maintaining the performance of the reference model, but the performance was worse than that when using the L1-trained reference model in other evaluation metrics.

4.5. Performance Comparison of the PieAPP and LPIPS Models for Preference Calculation

As explained in Section 3.2, our method uses the pre-trained PieAPP model for preference calculation. To show its suitability, we also tried the pre-trained LPIPS model [41] for the calculation and analyzed the results. As with the PieAPP value, a smaller LPIPS value indicates a more preferred image; therefore, our formula for DPO-based SR can be used as is. The results are shown in Table 4. Even with the LPIPS model, the perceptual quality of the SR images could be improved while the performance of the reference model (vanilla ESRGAN) was maintained. However, compared to the results of using the PieAPP model for preference calculation (the results of $L_{+DPO}$ in Table 1), the PieAPP values were significantly higher, although the LPIPS and NIQE values were slightly lower. Low NIQE values did not always imply perceptually better SR results. As shown in Figure 3, the LPIPS model tended to produce a rich but unnatural texture. The LPIPS model also caused a color shift in the SR images. In addition, in the remaining metrics (PSNR and SSIM), the performance of using the LPIPS model was worse. The performance difference between the datasets was not discernible. As a result, we confirm that the PieAPP model is more suitable for preference calculation. This was predictable because the PieAPP model was trained to infer the difference between human preferences, while the LPIPS model was trained to infer the difference between perceptual features, as explained in Section 4.1.

4.6. Influence of the Batch Size for Preference Calculation

As explained in Section 3.2, our DPO method determines preferred and non-preferred outputs within a batch, so the performance may depend on the batch size ($N_{B}$). Therefore, we analyzed the effect of batch size on performance. However, owing to limited GPU memory, the batch size could not be set higher than 8, so we simply compared the results of increasing the batch size to 8 with the results presented previously (obtained with a batch size of 4). Table 5 shows the results when the batch size is 8. Compared to the results when the batch size is 4 (the results of $L_{+DPO}$ in Table 1), there were some differences depending on the dataset, but no discernible overall difference in performance according to the batch size. If anything, the PSNR and LPIPS values were slightly better with the smaller batch size. Therefore, smaller batch sizes are preferred.
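To make the dependence on batch size concrete, the sketch below picks a preferred and a non-preferred output from one batch of perceptual scores. It is only a minimal illustration under stated assumptions: the function name `select_preference_pair` is hypothetical, the scores stand in for PieAPP (or LPIPS) values where lower means more preferred, and choosing the single best and worst image is an assumed selection rule rather than the exact procedure of Section 3.2.

```python
from typing import Sequence, Tuple


def select_preference_pair(scores: Sequence[float]) -> Tuple[int, int]:
    """Return (preferred_index, non_preferred_index) for one batch.

    Assumes lower scores are better, as for PieAPP and LPIPS.
    """
    if len(scores) < 2:
        raise ValueError("need at least two outputs to form a preference pair")
    indices = range(len(scores))
    preferred = min(indices, key=lambda i: scores[i])      # lowest score
    non_preferred = max(indices, key=lambda i: scores[i])  # highest score
    return preferred, non_preferred


if __name__ == "__main__":
    batch_scores = [0.55, 0.26, 0.84, 0.40]  # illustrative values, batch size 4
    print(select_preference_pair(batch_scores))
```

With such a rule, a larger batch offers more candidates per pair, which is one reason the batch size could plausibly affect the result.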

4.7. Performance Comparison with Other ESRGAN Improvement Models

Our DPO method modifies only the loss function without modifying the network architecture of ESRGAN. Table 1 showed that our method outperforms other methods that have attempted to modify the loss function. Here, our DPO method is compared with recent methods that have attempted to modify the network architecture of the ESRGAN generator or discriminator: Real-ESRGAN [13], StarSRGAN [43], ESRGAN-DP [10], A-ESRGAN [11], MSA-ESRGAN [6], and SeD [44]. The results are given in Table 6.

Our DPO method achieved the best PieAPP values. However, because it aims to improve the perceptual quality or human preference of SR images while maintaining the performance of ESRGAN, the improvement in the other metrics was modest, resulting in lower performance than SeD [44]. (Note that the results of SeD were obtained using the weight file provided by the authors, which was trained on a different training dataset from ours with a larger mini-batch size. This may be why SeD performs better than our DPO method.) This shows that even if the loss function is improved to enhance the generator in the GAN structure, the gain may be limited unless the discriminator is improved as well. This also implies that ESRGAN can be improved more significantly by improving the discriminator. ESRGAN’s discriminator is not helpful for recovering local details at the pixel level because it classifies images at the image level, which is why most ESRGAN improvements have focused on the discriminator. SeD seems to have effectively solved the ESRGAN discriminator problem. However, when incorporated into the image-wise discriminator, even SeD experienced significant performance degradation [44]. For this reason, as shown in Figure 4, our DPO method was not able to recover local fine details as clearly as SeD (e.g., the dense stripes on the roof are broken or distorted).

Nevertheless, our DPO method performed better than the other methods. ESRGAN-DP achieved values close to those of SeD in all metrics, thanks to the use of ResNet features that complement VGG features in calculating $L_{percep}$. However, ESRGAN-DP was still limited in recovering fine details and often produced annoying visual artifacts or structural distortions that degraded the perceptual quality of SR images; its overall perceptual quality was worse than that of our DPO method.

Real-ESRGAN, StarSRGAN, A-ESRGAN, and MSA-ESRGAN also improved the ESRGAN discriminator by leveraging the U-Net discriminator [45], which allows images to be classified at the pixel level. However, they improved only the NIQE value and underperformed in the other performance metrics. This seems to be due to recovering rich but unnatural local textures (which can be seen in Figure 4). A-ESRGAN tended to smooth out fine details and thicken lines, and Real-ESRGAN and MSA-ESRGAN showed the same tendency. A-ESRGAN and Real-ESRGAN also distorted local structures more significantly. StarSRGAN integrated the network architectures and losses used in Real-ESRGAN, A-ESRGAN, and ESRGAN-DP, thus achieving better performance than Real-ESRGAN, A-ESRGAN, and MSA-ESRGAN. However, StarSRGAN still produced SR images with structural distortion and unnatural textures and performed worse than our DPO method, which simply adds the DPO loss to the vanilla ESRGAN. In all performance metrics except NIQE, Real-ESRGAN, StarSRGAN, A-ESRGAN, and MSA-ESRGAN performed worse than our DPO method.

5. Conclusions and Future Works

This study proposed DPO-ESRGAN, a method that applies DPO to ESRGAN. The DPO formula was modified to fit the SR process. DPO-ESRGAN uses the pre-trained PieAPP model for preference calculation and can be trained without human preference datasets. Through ×4 SR experiments on benchmark datasets, it was demonstrated that DPO-ESRGAN can generate SR images with significantly higher perceptual quality and human preference than ESRGAN and other ESRGAN variants. DPO-ESRGAN worked well for small batch sizes; although we were unable to perform a full analysis, smaller batch sizes yielded slightly better results. This study aims to generate SR images with high perceptual quality and human preference. Therefore, our method is more useful in applications such as the multimedia industry and commercial image/video enhancement than in applications that require high fidelity, such as forensics, medical diagnosis, and biometrics.

In this study, we adopted ESRGAN as the baseline model, but our DPO method can be applied to other SR models with different network architectures without modification. However, to verify this applicability, experimental validation is required. Related experiments remain a part of our future study.

As discussed in Section 4.7, our DPO method did not achieve as impressive a performance improvement as expected, due to the tendency to maintain the performance of the reference model, as well as the lack of discriminator performance. For further improvements, we believe that the performance of the discriminator must be improved, which remains to be explored in a future study.

Unlike what was mentioned in Section 3.4, $\beta$ may operate differently in models other than language models. Therefore, an experimental analysis of the sensitivity of our DPO method to $\beta$ would be an interesting future study.

PieAPP values directly encode human preferences. Therefore, based on the analysis of PieAPP values, we concluded that our DPO method improves the human preference of SR images. However, a direct human evaluation of the SR images would better support this conclusion. We plan to conduct human evaluation experiments in the near future; this is our highest-priority future work, as it is needed to fully validate our claims.

Author Contributions

Conceptualization: W.Y. and H.P.; methodology: W.Y. and H.P.; software: W.Y.; supervision: H.P.; validation: W.Y. and H.P.; writing—original draft: W.Y.; writing—review and editing: H.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data that support the findings of this study are publicly available in the online repositories.

Conflicts of Interest

We have no conflicts of interest to declare.

Abbreviations

The following abbreviations are used in this manuscript:

SR	Super-Resolution
CNN	Convolutional Neural Network
GAN	Generative Adversarial Network
DPO	Direct Preference Optimization
LR	Low Resolution
HR	High Resolution
DSPO	Direct Semantic Preference Optimization
RLHF	Reinforcement Learning from Human Feedback
LM	Large-scale Model
KL	Kullback–Leibler

References

1. Lepcha, D.C.; Goyal, B.; Dogra, A.; Goyal, V. Image super-resolution: A comprehensive review, recent trends, challenges and applications. Inf. Fusion 2023, 91, 230–260.
2. Ye, S.; Zhao, S.; Hu, Y.; Xie, C. Single-Image Super-Resolution Challenges: A Brief Review. Electronics 2023, 12, 2975.
3. Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; Loy, C.C. ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks. In Lecture Notes in Computer Science, Proceedings of the ECCV 2018 Workshops, Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 63–79.
4. Rakotonirina, N.C.; Rasoanaivo, A. ESRGAN+: Further Improving Enhanced Super-Resolution Generative Adversarial Network. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 3637–3641.
5. Choi, Y.; Park, H. Improving ESRGAN with an additional image quality loss. Multimed. Tools Appl. 2023, 82, 3123–3137.
6. Chen, Q.; Li, H.; Lu, G. Training ESRGAN with multi-scale attention U-Net discriminator. Sci. Rep. 2024, 14, 29036.
7. Rafailov, R.; Sharma, A.; Mitchell, E.; Ermon, S.; Manning, C.D.; Finn, C. Direct preference optimization: Your language model is secretly a reward model. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Curran Associates Inc.: Red Hook, NY, USA, 2023.
8. Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 694–711.
9. Ledig, C.; Theis, L.; Huszar, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 105–114.
10. Song, J.; Yi, H.; Xu, W.; Li, X.; Li, B.; Liu, Y. ESRGAN-DP: Enhanced super-resolution generative adversarial network with adaptive dual perceptual loss. Heliyon 2023, 9, e15134.
11. Wei, Z.; Huang, Y.; Chen, Y.; Zheng, C.; Gao, J. A-ESRGAN: Training Real-World Blind Super-Resolution with Attention U-Net Discriminators. In Lecture Notes in Computer Science, Proceedings of the 20th Pacific Rim International Conference on Artificial Intelligence, Jakarta, Indonesia, 15–19 November 2023; Springer: Singapore, 2023; pp. 16–27.
12. Zhang, K.; Liang, J.; Van Gool, L.; Timofte, R. Designing a Practical Degradation Model for Deep Blind Image Super-Resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, QC, Canada, 10–17 October 2021; pp. 4771–4780.
13. Wang, X.; Xie, L.; Dong, C.; Shan, Y. Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, QC, Canada, 10–17 October 2021; pp. 1905–1914.
14. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image Restoration Using Swin Transformer. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, QC, Canada, 10–17 October 2021; pp. 1833–1844.
15. Lu, Z.; Li, J.; Liu, H.; Huang, C.; Zhang, L.; Zeng, T. Transformer for Single Image Super-Resolution. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022; pp. 456–465.
16. Moser, B.B.; Shanbhag, A.S.; Raue, F.; Frolov, S.; Palacio, S.; Dengel, A. Diffusion Models, Image Super-Resolution, and Everything: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 11793–11813.
17. Xiao, T.; Yuan, Y.; Zhu, H.; Li, M.; Honavar, V.G. Cal-DPO: Calibrated Direct Preference Optimization for Language Model Alignment. In Proceedings of the 37th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024; Curran Associates, Inc.: Red Hook, NY, USA, 2024; Volume 37, pp. 114289–114320.
18. Zhou, Z.; Liu, J.; Shao, J.; Yue, X.; Yang, C.; Ouyang, W.; Qiao, Y. Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 10586–10613.
19. Ahn, D.; Choi, Y.; Kim, S.; Yu, Y.; Kang, D.; Choi, J. ISR-DPO: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective DPO. In Proceedings of the 39th Annual AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 20–27 February 2025.
20. Zeng, Y.; Liu, G.; Ma, W.; Yang, N.; Zhang, H.; Wang, J. Token-level direct preference optimization. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024.
21. Park, R.; Rafailov, R.; Ermon, S.; Finn, C. Disentangling Length from Quality in Direct Preference Optimization. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 4998–5017.
22. Wallace, B.; Dang, M.; Rafailov, R.; Zhou, L.; Lou, A.; Purushwalkam, S.; Ermon, S.; Xiong, C.; Joty, S.; Naik, N. Diffusion Model Alignment Using Direct Preference Optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 8228–8238.
23. Lee, K.; Kwak, S.; Sohn, K.; Shin, J. Direct Consistency Optimization for Robust Customization of Text-to-Image Diffusion models. In Proceedings of the 37th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024; Curran Associates, Inc.: Red Hook, NY, USA, 2024; Volume 37, pp. 103269–103304.
24. Croitoru, F.A.; Hondru, V.; Ionescu, R.T.; Sebe, N.; Shah, M. Curriculum Direct Preference Optimization for Diffusion and Consistency Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 10–17 June 2025.
25. Cai, M.; Li, S.; Li, W.; Huang, X.; Chen, H.; Hu, J.; Wang, Y. DSPO: Direct Semantic Preference Optimization for Real-World Image Super-Resolution. arXiv 2025, arXiv:2504.15176.
26. Christiano, P.F.; Leike, J.; Brown, T.B.; Martic, M.; Legg, S.; Amodei, D. Deep reinforcement learning from human preferences. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 4302–4310.
27. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347.
28. Bradley, R.A.; Terry, M.E. Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons. Biometrika 1952, 39, 324–345.
29. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015; Bengio, Y., LeCun, Y., Eds.; pp. 1–14.
30. Prashnani, E.; Cai, H.; Mostofi, Y.; Sen, P. PieAPP: Perceptual Image-Error Assessment Through Pairwise Preference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 1808–1817.
31. Agustsson, E.; Timofte, R. NTIRE 2017 Challenge on Single Image Super-Resolution: Dataset and Study. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 1122–1131.
32. Huang, J.B.; Singh, A.; Ahuja, N. Single image super-resolution from transformed self-exemplars. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 5197–5206.
33. Martin, D.; Fowlkes, C.; Tal, D.; Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the Eighth IEEE International Conference on Computer Vision, Vancouver, BC, Canada, 7–14 July 2001; pp. 416–423.
34. Bevilacqua, M.; Roumy, A.; Guillemot, C.; Alberi Morel, M.-L. Low-Complexity Single-Image Super-Resolution based on Nonnegative Neighbor Embedding. In Proceedings of the British Machine Vision Conference, Surrey, UK, 3–7 September 2012; pp. 135.1–135.10.
35. Zeyde, R.; Elad, M.; Protter, M. On Single Image Scale-Up Using Sparse-Representations. In Proceedings of the International Conference on Computing Sciences (ICCS), Phagwara, India, 14–15 September 2012; pp. 711–730.
36. Horé, A.; Ziou, D. Image Quality Metrics: PSNR vs. SSIM. In Proceedings of the 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 2366–2369.
37. Li, L.; Song, S.; Lv, M.; Jia, Z.; Ma, H. Multi-Focus Image Fusion Based on Fractal Dimension and Parameter Adaptive Unit-Linking Dual-Channel PCNN in Curvelet Transform Domain. Fractal Fract. 2025, 9, 157.
38. Lv, M.; Song, S.; Jia, Z.; Li, L.; Ma, H. Multi-Focus Image Fusion Based on Dual-Channel Rybak Neural Network and Consistency Verification in NSCT Domain. Fractal Fract. 2025, 9, 432.
39. Wang, Z.; Bovik, A.; Sheikh, H.; Simoncelli, E. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
40. Cao, Z.H.; Liang, Y.J.; Deng, L.J.; Vivone, G. An Efficient Image Fusion Network Exploiting Unifying Language and Mask Guidance. IEEE Trans. Pattern Anal. Mach. Intell. 2025.
41. Zhang, R.; Isola, P.; Efros, A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595.
42. Mittal, A.; Soundararajan, R.; Bovik, A.C. Making a “Completely Blind” Image Quality Analyzer. IEEE Signal Process. Lett. 2013, 20, 209–212.
43. Vo, K.D.; Bui, L.T. StarSRGAN: Improving Real-World Blind Super-Resolution. In Proceedings of the International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision, Pilsen, Czech Republic, 15–19 May 2023; pp. 62–72.
44. Li, B.; Li, X.; Zhu, H.; Jin, Y.; Feng, R.; Zhang, Z.; Chen, Z. SeD: Semantic-Aware Discriminator for Image Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 25784–25795.
45. Schönfeld, E.; Schiele, B.; Khoreva, A. A U-Net Based Discriminator for Generative Adversarial Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8204–8213.

Figure 1. Process flow of the proposed method. The shaded part represents the vanilla ESRGAN. $x_{B}$, $y_{B}$, and ${\bar{y}}_{B}$ represent batches consisting of $N_{B}$ LR, SR, and HR images, respectively.

Figure 2. Visual comparison between SR images generated using different losses. (a) Cropped HR images, (b) SR images using $L_{ESRGAN}$, (c) SR images using $L_{+PieAPP}$, and (d) SR images using $L_{+DPO}$. The red arrow points to visual artifacts.

Figure 3. Visual comparison between SR images generated using different preference calculation models. (a) SR images using the LPIPS model and (b) SR images using the PieAPP model. The red arrow points to visual artifacts.

Figure 4. Visual comparison between SR images generated using different methods. (a) Cropped HR images, (b) bicubic interpolation, (c) our DPO method, (d) Real-ESRGAN, (e) StarSRGAN, (f) ESRGAN-DP, (g) A-ESRGAN, (h) MSA-ESRGAN, and (i) SeD. The red arrow points to visual artifacts.

Table 1. Comparison of ESRGANs trained with different losses. $L_{+PieAPP}$ and $L_{+DPO}$ represent the addition of the PieAPP loss and the proposed SR-DPO loss to $L_{ESRGAN}$, respectively.

Dataset	Loss	PSNR ↑	SSIM ↑	LPIPS ↓	PieAPP ↓	NIQE ↓
DIV2K_valid	$L_{ESRGAN}$ [3]	25.6795	0.7278	0.1237	0.5507	4.6551
	$L_{+PieAPP}$ [5]	25.4889	0.7275	0.1414	0.2749	4.3927
	$L_{+DPO}$	25.9753	0.7409	0.1244	0.2626	4.2588
Urban100	$L_{ESRGAN}$ [3]	22.0044	0.6752	0.1587	0.8355	4.5552
	$L_{+PieAPP}$ [5]	21.6073	0.6635	0.1831	0.5079	4.4453
	$L_{+DPO}$	22.1408	0.6860	0.1598	0.5072	4.4582
BSD100	$L_{ESRGAN}$ [3]	23.4732	0.6037	0.1653	0.6956	5.4894
	$L_{+PieAPP}$ [5]	22.8331	0.5874	0.1901	0.4011	5.0825
	$L_{+DPO}$	23.5257	0.5998	0.1696	0.3798	4.6984
Set5	$L_{ESRGAN}$ [3]	26.6157	0.8027	0.0755	0.4486	5.4038
	$L_{+PieAPP}$ [5]	26.0691	0.7881	0.0917	0.3511	4.9664
	$L_{+DPO}$	26.8175	0.7996	0.0748	0.3564	5.3249
Set14	$L_{ESRGAN}$ [3]	23.8251	0.7659	0.1446	0.9212	4.9711
	$L_{+PieAPP}$ [5]	22.8668	0.7521	0.1626	0.5072	4.5064
	$L_{+DPO}$	23.9951	0.7653	0.1432	0.5985	4.6497
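As a quick check, the gains over ESRGAN quoted in the abstract for BSD100 and Set14 can be recomputed from the Table 1 rows. The snippet below is only an arithmetic sketch; the dictionary layout and the key names (`"ESRGAN"`, `"DPO"`, `gains`) are illustrative choices, while the numbers are copied from Table 1.

```python
# Values copied from Table 1 (x4 SR); keys are illustrative names.
TABLE1 = {
    "BSD100": {
        "ESRGAN": {"PSNR": 23.4732, "PieAPP": 0.6956, "NIQE": 5.4894},
        "DPO": {"PSNR": 23.5257, "PieAPP": 0.3798, "NIQE": 4.6984},
    },
    "Set14": {
        "ESRGAN": {"PSNR": 23.8251, "PieAPP": 0.9212, "NIQE": 4.9711},
        "DPO": {"PSNR": 23.9951, "PieAPP": 0.5985, "NIQE": 4.6497},
    },
}


def gains(dataset: str) -> dict:
    """Improvement of the DPO-trained model over ESRGAN, rounded to 2 decimals."""
    base = TABLE1[dataset]["ESRGAN"]
    ours = TABLE1[dataset]["DPO"]
    return {
        "PSNR_gain": round(ours["PSNR"] - base["PSNR"], 2),        # higher is better
        "PieAPP_drop": round(base["PieAPP"] - ours["PieAPP"], 2),  # lower is better
        "NIQE_drop": round(base["NIQE"] - ours["NIQE"], 2),        # lower is better
    }


if __name__ == "__main__":
    for name in TABLE1:
        print(name, gains(name))
```

The rounded differences agree with the figures reported in the abstract (0.05 dB, 0.32, and 0.79 on BSD100; 0.17 dB, 0.32, and 0.32 on Set14).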

Table 2. Performance changes based on how the SR-DPO loss is designed. Equations (4)–(7) represent the DPO losses we designed to introduce DPO to the SR process in Section 3.2, and Equation (7) is our final SR-DPO loss.

Dataset	Loss	PSNR ↑	SSIM ↑	LPIPS ↓	PieAPP ↓	NIQE ↓
DIV2K_valid	Equation (4)	24.0414	0.7183	0.1616	1.0361	4.3884
	Equation (5)	20.7606	0.7073	0.2426	2.0506	5.5587
	Equation (6)	20.3597	0.6494	0.2698	2.0456	4.9085
	Equation (7)	25.9753	0.7409	0.1244	0.2626	4.2588
Urban100	Equation (4)	20.8525	0.6496	0.1966	1.1522	4.2652
	Equation (5)	18.4527	0.6154	0.2761	1.8359	5.1611
	Equation (6)	18.3813	0.5535	0.3031	1.8517	4.7215
	Equation (7)	22.1408	0.6860	0.1598	0.5072	4.4582
BSD100	Equation (4)	22.7156	0.5885	0.2064	1.1245	5.4479
	Equation (5)	20.6213	0.5903	0.2952	1.7535	6.9459
	Equation (6)	20.1085	0.5366	0.3418	1.8276	6.0400
	Equation (7)	23.5257	0.5998	0.1696	0.3798	4.6984
Set5	Equation (4)	24.2179	0.7456	0.1198	0.5111	4.9955
	Equation (5)	21.4562	0.7253	0.1712	0.9298	6.8279
	Equation (6)	19.7723	0.6271	0.2044	1.9173	6.5020
	Equation (7)	26.8175	0.7996	0.0748	0.3564	5.3249
Set14	Equation (4)	22.8015	0.7515	0.1814	1.3275	5.0441
	Equation (5)	20.2181	0.7316	0.2638	2.6234	6.0784
	Equation (6)	18.8591	0.6548	0.3062	1.5138	5.6434
	Equation (7)	23.9951	0.7653	0.1432	0.5985	4.6497

Table 3. Performance changes with reference model in DPO-based SR.

Dataset	Reference Model	PSNR ↑	SSIM ↑	LPIPS ↓	PieAPP ↓
DIV2K_valid	ESRGAN trained with $L_{ESRGAN}$	15.2341	0.5422	0.3866	0.4054
	ESRGAN trained with $L_{1}$	25.9753	0.7409	0.1244	0.2626
	No reference model	25.8533	0.7452	0.1259	0.2547
Urban100	ESRGAN trained with $L_{ESRGAN}$	13.9608	0.4315	0.4033	0.6819
	ESRGAN trained with $L_{1}$	22.1408	0.6860	0.1598	0.5072
	No reference model	22.1086	0.6193	0.1624	0.4574
Set14	ESRGAN trained with $L_{ESRGAN}$	13.3721	0.5037	0.4509	0.5808
	ESRGAN trained with $L_{1}$	23.9951	0.7653	0.1432	0.5985
	No reference model	23.8403	0.7581	0.1498	0.5224

Table 4. Performance when using the LPIPS model for preference calculation.

Dataset	PSNR ↑	SSIM ↑	LPIPS ↓	PieAPP ↓	NIQE ↓
DIV2K_valid	25.9707	0.7446	0.1186	0.5191	4.0304
Urban100	22.1383	0.6866	0.1566	0.7609	4.0996
BSD100	23.9345	0.6150	0.1613	0.6565	4.6466
Set5	26.8093	0.8044	0.0718	0.5903	4.9209
Set14	24.2466	0.7757	0.1337	0.7321	4.4916

Table 5. Performance when increasing the batch size to 8.

Dataset	PSNR ↑	SSIM ↑	LPIPS ↓	PieAPP ↓	NIQE ↓
DIV2K_valid	25.7276	0.7134	0.1282	0.2828	4.1035
Urban100	21.9077	0.6654	0.1601	0.5648	4.0682
BSD100	23.5713	0.6057	0.1716	0.3612	4.9198
Set5	26.6223	0.8011	0.0784	0.3221	4.7340
Set14	23.3602	0.7508	0.1448	0.5263	4.6600
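The batch-size comparison in Section 4.6 can be tallied directly from the tables. The sketch below counts on how many of the five datasets the smaller batch size gives a better PSNR and a better LPIPS; the numbers are copied from the $L_{+DPO}$ rows of Table 1 (batch size 4) and from Table 5 (batch size 8), and the variable and function names are illustrative.

```python
# (PSNR, LPIPS) per dataset; values copied from Table 1 (L_{+DPO}) and Table 5.
BATCH4 = {
    "DIV2K_valid": (25.9753, 0.1244),
    "Urban100": (22.1408, 0.1598),
    "BSD100": (23.5257, 0.1696),
    "Set5": (26.8175, 0.0748),
    "Set14": (23.9951, 0.1432),
}
BATCH8 = {
    "DIV2K_valid": (25.7276, 0.1282),
    "Urban100": (21.9077, 0.1601),
    "BSD100": (23.5713, 0.1716),
    "Set5": (26.6223, 0.0784),
    "Set14": (23.3602, 0.1448),
}


def count_wins_for_small_batch() -> tuple:
    """Number of datasets where batch size 4 beats batch size 8."""
    psnr_wins = sum(BATCH4[d][0] > BATCH8[d][0] for d in BATCH4)   # higher PSNR wins
    lpips_wins = sum(BATCH4[d][1] < BATCH8[d][1] for d in BATCH4)  # lower LPIPS wins
    return psnr_wins, lpips_wins


if __name__ == "__main__":
    psnr_wins, lpips_wins = count_wins_for_small_batch()
    print(f"PSNR: {psnr_wins}/5 datasets, LPIPS: {lpips_wins}/5 datasets")
```

The margins are small on most datasets, which is consistent with the text's reading that the difference is slight.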

Table 6. Comparison with other ESRGAN improvement models.

Dataset	Method	PSNR ↑	SSIM ↑	LPIPS ↓	PieAPP ↓	NIQE ↓
DIV2K_valid	Bicubic	26.6942	0.7663	0.3407	0.5804	7.2218
	Our DPO method	25.9753	0.7409	0.1244	0.2626	4.2588
	Real-ESRGAN [13]	21.8301	0.6209	0.2758	1.6711	3.5661
	StarSRGAN [43]	24.5426	0.7147	0.1464	0.6914	3.0836
	ESRGAN-DP [10]	26.5069	0.7555	0.0912	0.3996	2.8800
	A-ESRGAN [11]	22.7443	0.6550	0.2358	1.3846	3.1801
	MSA-ESRGAN [6]	24.8405	0.7204	0.1799	1.1802	3.5811
	SeD [44]	27.7939	0.7934	0.0751	0.3469	3.1302
Urban100	Bicubic	21.6991	0.6517	0.4205	1.1409	7.1941
	Our DPO method	22.1408	0.6860	0.1598	0.5072	4.4582
	Real-ESRGAN [13]	18.1046	0.5409	0.2666	2.2056	4.2697
	StarSRGAN [43]	20.3558	0.6483	0.1712	1.1067	3.5930
	ESRGAN-DP [10]	22.7345	0.7149	0.1086	0.6753	3.6833
	A-ESRGAN [11]	18.7962	0.5721	0.2451	1.4945	3.5516
	MSA-ESRGAN [6]	21.0561	0.6575	0.1804	1.6531	4.0658
	SeD [44]	24.3847	0.7714	0.0887	0.6458	3.9645
BSD100	Bicubic	24.6507	0.6415	0.4561	0.7388	7.5764
	Our DPO method	23.5257	0.5998	0.1696	0.3798	4.6984
	Real-ESRGAN [13]	21.1057	0.5145	0.3272	2.0294	3.9643
	StarSRGAN [43]	22.7458	0.6093	0.1769	0.9638	3.8182
	ESRGAN-DP [10]	23.9337	0.6397	0.1328	0.6465	3.3965
	A-ESRGAN [11]	21.3103	0.5405	0.2711	1.5606	3.6621
	MSA-ESRGAN [6]	23.5678	0.6121	0.2372	1.2789	4.0619
	SeD [44]	25.0264	0.6638	0.1224	0.5664	3.5617
Set5	Bicubic	26.6902	0.7899	0.3004	0.9857	8.2823
	Our DPO method	26.8175	0.7996	0.0748	0.3564	5.3249
	Real-ESRGAN [13]	21.6163	0.6269	0.2277	1.5581	5.7241
	StarSRGAN [43]	24.9597	0.7691	0.1107	0.7751	4.3798
	ESRGAN-DP [10]	28.3406	0.8287	0.0598	0.3684	4.1061
	A-ESRGAN [11]	21.9001	0.6427	0.1735	0.6849	4.8551
	MSA-ESRGAN [6]	24.3240	0.7403	0.1438	1.4387	6.5172
	SeD [44]	29.3011	0.8511	0.0521	0.3867	5.3606
Set14	Bicubic	24.2384	0.7737	0.3862	0.7347	7.7363
	Our DPO method	23.9951	0.7653	0.1432	0.5985	4.6497
	Real-ESRGAN [13]	20.3295	0.5336	0.2977	2.7718	4.2195
	StarSRGAN [43]	23.3203	0.6298	0.1732	1.0428	3.8000
	ESRGAN-DP [10]	24.4899	0.6859	0.1158	0.7899	3.3932
	A-ESRGAN [11]	20.7806	0.6487	0.2367	1.7234	3.6678
	MSA-ESRGAN [6]	23.3530	0.7386	0.1979	1.3044	4.5716
	SeD [44]	25.4956	0.8026	0.0969	0.6217	3.8461
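The claim in Section 4.7 that the DPO method gives the best PieAPP values can be verified from the PieAPP column of Table 6. The snippet below is a small sketch of that check; the values are copied from Table 6, and the data layout and function name are illustrative.

```python
METHODS = [
    "Bicubic", "Our DPO method", "Real-ESRGAN", "StarSRGAN",
    "ESRGAN-DP", "A-ESRGAN", "MSA-ESRGAN", "SeD",
]

# PieAPP values (lower is better) copied from Table 6, in the order of METHODS.
PIEAPP = {
    "DIV2K_valid": [0.5804, 0.2626, 1.6711, 0.6914, 0.3996, 1.3846, 1.1802, 0.3469],
    "Urban100": [1.1409, 0.5072, 2.2056, 1.1067, 0.6753, 1.4945, 1.6531, 0.6458],
    "BSD100": [0.7388, 0.3798, 2.0294, 0.9638, 0.6465, 1.5606, 1.2789, 0.5664],
    "Set5": [0.9857, 0.3564, 1.5581, 0.7751, 0.3684, 0.6849, 1.4387, 0.3867],
    "Set14": [0.7347, 0.5985, 2.7718, 1.0428, 0.7899, 1.7234, 1.3044, 0.6217],
}


def best_pieapp_method(dataset: str) -> str:
    """Return the method with the lowest PieAPP value on the given dataset."""
    values = PIEAPP[dataset]
    best_index = min(range(len(values)), key=lambda i: values[i])
    return METHODS[best_index]


if __name__ == "__main__":
    for dataset in PIEAPP:
        print(dataset, "->", best_pieapp_method(dataset))
```

The same pattern can be applied to the other columns of Table 6 to see where SeD or ESRGAN-DP lead instead.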

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Yun, W.; Park, H. DPO-ESRGAN: Perceptually Enhanced Super-Resolution Using Direct Preference Optimization. Electronics 2025, 14, 3357. https://doi.org/10.3390/electronics14173357

