1. Introduction
Single Image Super-Resolution (SISR) is an important branch of image generation and transformation tasks, which aims to recover a High-Resolution (HR) image from a corresponding Low-Resolution (LR) image [1,2,3]. The inherent symmetry of images, including global structural symmetry and local texture symmetry, serves as a strong prior that constrains the detailed reconstruction of HR images and effectively alleviates detail distortion in the SR process. Given its wide application in scenarios requiring high-fidelity details, this task has consistently been a vital research topic in computer vision. Among reconstruction techniques (such as image reconstruction, signal reconstruction, and 3D reconstruction), early conventional methods, primarily represented by interpolation and sparse representation, often suffered from detail loss, sensitivity to noise, and poor model generalization. These limitations made it difficult to meet the demanding requirements of high resolution, high fidelity, and complex scenes. With the advent of deep learning, CNN-based image reconstruction achieved a significant leap in quality [4,5,6,7,8]. Broadly, deep reconstruction methods can be classified into four categories: PSNR-driven methods, GAN-based methods, flow-based methods, and diffusion-based methods.
PSNR-driven methods [9,10,11] optimize the quality of pixel-level reconstruction by minimizing the Mean Squared Error (MSE) between the super-resolved image and the ground truth image, thereby enhancing the Peak Signal-to-Noise Ratio (PSNR) [12]. However, they frequently produce overly smooth images that deviate from human visual perception.
GAN-based methods [13,14,15] combine content loss with adversarial loss to generate images with sharp edges and fine textures, mitigating the detail loss inherent in PSNR-driven approaches. Nevertheless, they face challenges such as mode collapse, training instability, and structural artifacts [16].
Flow-based methods [17,18] utilize reversible neural networks to encode the HR image into a latent space and perform reconstruction via an inverse transformation, achieving a balance between natural detail and diversity. Although these models strike a better balance between realism and visual quality, they still grapple with high computational complexity, slow training and inference, and large memory consumption [16].
Diffusion Probabilistic Models (DPMs) [19] have exhibited exceptional performance in the SISR task, demonstrating immense potential in this domain and offering a new paradigm for subsequent research. DPMs rely on a Markov chain-based noise injection mechanism that transforms data into latent variables with a Gaussian distribution, followed by a symmetric reverse denoising process that reconstructs the data. Optimized through the Evidence Lower Bound (ELBO), DPMs ensure high-quality sample generation and training stability, which to some extent alleviates the intrinsic shortcomings of GANs, such as lack of diversity, mode collapse, and training instability due to adversarial training, while also exhibiting stronger distribution coverage and generalization. However, their sampling process typically requires hundreds or even thousands of iterations, resulting in slow inference. For instance, SR3 [20] demands as many as 2000 sampling steps to yield images of satisfactory quality.
To achieve fast and low-cost sampling, some methods employ residual modeling instead of natural-image modeling, focusing on the difference between the HR and LR images. As shown in Figure 1, modeling the residual image requires fewer computational resources and enables faster training, allowing the model to avoid quality degradation with fewer denoising steps. For example, SRDiff [21] uses the bilinearly upsampled LR image as a conditional guide for the diffusion model; DVSR [22] employs a randomly initialized CNN to recover the missing information, thereby alleviating the modeling burden; and ResDiff [7] introduced a simple pre-trained CNN and decomposed the residual information into high-frequency and low-frequency components, further reducing the diffusion model's load.
The above methods [7,21,22] adopt residual-based diffusion models, where the denoising network only needs to model the residual. Compared to traditional diffusion models, which must process a large volume of natural-image information, residual diffusion models significantly reduce the processing burden on the denoising network, thereby accelerating convergence and lowering computational cost. Further alleviating the burden on the denoising network is crucial for accelerating generation and improving sample quality. Inspired by [23,24], we observe that images with lower entropy typically exhibit higher self-similarity and more regular structural textures, suggesting they contain less information and possess simpler data distribution characteristics, which significantly reduces the difficulty of reconstruction. Experimental data demonstrate that data distributions with lower entropy are easier for generative models to capture, and as the entropy decreases, the residual distribution becomes more concentrated, as shown in Figure 2. This key property provides a vital theoretical basis for the denoising network in the entropy subtraction-supported residual-diffusion framework proposed in this paper: fast and low-cost sampling can be achieved by modeling low-entropy residuals.
In this paper, we propose a Conditional Diffusion Model (ESRDF) based on a residual architecture and introduce an Entropy Matching Loss. This loss is applied to the CNN-based predictor in the main path to focus on the overall structural alignment of regional information, constraining the image information along the entropy dimension to drive the convergence of the LR image toward the HR image. By incorporating a patch-based mechanism to preserve the overall spatial structure, the model yields residuals with lower entropy. This mechanism reduces the training cost and processing burden on the branch denoising network, guaranteeing reconstruction quality within a limited number of denoising steps and thus avoiding the over-smoothing artifact.
Experiments on the face dataset (FFHQ) and two general datasets (DIV2K and Urban100) demonstrate that ESRDF not only achieves fast and low-cost sampling but also generates finer images.
The contributions of this paper are summarized as follows:
Sampling Efficiency: We propose a diffusion model, ESRDF, based on a residual structure to address the SISR problem. Compared to other diffusion methods, this model achieves a low-cost and fast sampling process, specifically requiring fewer denoising steps and demonstrating faster convergence speed.
Perceptually Consistent Output: Compared to other diffusion methods, ESRDF exhibits lower perception-based evaluation scores, indicating that our method generates samples that are more consistent with human perception.
Superior Generation Quality: Compared to other diffusion methods, our method is simpler. By introducing the Entropy Matching Loss in the main path, we enhance the low-frequency global information, allowing the branch diffusion model to focus on high-frequency detail capture and achieve superior generation results.
2. Related Works
The objective of SISR is to recover an HR image from a corresponding LR image. Currently, the academic community has proposed various deep learning techniques to address this reconstruction task, which can be broadly classified into four categories: PSNR-driven methods, GAN-based methods, flow-based methods, and diffusion-based methods. Notably, symmetry priors play a crucial role in all these deep reconstruction methods: for example, symmetric constraints are incorporated into PSNR-driven models to enhance the rationality of reconstructed structures, symmetric discriminators are utilized in GAN-based methods to improve the consistency of texture details, and symmetric sampling strategies are adopted in diffusion-based models to optimize the global fidelity of images. The first three categories can be collectively referred to as non-diffusion methods.
2.1. Non-Diffusion-Based Methods
PSNR-driven methods [4,9,10,11,25] leverage pixel-level optimization to minimize the Mean Squared Error (MSE) between the super-resolved image reconstructed from the LR image and the true HR image, thereby improving the Peak Signal-to-Noise Ratio (PSNR). PSNR is a widely used image quality metric defined in terms of the MSE. By minimizing the MSE, PSNR-driven methods effectively reduce the pixel-wise difference between the reconstructed image and the ground truth, thus enhancing visual quality. SRCNN [9], the first model to achieve an end-to-end mapping from LR to HR images, serves as a representative PSNR-driven method. Building upon SRCNN, several subsequent methods [10,11,25] have effectively improved SR performance by refining network architectures and loss functions. However, PSNR-driven methods are limited to pixel-level optimization and often generate overly smooth images, deviating from human visual perception. This is because, in high-dimensional prediction, these methods capture the median or mean of the entire solution space, which may be insufficient for estimating any specific solution [26].
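As a concrete illustration (not taken from the paper), PSNR is computed directly from the MSE between a reconstruction and its ground truth; a minimal NumPy sketch:

```python
import numpy as np

def psnr(hr: np.ndarray, sr: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB, derived from the pixel-wise MSE."""
    mse = np.mean((hr.astype(np.float64) - sr.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# A uniform error of 1 gray level gives MSE = 1, i.e. 20*log10(255) ≈ 48.13 dB.
hr = np.zeros((8, 8))
sr = np.ones((8, 8))
print(round(psnr(hr, sr), 2))
```

Minimizing the MSE term is exactly what pushes this metric up, which is why PSNR-driven training tends toward the mean of plausible solutions.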
GAN-based methods [14,15,27], by combining content loss with adversarial loss, can generate high-quality images featuring sharp edges and fine textures, effectively mitigating the detail loss caused by the over-smoothing of traditional PSNR-driven methods. As a landmark study, SRGAN [13], built upon the SRResNet architecture, integrates perceptual loss and adversarial loss within a Generative Adversarial Network framework, significantly enhancing the visual realism of reconstructed images. Another influential work, ESRGAN [14], further improves the high-frequency details and texture authenticity of SR images by introducing residual-in-residual dense blocks and a relativistic discriminator. Nevertheless, GANs suffer from the inherent flaw of mode collapse during training, which significantly limits the diversity of SR-generated samples; these models are also highly unstable to train and often produce unexpected structural artifacts in the reconstructed images, thereby compromising visual quality [28].
Flow-based methods [17,18] utilize reversible neural networks to construct a mapping from LR images to HR images, enabling diverse reconstruction results while preserving natural details. SRFlow [17], a representative method based on normalizing flows, employs a reversible neural network architecture to encode the HR image into a latent-space representation, which is then decoded and reconstructed via an inverse transformation. By constructing a bidirectional mapping between its encoder and decoder, this model effectively alleviates the instability issues encountered in training traditional models. Although flow-based SR techniques excel in diversity and physical modeling, the high computational complexity and memory requirements introduced by reversible networks remain a challenge. In terms of reconstruction results, while they balance diversity and physical plausibility, detail generation can be either excessive or insufficient, and their adaptability to different types of images is limited, especially in complex scenes where distortion is likely to occur, necessitating optimization with prior knowledge [29].
2.2. Diffusion-Based Methods
DPMs are a class of generative modeling methods that transform Gaussian noise into a target data distribution through a stepwise denoising Markov chain, a process symmetric to the forward corruption. This endows diffusion models with the inherent advantages of high-quality sampling, broad mode coverage, and strong sample diversity, but it also makes fast and low-cost sampling challenging. In SISR, SR3 [20] and SRDiff [21] are pioneering diffusion-based works; both utilize conditional diffusion. However, the two methods differ in their modeling approach: SR3 directly models the natural image using the LR image as the condition, while SRDiff utilizes a pre-trained encoder to extract features from the LR image as the condition for modeling the residual image. To achieve fast and low-cost sampling, researchers have further explored residual-based diffusion models. For instance, DVSR [22] employs a structurally simple CNN to first perform deterministic estimation, recovering the majority of the image information, followed by refinement using a diffusion model, thereby constructing a prediction-refinement conditional diffusion framework. Subsequently, ResDiff [7] further optimized this framework: its architecture combines a pre-trained CNN with a conditional diffusion model and decomposes the residual information into high-frequency and low-frequency components, thus precisely guiding the noise predictor and further enhancing perceptual evaluation metrics.
Specifically, ResDiff utilizes a CNN-based pre-trained pixel predictor to recover the main low-frequency content of the image and introduces a frequency-domain-based loss function to enhance the recovery effect. Building upon this, the conditional diffusion model focuses on predicting high-frequency residual information. Frequency-domain guidance enables the noise predictor to more effectively learn high-frequency detail generation. This design not only accelerates the model’s convergence speed but also significantly improves the quality of the generated images.
However, the sampling efficiency of diffusion models is relatively low because they require traversing the entire high-dimensional network multiple times in both the forward and reverse directions, resulting in immense computational overhead. The core contribution of this framework lies in its ability to utilize the model’s computational resources more efficiently while avoiding the difficulties associated with the direct modeling of complex images. Furthermore, the deterministic estimation provided by the CNN-based pixel predictor offers a good initial condition for the diffusion model, thus reducing the burden on the diffusion model, decreasing the required denoising steps, and consequently improving its efficiency and effectiveness.
2.3. Entropy Subtraction Supported Residual-Diffusion Framework
This paper proposes an entropy subtraction-supported residual-diffusion framework, which aims to integrate the efficiency of residual-driven diffusion models with entropy-based detail optimization strategies to address two core issues faced by diffusion models in image SR tasks, namely over-smoothing and low sampling efficiency.
Currently, to improve the sampling efficiency of diffusion models, the academic community has proposed various SR reconstruction methods based on residual diffusion models [7,30]. Traditional diffusion models directly model complete HR images, and the information space they must restore contains not only substantial redundant background information but also high-entropy high-frequency details. In contrast, residual diffusion methods perform diffusion modeling only on the residual between LR and HR images (i.e., the high-frequency detail component that LR images lack). This design essentially reduces the information entropy that the model needs to process during the restoration stage.
In addition, in related research on entropy-oriented optimization aimed at enhancing reconstruction quality, Xu et al. [24] analyzed the essence of the over-smoothing phenomenon in SR tasks from the perspective of data characteristics. They pointed out that the higher the information entropy of HR images, the more significant the deviation between the center of the model's optimization objective and the true distribution of clean images, which ultimately leads to the lack of texture details in generated images and to over-smoothing. To address this issue, they designed DECLoss, a loss function that reduces the entropy of the data distribution, thereby effectively suppressing over-smoothing and further improving reconstruction quality.
Building on the discussion above, our statistical experiments further show that the lower the information entropy of the residual image processed by the diffusion model, the smaller the corresponding Wasserstein distance, the more stable the training-loss fluctuations, and the better the PSNR reconstruction metric, as shown in Figure 3. These experimental results verify the effectiveness and rationality of the proposed method.
3. Diffusion Model
Denoising Diffusion Probabilistic Models (DDPMs [19]) are generative models based on Markov chain theory, built on a chain of latent variables $x_1, \dots, x_T$. The core idea is to gradually add Gaussian noise $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ to the data $x_0$, transforming it into pure noise $x_T$, and then progressively recover the original data through a reverse process until the desired data matching the source distribution is generated. Therefore, the DDPM is composed of two phases: the forward diffusion process and the reverse denoising process [31].
Forward diffusion process: Given an image $x_0 \sim q(x_0)$, the forward diffusion process is defined as a Markov chain $q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$. In this process, Gaussian noise controlled by the variance coefficient $\beta_t$ is continuously added to $x_{t-1}$. Specifically, each step of the diffusion process can be expressed as:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big), \tag{1}$$

where the data $x_T$ approaches pure noise at time step $T$; through the Markov property, we can obtain the entire forward noising process from $x_0$ to $x_t$:

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big), \tag{2}$$

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, which can be obtained through reparameterization:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I}). \tag{3}$$
Reverse denoising process: $x_T$ has been obtained through the forward diffusion process. The reverse denoising process is defined as a reverse Markov chain $p_\theta(x_{0:T}) = p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$. In this process, starting from the noise $x_T \sim \mathcal{N}(0, \mathbf{I})$ and progressively removing noise to recover the original data $x_0$, it involves learning the conditional distribution:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 \mathbf{I}\big), \tag{4}$$

where $p_\theta(x_{t-1} \mid x_t)$ is a Gaussian distribution with mean $\mu_\theta(x_t, t)$ and standard deviation $\sigma_t$, and $\theta$ denotes the learnable parameters of the noise prediction model $\epsilon_\theta$, which is trained using neural networks, such as U-Net, to progressively perform denoising.
During training, to optimize the model parameters, we minimize a variant of the ELBO, which can be written simply as:

$$L_{\text{simple}}(\theta) = \mathbb{E}_{x_0, \epsilon, t}\Big[\big\| \epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\big) \big\|^2\Big], \tag{5}$$

where $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ is randomly sampled Gaussian noise, and $\epsilon_\theta(\cdot)$ is the output of the noise prediction model, which must approximate the noise $\epsilon$ at the current time step as closely as possible.
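Training draws the noisy sample $x_t$ directly via the reparameterized forward process; a minimal NumPy sketch of that sampling (a standard linear β-schedule is assumed here, not the paper's exact configuration):

```python
import numpy as np

def make_schedule(T=1000, beta_1=1e-4, beta_T=0.02):
    """Linear beta-schedule with alpha_t = 1 - beta_t and alpha_bar_t = prod(alpha)."""
    betas = np.linspace(beta_1, beta_T, T)
    alphas = 1.0 - betas
    return betas, np.cumprod(alphas)

def q_sample(x0, t, alpha_bars, rng):
    """Reparameterized forward step: x_t = sqrt(a_bar_t)*x0 + sqrt(1-a_bar_t)*eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

betas, alpha_bars = make_schedule()
rng = np.random.default_rng(0)
x0 = np.ones((4, 4))
x_t = q_sample(x0, 999, alpha_bars, rng)  # near t = T the signal is almost gone
print(alpha_bars[0] > 0.999, alpha_bars[-1] < 1e-4)
```

Because alpha_bar decays toward zero, the same `q_sample` call gives nearly clean data at small t and nearly pure noise at large t, which is what the loss above averages over.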
When the noise prediction model is trained, we can generate data starting from $x_T \sim \mathcal{N}(0, \mathbf{I})$. The mean and variance of the distribution $p_\theta(x_{t-1} \mid x_t)$ in Equation (4) can then be calculated at each time step until the image $x_0$ is obtained.
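To make the reverse update concrete, here is a hedged NumPy sketch of one ancestral sampling step; `dummy_eps` is a stand-in for the trained U-Net, not the paper's network:

```python
import numpy as np

def ddpm_step(x_t, t, eps_model, betas, alpha_bars, rng):
    """One reverse update x_t -> x_{t-1} with the standard DDPM posterior mean."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    eps_hat = eps_model(x_t, t)
    mean = (x_t - beta_t / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_t)
    if t == 0:
        return mean                      # the final step adds no fresh noise
    return mean + np.sqrt(beta_t) * rng.standard_normal(x_t.shape)

T = 10
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
dummy_eps = lambda x, t: np.zeros_like(x)   # placeholder noise predictor
x = rng.standard_normal((4, 4))             # start from pure Gaussian noise
for t in reversed(range(T)):
    x = ddpm_step(x, t, dummy_eps, betas, alpha_bars, rng)
print(x.shape)
```

The loop structure is why sampling cost scales with the number of denoising steps, the bottleneck that motivates the residual design in Section 4.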
A primary distinction between diffusion models and earlier generative models is that diffusion models execute dynamically across iterative time steps, covering both forward and reverse processes. This dynamic nature endows them with greater flexibility and adaptability when generating complex data distributions.
4. Methodology
4.1. Residual Framework in SISR
DPMs have demonstrated exceptional capabilities in image synthesis [32] and image restoration [33,34], and have shown immense potential in the SISR task. For instance, existing diffusion-based methods (e.g., SR3) directly generate the HR image from random noise, guided by the LR image as a condition. This conventional diffusion framework can be formulated as:

$$x_{\text{SR}} = \text{Unet}\big(x_t,\ t,\ x_{\text{up}}\big), \tag{7}$$

where Unet [35] indicates the network architecture of the diffusion model, $x_t$ represents a natural image progressively corrupted by Gaussian noise until it approximates pure Gaussian noise, and $t$ represents the corresponding time step. $x_{\text{up}}$ is the upsampled image (e.g., bilinear or bicubic), which is used as a conditional guide.
However, natural images contain a vast amount of information and have high requirements for precise restoration. In diffusion models with traditional architectures, each step requires the processing of a large amount of information. To address this issue, residual-based diffusion models provide a solution.
As illustrated in Figure 4, in the research on conditional diffusion models, residual-based diffusion models have advantages over traditional architectures that model natural images directly. The goal of both is the same, namely to obtain a large-size SR output image; the difference lies in the architecture.
The residual-based diffusion models are developed from the traditional architecture, as shown in Equation (8), and employ a dual-branch strategy. The main path is responsible for the preliminary restoration of the image, generating an intermediate result close to the target, while the denoising network of the branch path only needs to model the residual. Finally, by adding the outputs of the two paths, the SR image is obtained. This architecture can be briefly described as:

$$x_{\text{SR}} = \text{CNN}\big(x_{\text{up}}\big) + \text{Unet}\big(x_t^{\text{res}},\ t,\ x_{\text{up}}\big), \tag{8}$$

where $x_t^{\text{res}}$ represents a residual image to which Gaussian noise has been added until saturation.
However, some studies [7,22] have shown that by introducing a CNN-based predictor in the main path, the model can complete most of the computational work in a single pass, which further reduces the burden of residual modeling during denoising. This enables the denoising network to achieve good reconstruction results even when the number of denoising steps is limited.
Overall, the residual-based diffusion model has higher sampling efficiency because residual modeling can reduce the computational cost of the denoising network during sampling.
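The dual-branch composition can be sketched as follows; `cnn_predict` and `diffusion_residual` are hypothetical stand-ins for the main-path CNN and the branch denoiser, and nearest-neighbor upsampling replaces bicubic for brevity:

```python
import numpy as np

def upsample_nearest(lr: np.ndarray, scale: int) -> np.ndarray:
    """Cheap stand-in for bicubic upsampling of a (H, W) image."""
    return np.kron(lr, np.ones((scale, scale)))

def residual_sr(lr, scale, cnn_predict, diffusion_residual):
    """Residual framework: SR = main-path prediction + branch-path residual."""
    x_up = upsample_nearest(lr, scale)
    x_main = cnn_predict(x_up)          # coarse restoration (most of the content)
    r_hat = diffusion_residual(x_up)    # low-entropy residual from the denoiser
    return x_main + r_hat

lr = np.random.default_rng(0).random((8, 8))
sr = residual_sr(lr, 4,
                 cnn_predict=lambda x: x,                   # identity placeholder
                 diffusion_residual=lambda x: np.zeros_like(x))
print(sr.shape)  # (32, 32)
```

The point of the split is visible in the signature: the denoiser only has to supply `r_hat`, a far smaller quantity of information than the full image.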
4.2. Motivation
Revisiting the residual-based diffusion models from the perspective of entropy, we can understand why those methods can perform sampling more efficiently.
Entropy in image distribution: Entropy is a metric of the uncertainty and randomness of a system. In natural images, pixel intensities are spread relatively evenly, whereas residual pixel values follow a sharper, more concentrated distribution, resulting in a substantial decrease in entropy.
Entropy’s effect on reconstruction: Relevant research [23,24] indicates that lower entropy typically corresponds to higher self-similarity and more regular textural features, implying a simpler distribution. Conversely, higher entropy values signify greater image complexity or irregularity, characterized by increased randomness, as shown in Figure 2. It can be concluded that:
Lower-entropy data distributions are more easily captured and learned by generative models.
Compared with traditional interpolation methods, a CNN-based pixel predictor yields residual images with lower entropy.
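As an illustration of this entropy comparison (a sketch under synthetic data, not the paper's measurement protocol), Shannon entropy can be estimated from an intensity histogram; the spread-out "natural-like" data scores higher than the concentrated "residual-like" data:

```python
import numpy as np

def image_entropy(img, bins=256, value_range=(0.0, 256.0)):
    """Shannon entropy (bits) of the pixel-intensity histogram."""
    hist, _ = np.histogram(img, bins=bins, range=value_range)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

rng = np.random.default_rng(0)
natural_like = rng.uniform(0, 255, (64, 64))        # intensities spread over the range
residual_like = 128 + rng.normal(0, 3, (64, 64))    # sharply concentrated residual
print(image_entropy(natural_like) > image_entropy(residual_like))
```

A narrower histogram concentrates probability mass in few bins, which is exactly what drives the entropy down for residual images.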
Motivation: In the residual-based diffusion models, we design a CNN-based predictor and introduce an entropy-based loss function. This design lowers the entropy of the residual that must be modeled.
As shown in the lavender part of Figure 4, by enhancing the initial CNN's ability to recover low-entropy content, residual images with lower entropy can be obtained; the diffusion model on the branch path then bears only an extremely low burden, achieving efficient sampling. Formally, the optimization objective is expressed as:

$$\min_\theta\ \mathbb{E}_{\epsilon, t}\Big[\big\| \epsilon - \epsilon_\theta\big(x_t^{\text{res}},\ t,\ x_{\text{up}}\big) \big\|^2\Big], \qquad x^{\text{res}} = x_{\text{HR}} - \text{CNN}\big(x_{\text{up}}\big). \tag{9}$$
Unlike SR3, which models natural images, ESRDF employs residuals processed by a CNN, whose image entropy is lower than that of natural images. Statistics on FFHQ validate this conclusion, as shown in Figure 3, which reflects the differences between residual images and natural images. As can be seen from the left subfigure, residual images have lower entropy values and smaller Wasserstein distances than natural images. As can be seen from the right subfigure, the PSNR curves have a higher starting point at the beginning of optimization and maintain a performance advantage over the comparison models throughout the entire iteration process. It can thus be concluded that using residual features with lower entropy helps achieve smaller Wasserstein distances, thereby effectively improving the model's reconstruction performance.
4.3. ESRDF
ESRDF is derived from the residual-based diffusion models. In its structural design, the main path employs an initial CNN, while the denoising network of the branch path only needs to model the residual part.
To reduce the entropy of the residual before it is fed into the denoising network of the ESRDF branch, this paper proposes an entropy-matching loss, denoted as $\mathcal{L}_{\text{entropy}}$, and combines it with a perceptual (feature reconstruction) loss $\mathcal{L}_{\text{feat}}$ [12] and a pixel-wise loss $\mathcal{L}_{\text{pixel}}$.
Entropy matching Loss $\mathcal{L}_{\text{entropy}}$. Inspired by [36,37,38], we design a histogram loss that measures the similarity between the histograms of the generated image and the target image by minimizing the KL divergence between their patches. Specifically, we employ a differentiable histogram: Gaussian kernel weights map the intensity of each pixel continuously to all histogram bins, rather than assigning it only to the nearest bin. This continuous allocation not only brings the pixel-intensity distribution of the generated image closer to that of the target image, but also aligns the uncertainty (i.e., entropy) of the two distributions, thereby achieving tighter information-entropy alignment between the generated and target images:

$$\mathcal{L}_{\text{entropy}} = \sum_{p} D_{\text{KL}}\big(h(y_p)\,\big\|\,h(\hat{y}_p)\big) = \sum_{p}\sum_{b} h(y_p)_b \log \frac{h(y_p)_b + \epsilon_1}{h(\hat{y}_p)_b + \epsilon_2}, \tag{10}$$

where $\hat{y}_p$ and $y_p$ denote the SR image patches and ground-truth image patches, respectively, and $\epsilon_1$ and $\epsilon_2$ are numerical stability constants. Given the asymmetry of the KL divergence, the calculation direction uses the histogram $h(y_p)$ of the ground-truth image as the reference distribution to measure the discrepancy of the SR histogram $h(\hat{y}_p)$ from this reference, constraining the reconstructed distribution to align with the ground-truth distribution. $h(\cdot)$ is a function that maps an input image $x$ to its histogram distribution, with the detailed procedure given below:
1. Dynamic generation of bin centers: The bin centers $c$ are bins points uniformly distributed within the range $[0, 1]$, that is:

$$c_i = \frac{i}{\text{bins} - 1}, \quad i = 0, 1, \dots, \text{bins} - 1, \tag{11}$$

where $c$ denotes the set of bin centers and bins represents the total number of bins.
2. Gaussian kernel weight calculation: For each pixel value $x$, the Gaussian kernel weight with respect to each bin center is calculated as:

$$w_i(x) = \exp\!\left(-\frac{(x - c_i)^2}{2\sigma^2}\right), \tag{12}$$

where $w$ denotes the Gaussian kernel weight between the pixel value $x$ and the bin center $c_i$, and $\sigma$ is the standard deviation of the Gaussian kernel.
3. Accumulation and normalization: For each bin in each channel, accumulate the weights of all pixels, then normalize the histogram for each channel:

$$A_{j,b} = \sum_{m=1}^{H}\sum_{n=1}^{W} w_b\big(x_{j,m,n}\big), \qquad h_{j,b} = \frac{A_{j,b}}{\sum_{b'=1}^{\text{bins}} A_{j,b'}}, \tag{13}$$

where $A_{j,b}$ denotes the accumulated weight for bin $b$ in channel $j$, $h_{j,b}$ is the normalized histogram value for bin $b$ in channel $j$, $H$ and $W$ are the height and width of the image, and bins is the total number of bins. In conclusion, this per-channel computation serves as the concrete implementation of the global histogram mapping function $h(\cdot)$; the two have an overall-to-local correspondence.
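The three steps above (bin-center generation, Gaussian-kernel soft assignment, per-channel normalization) followed by the KL comparison can be sketched in NumPy; the `sigma`, `bins`, and `eps` values here are illustrative choices, not the paper's settings:

```python
import numpy as np

def soft_histogram(img: np.ndarray, bins: int = 32, sigma: float = 0.02) -> np.ndarray:
    """Differentiable histogram: every pixel in [0, 1] is spread over all bins
    with Gaussian kernel weights, then normalized per channel.
    img has shape (C, H, W); returns (C, bins)."""
    centers = np.linspace(0.0, 1.0, bins)        # step 1: uniform bin centers
    d = img[..., None] - centers                 # (C, H, W, bins) distances
    w = np.exp(-(d ** 2) / (2.0 * sigma ** 2))   # step 2: Gaussian kernel weights
    acc = w.sum(axis=(1, 2))                     # step 3: accumulate per channel
    return acc / acc.sum(axis=1, keepdims=True)  # per-channel normalization

def entropy_matching_loss(sr: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """KL(h(gt) || h(sr)): the ground-truth histogram is the reference distribution."""
    h_gt = soft_histogram(gt) + eps
    h_sr = soft_histogram(sr) + eps
    return float(np.sum(h_gt * np.log(h_gt / h_sr)))

rng = np.random.default_rng(0)
gt = rng.random((3, 16, 16))
print(entropy_matching_loss(gt, gt) < 1e-6)              # identical images -> ~0
print(entropy_matching_loss(np.clip(gt + 0.2, 0, 1), gt) > 0)  # mismatch -> positive
```

Because every pixel contributes a smooth weight to every bin, the histogram (and hence the loss) remains differentiable with respect to pixel values, which is what allows it to train the main-path CNN.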
Feature reconstruction Loss $\mathcal{L}_{\text{feat}}$ is represented as the squared Euclidean distance between the feature representations of two images:

$$\mathcal{L}_{\text{feat}} = \frac{1}{C_j H_j W_j}\,\big\| \phi_j(\hat{y}) - \phi_j(y) \big\|_2^2, \tag{14}$$

where $\phi$ denotes a third-party evaluation network (e.g., VGG16), and $\phi_j(x)$ indicates the activation of the $j$-th layer. If the $j$-th layer is a convolutional layer, $\phi_j(x)$ is a feature map of shape $C_j \times H_j \times W_j$, corresponding to the number of channels, height, and width, respectively.
Pixel Loss $\mathcal{L}_{\text{pixel}}$ is the loss in the image spatial domain, defined as the absolute difference between the pixel values of the true image and the predicted image:

$$\mathcal{L}_{\text{pixel}} = \frac{1}{HW}\sum_{m,n}\big|\,y_{m,n} - \hat{y}_{m,n}\,\big|, \tag{15}$$

where $|\cdot|$ denotes the absolute value.
Therefore, the overall loss consists of the following components:

$$\mathcal{L} = \mathcal{L}_{\text{pixel}} + \lambda_{1}\,\mathcal{L}_{\text{feat}} + \lambda_{2}\,\mathcal{L}_{\text{entropy}}, \tag{16}$$

where $\lambda_1$ and $\lambda_2$ are weighting coefficients balancing the three terms.
4.4. Structure and Training
The main path of ESRDF employs a pre-trained CNN to process LR images, reconstructing the primary image information and laying the groundwork for the branch denoising network. Inspired by [4,39,40], the main-path predictor adopts a simple structure, as shown in Figure 5.
4.5. Discussion
Wasserstein-Angle: Shorter Is Better
The Wasserstein distance (W-distance) quantifies the discrepancy between two probability distributions by the minimum “transport cost” required to reshape one into the other [41]. From this vantage point, turning a residual distribution into Gaussian noise is far cheaper than forcing a natural-image distribution to become Gaussian: the residual distribution lies geometrically much closer to the Gaussian target [42]. Consequently, the W-distance between residuals and Gaussian noise is markedly smaller, as shown in Figure 3.
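For 1D empirical distributions, the Wasserstein-1 distance reduces to the mean absolute difference of sorted samples; the sketch below (synthetic data, not the paper's measurement) compares residual-like and natural-like pixel distributions against a Gaussian target:

```python
import numpy as np

def wasserstein_1d(a, b):
    """W1 between equal-size 1D empirical samples via the sorted-quantile coupling."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

rng = np.random.default_rng(0)
gauss = rng.standard_normal(10_000)                 # the diffusion target
residual_like = 0.1 * rng.standard_normal(10_000)   # concentrated around zero
natural_like = rng.uniform(-5, 5, 10_000)           # spread-out pixel intensities

# Residuals are far cheaper to "transport" into Gaussian noise.
print(wasserstein_1d(residual_like, gauss) < wasserstein_1d(natural_like, gauss))
```

The sorted-sample form makes the "transport cost" intuition literal: each quantile of one distribution is matched to the corresponding quantile of the other.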
5. Experiments
In this section, to verify the effectiveness of the proposed ESRDF model comprehensively, we first introduce the experimental setup, including the datasets, model configuration, and training and inference details. We then report the experimental results and provide both quantitative and qualitative analysis.
5.1. Experimental Settings
Datasets & Metrics. This study evaluates the ESRDF model systematically on three benchmark datasets. For face SR, we build an evaluation benchmark on the FFHQ dataset [43]. Following ResDiff, HR images are down-sampled with a bicubic kernel; 5000 of them are held out for testing and the rest are used for training.
For general SR, we adopt two standard datasets (DIV2K [44] and Urban100 [45]). Following ResDiff, training patches of 160 × 160 are first cropped from the HR images and then down-sampled with a bicubic kernel. At test time, the original HR images are down-sampled and fed to the model to produce SR results. We evaluate on the official 100-image DIV2K test split and randomly reserve 20 images from Urban100 as the test set.
We conduct a multi-dimensional performance analysis combining distortion metrics (PSNR and SSIM [46]) with the perceptual quality metric FID [47,48].
Baselines. As a baseline model, DDPM learns the reverse mapping of noise through a U-Net. After modification, it uses only the input image as conditional guidance to gradually recover data from pure noise for generation tasks.
Training & Evaluation Settings. The lightweight CNN is pre-trained for only 1k iterations with batch size 96. The branch denoiser is a DDPM-style U-Net with 64 initial channels, two residual convolutions per block, dropout 0.0, and U-Net depth multipliers of 1, 2, 4, 8, 8. For diffusion training, we use the AdamW optimizer with batch size 80, a linear β-schedule, and L1 loss. The whole procedure finishes in 100k steps on a single NVIDIA GeForce RTX 3090 graphics card (NVIDIA Corporation, Santa Clara, CA, USA). The specific configuration can be accessed via the public link provided at the end of the paper.
5.2. Performance
In this subsection, we evaluate ESRDF by comparing it with several advanced SR methods on face SR and general SR tasks. The specific results are as follows, where the bolded values represent the optimal values for each evaluation metric, and the “-” symbol indicates that the performance of the corresponding method on some of the used datasets was not explicitly mentioned.
Face SR. We verify the performance of ESRDF on the face restoration task using the FFHQ dataset, as shown in Table 1. Partial results from ESRDF and other reconstruction approaches are displayed in Figure 6 (left).
General SR. To verify the performance of ESRDF on high-difficulty image restoration tasks over general-purpose datasets, we select DIV2K and Urban100 and conduct the corresponding image SR tests, as shown in
Table 2. Partial results from ESRDF and other reconstruction approaches are shown in
Figure 6 (right) and
Figure 7.
More Efficient Sampling. The curves of the pixel-level metric PSNR and the perception-level metric FID for ESRDF and ResDiff on the FFHQ dataset are shown in
Figure 8. ESRDF’s PSNR curve starts higher at the beginning of optimization and maintains its advantage over the comparison models throughout training. We attribute this to the core design of ESRDF: the main-path CNN supplies the diffusion model with a sufficient and accurate low-frequency base at the initial stage of restoration, avoiding the low optimization starting point caused by insufficient initial information in conventional diffusion models and laying a solid foundation for subsequent high-frequency detail reconstruction. The FID curve provides further evidence: relying on the synergy of the entropy matching loss and low-entropy residual modeling, ESRDF reaches markedly better perceptual scores than the comparison models at the same iteration stage, with fewer denoising steps. This directly reflects ESRDF’s balance between sampling efficiency and perceptual quality: it does not depend on a large number of iteration steps to improve performance while preserving the visual quality of the generated images. Compared with other diffusion methods, ESRDF offers higher sampling efficiency, including fewer denoising steps, better perceptual metrics, and a convergence speed comparable to ResDiff [7].
5.3. Ablation Study
In this section, we conduct ablation experiments on the FFHQ dataset to evaluate the effectiveness of each component of ESRDF, including whether residual modeling yields superior reconstruction performance. The results are presented in
Table 3. In the table, the “Res” column indicates whether the residual model is used, and the “Methods” column indicates whether, within the CNN block of ESRDF, training pairs are produced via CNN or bicubic upsampling before the residuals are computed from these pairs, as shown in
Figure 4. A checkmark (✓) indicates that the corresponding experimental condition was adopted.
5.4. Experimental Performance Analysis
To complete the performance evaluation, this paper comprehensively adopts two pixel-level distortion metrics (PSNR and SSIM) and one perceptual quality metric (FID). Specifically, PSNR and SSIM have complementary advantages in quantifying pixel-level errors and evaluating image structural consistency, respectively, while FID can accurately measure the overall distribution similarity between generated content and real samples. The combination of these three metrics ensures that the final evaluation results not only accurately quantify the overall generation quality of the model, but also align with human subjective visual perception.
For both face SR and general SR reconstruction tasks, our proposed method ESRDF outperforms the second-best comparative model in all core metrics, including PSNR, SSIM, and FID. ESRDF relies on a conditional diffusion model with a residual architecture, which efficiently achieves accurate reconstruction and detail enhancement from LR images to their HR counterparts, breaking through the bottlenecks of low sampling efficiency and high computational cost faced by traditional diffusion models in SR tasks. The model constructs a multi-loss fusion optimization framework in the CNN-based predictor of the main path, where the total loss function combines the feature loss, the pixel-wise loss, and the entropy matching loss. Specifically, the feature loss improves the perceptual consistency of reconstructed images, ensuring that the generated results conform to human visual cognition; the pixel-wise loss constrains reconstruction accuracy at the pixel level by penalizing the expected pixel-wise error between the ground-truth HR images and the predicted images. However, these two losses only optimize the model at the perceptual and pixel levels, respectively; they fail to constrain the overall structural alignment of information within regions or to drive rapid convergence of the information distribution from LR images to HR images. The entropy matching loss compensates for exactly this limitation: it imposes strict constraints on the overall structural alignment of information within regions from the entropy dimension, forcing the learned features of LR images to align with the structural and information-distribution characteristics of HR images during training.
The synergistic effect of the three loss functions not only enables the output of high-quality residual information with lower entropy, less noise, and higher pixel accuracy, but also guarantees reconstruction performance along three core dimensions: structural alignment, perceptual quality, and pixel precision. This design fundamentally reduces the training difficulty, training cost, and subsequent computational burden of the branch denoising network and, more importantly, guides the branch model to focus on accurately capturing and finely restoring key details such as high-frequency textures and edge contours. Stable, superior reconstruction performance is thus achieved without relying on excessive denoising iteration steps. Finally, validated on two typical SR tasks using the face image dataset (FFHQ) and general natural image datasets (DIV2K and Urban100), the proposed method stably achieves both high sampling speed and outstanding generation quality, verifying its generality and effectiveness.
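To make the multi-loss fusion concrete, the sketch below assumes plausible forms for each term, since the exact formulas and weights are not given in this excerpt: an L1 pixel-wise loss, an MSE feature loss over hypothetical feature vectors, and an entropy matching term approximated as the squared difference between the histogram entropies of the predicted and ground-truth images. The weights `w_pix`, `w_feat`, and `w_ent` are placeholders, not the paper’s settings.

```python
import numpy as np

def hist_entropy(img, bins=32):
    """Shannon entropy of an image's intensity histogram (pixel values in [0, 1])."""
    counts, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def total_loss(pred, hr, feat_pred, feat_hr, w_pix=1.0, w_feat=0.1, w_ent=0.1):
    """Multi-loss fusion sketch: pixel-wise L1 + feature MSE + entropy matching.
    All three term forms and weights are illustrative assumptions."""
    l_pix = np.mean(np.abs(pred - hr))                     # pixel accuracy
    l_feat = np.mean((feat_pred - feat_hr) ** 2)           # perceptual consistency
    l_ent = (hist_entropy(pred) - hist_entropy(hr)) ** 2   # entropy alignment
    return w_pix * l_pix + w_feat * l_feat + w_ent * l_ent
```

Under this formulation, a prediction matching the ground truth in pixels, features, and histogram entropy incurs zero total loss, and each term penalizes a distinct failure mode.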
Ablation experiments were conducted on the FFHQ dataset for the SR task. By controlling the activation or deactivation of the entropy matching loss and the residual structure (Res), multiple control groups were designed to systematically verify the independent contributions and synergistic effects of these two core components. The results show clear performance gradients across groups; the comparative analysis is as follows. First, comparing the optimal group (ESRDF-CNN), with both the entropy matching loss and Res activated, against the group with only the entropy matching loss activated shows that the former outperforms the latter in all three core metrics (PSNR, SSIM, and FID), albeit by a modest margin. This indicates that the residual structure further improves the model’s feature-transfer efficiency, marginally enhancing reconstruction accuracy and the realism of generated images. Second, comparing the optimal group against the group with Res activated but the entropy matching loss deactivated, the latter is significantly inferior in all metrics, exhibiting a noticeable performance degradation. This confirms that the entropy matching loss plays a critical role in improving reconstruction accuracy and perceptual quality. Third, the group with neither component activated performs worse still, significantly trailing both single-component groups. This indicates that the two components act synergistically: the absence of either degrades performance, and their combination yields a superimposed performance gain.
In addition, the group with only the entropy matching loss activated outperforms the group with only Res activated, further confirming that the entropy matching loss plays the more critical role in enhancing model performance. Finally, all ESRDF-CNN-related groups significantly outperform bicubic interpolation in all metrics, verifying the soundness of the proposed model design.
In summary, the entropy matching loss can strengthen the overall structural alignment of information within regions through entropy-dimensional constraints, ensuring the structural consistency and precision of reconstructed images; the residual structure can effectively alleviate the gradient vanishing problem in deep network training, optimize feature transfer efficiency, and facilitate the complete preservation and transmission of effective features; the synergistic effect of these two components can achieve the superposition of performance gains, effectively improve the comprehensive reconstruction performance of the model, and enable the generated images to possess both higher objective accuracy and superior subjective realism.
6. Conclusions
This paper focuses on diffusion models based on residual structures and proposes the Entropy Subtraction-Supported Residual-Diffusion Framework (ESRDF), aiming to address the core challenges of low sampling efficiency and over-smoothing in the reconstructed images of traditional diffusion models for image SR tasks. Unlike traditional methods that only use LR images as conditions to guide the generation of HR images, ESRDF constructs a collaborative architecture consisting of a CNN-based main-path predictor and a diffusion branch: the main-path CNN efficiently extracts LR image features and initially reconstructs the basic structure, while the diffusion branch focuses on accurately modeling the low-entropy residual information. Through the entropy matching loss, the framework imposes strict constraints on the overall structural alignment of intra-regional information from the entropy dimension, forcing the feature distribution of LR images to align with that of HR images. Meanwhile, it cooperates with the feature loss and the pixel loss to form a multi-loss fusion optimization, ensuring the perceptual consistency and pixel accuracy of reconstructed images. By constraining the entropy of the residual information and reducing redundant noise, this design not only reduces the computational burden of the diffusion branch but also achieves a low-cost, high-speed sampling process, fundamentally alleviating the over-smoothing problem in reconstruction. Statistical analysis and systematic experiments on multiple public datasets (FFHQ, DIV2K, Urban100) show that low-entropy data can significantly improve sampling efficiency. ESRDF outperforms comparative models in three core metrics (PSNR, SSIM, and FID), greatly shortening the training convergence period while achieving marked improvements in generation quality for both face and general SR tasks.
Ablation experiments further confirm that the synergistic effect of the entropy matching loss and residual structure can produce a superimposed performance gain, which is the key for the model to overcome the problems of low sampling efficiency and over-smoothing. Currently, the feasibility of ESRDF has been verified on simple models. Future research is expected to further improve performance and generalization ability by deeply integrating it with larger-scale models.