Entropy Subtraction-Supported Residual-Diffusion Framework for Image Super-Resolution

Huang, Honghe; Shao, Changbin; Hu, Chunlong; Shu, Xin; Yu, Hualong

doi:10.3390/sym18010193

Open AccessArticle

Entropy Subtraction-Supported Residual-Diffusion Framework for Image Super-Resolution

by

Honghe Huang

,

Changbin Shao

^*

,

Chunlong Hu

,

Xin Shu

and

Hualong Yu

School of Computer Science, Jiangsu University of Science and Technology, Zhenjiang 212100, China

^*

Author to whom correspondence should be addressed.

Symmetry 2026, 18(1), 193; https://doi.org/10.3390/sym18010193

Submission received: 10 December 2025 / Revised: 4 January 2026 / Accepted: 6 January 2026 / Published: 20 January 2026

(This article belongs to the Section Computer)

Download

Browse Figures

Versions Notes

Abstract

Diffusion probabilistic models have demonstrated remarkable superiority in SISR. Yet, their multi-step denoising mechanism incurs prohibitive computational overhead, which severely limits real-world deployment. To address this issue, we propose an Entropy Subtraction-Supported Diffusion Denoising framework for image Reconstruction (ESRDF). The core idea is to shift part of the SR burden from the diffusion model to an image Decoder, with a key focus on recovering the symmetric structural correspondence between LR and HR images that is often degraded during downsampling. Specifically, ESRDF’s main branch employs a CNN that performs one-step feature reconstruction, supervised by a novel entropy-matching loss in addition to the conventional reconstruction loss. This loss adopts a patch-wise entropy matching strategy that enforces regional consistency between the True and the predicted images. Building on L1’s focus on pixel-level details and perceptual loss’s grasp of global semantics, region-wise entropy measurement further completes the global alignment of intra-region information structures. Under this framework, the main branch delivers coarse low-frequency content, drastically reducing the workload of the diffusion branch, which now only needs to sparsely refine high-frequency details. Experimental results on multiple benchmark datasets demonstrate that ESRDF achieves shorter model convergence times and higher generation quality with fewer denoising steps, outperforming previous diffusion-based image reconstruction methods.

Keywords:

SISR; diffusion models; deep learning; loss function; information entropy; residual modeling

1. Introduction

Single Image Super-Resolution (SISR) is an important branch of image generation and transformation tasks, which aims to recover a High-Resolution (HR) image from a corresponding Low-Resolution (LR) image [1,2,3]. The inherent symmetry of images, including global structural symmetry and local texture symmetry, serves as a strong prior for the provision of constraints for the detailed reconstruction of HR images and effectively alleviates the problem of detail distortion in the SR process. Given its wide application in scenarios requiring high-fidelity details, this task has consistently been a vital research topic in the field of computer vision. In the domain of reconstruction techniques (such as image reconstruction, signal reconstruction, and 3D reconstruction), early conventional methods, primarily represented by interpolation and sparse representation, often suffered from issues like detail loss, sensitivity to noise, and poor model generalization. These limitations made it difficult to meet the demanding requirements for high resolution, high fidelity, and complex scenes. With the advent of deep learning, image reconstruction performance based on CNNs achieved a significant leap in quality [4,5,6,7,8]. Briefly, deep reconstruction methods can be broadly classified into four categories: PSNR-driven methods, GAN-based methods, Flow-based methods, and Diffusion methods.

PSNR-driven methods [9,10,11] optimize the quality of pixel-level reconstruction by minimizing the Mean Squared Error (MSE) between the super-resolved image and the ground truth image, thereby enhancing the Peak Signal-to-Noise Ratio (PSNR) [12]. However, they frequently produce overly smooth images that deviate from human visual perception.

GAN-based methods [13,14,15] combine content loss with adversarial loss to generate images with sharp edges and fine textures, mitigating the problem of detail loss inherent in PSNR-driven approaches. Nevertheless, they face challenges such as mode collapse, training instability, and structural artifacts [16].

Flow-based methods [17,18] utilize reversible neural networks to encode the HR image into a latent space and perform reconstruction via an inverse transformation, achieving a balance between natural detail and diversity. Although these models strike a better balance between realism and visual quality, they still grapple with challenges such as high computational complexity, slow training and inference speeds, and large memory consumption [16].

Diffusion Probabilistic Models (DPMs) [19] have exhibited exceptional performance in the SISR task, demonstrating immense potential and broad prospects in this domain and offering a new paradigm for subsequent research. DPMs are predicated on a Markov chain-based noise injection mechanism that transforms data into latent variables with a Gaussian distribution, followed by a reverse denoising process to reconstruct the data, which has symmetry. Optimized through the Evidence Lower Bound (ELBO), DPMs ensure high-quality sample generation and training stability, which to some extent alleviates the intrinsic shortcomings of GANs, such as lack of diversity, mode collapse, and training instability due to adversarial training, while also exhibiting stronger distribution coverage and generalization performance. However, their sampling process typically requires hundreds or even thousands of iterations, resulting in slow inference speeds. For instance, SR3 [20] demands as many as 2000 training sampling steps to yield images of satisfactory quality.

To achieve fast and low-cost sampling, some methods employ residual modeling instead of natural image modeling, focusing on the difference between the HR and LR images. As shown in Figure 1, modeling the residual image requires fewer computational resources and enables faster training speeds, thus allowing the model to avoid quality degradation with fewer denoising steps. For example, SRDiff [21] uses the bilinear upsampled LR image as a conditional guide for the diffusion model; DVSR [22] employs a randomly initialized CNN to recover the missing information, thereby alleviating the modeling burden; and ResDiff [7] proposed a simple pre-trained CNN and decomposed the residual information into high-frequency and low-frequency components, further reducing the diffusion model’s load.

The above research methods [7,21,22] adopt residual-based diffusion models, where the denoising network only needs to model the residual. Compared to traditional diffusion models, which must process a large volume of natural image information, residual diffusion models significantly reduce the processing burden on the denoising network, thereby accelerating convergence and lowering computational cost. Further alleviating the burden on the denoising network is crucial for accelerating the generation process and improving sample quality. Inspired by [23,24], it is observed that images with lower entropy typically exhibit higher self-similarity and more regular structural textures, suggesting they contain less information and possess simpler data distribution characteristics, which significantly reduces the difficulty of reconstruction. Experimental data demonstrate that data distributions with lower entropy are easier for generative models to capture, and as the entropy decreases, the residual distribution becomes more concentrated, as shown in Figure 2. This key property provides a vital theoretical basis for the denoising network in the entropy subtraction-supported residual-diffusion framework proposed in this paper: fast and low-cost sampling can be achieved by modeling low-entropy residuals.

In this paper, we propose a Conditional Diffusion Model (ESRDF) based on a residual architecture and introduce an Entropy Matching Loss

L_{Em}

. This loss is applied to the CNN-based predictor in the main path to focus on the overall structural alignment of regional information, constraining the image information from an entropy dimension to drive the convergence of the LR image toward the HR image. By incorporating a patch-based mechanism to preserve the overall spatial structure, the model yields residuals with lower entropy. This mechanism reduces the training cost and processing burden on the branch denoising network, guaranteeing reconstruction quality within a limited number of denoising steps, thus avoiding the over-smoothing artifact.

Experiments on the face dataset (FFHQ) and two general datasets (Div2k and Urban100) demonstrate that ESRDF not only achieves fast and low-cost sampling but also generates finer images.

The contributions of this paper are summarized as follows:

Sampling Efficiency: We propose a diffusion model, ESRDF, based on a residual structure to address the SISR problem. Compared to other diffusion methods, this model achieves a low-cost and fast sampling process, specifically requiring fewer denoising steps and demonstrating faster convergence speed.
Perceptually Consistent Output: Compared to other diffusion methods, ESRDF exhibits lower perception-based evaluation scores, indicating that our method generates samples that are more consistent with human perception.
Superior Generation Quality: Compared to other diffusion methods, our method is simpler. By introducing the Entropy Matching Loss $L_{E m}$ in the main path, we enhance the low-frequency global information, allowing the branch diffusion model to focus on high-frequency detail capture and achieve superior generation results.

2. Related Works

The objective of SISR is to recover an HR image from a corresponding LR image. Currently, the academic community has proposed various deep learning techniques to address this reconstruction task, which can be broadly classified into four categories: PSNR-driven methods, GAN-based methods, flow-based methods, and diffusion-based methods. Notably, symmetry priors plays a crucial role in all these deep reconstruction methods: for example, symmetric constraints are incorporated into PSNR-driven models to enhance the rationality of reconstructed structures, symmetric discriminators are utilized in GAN-based methods to improve the consistency of texture details, and symmetric sampling strategies are adopted in diffusion-based models to optimize the global fidelity of images. The first three of these categories can be collectively referred to as non-diffusion methods.

2.1. Non-Diffusion-Based Methods

PSNR-driven methods [4,9,10,11,25] leverage pixel-level optimization to minimize the Mean Squared Error (MSE) between the super-resolved image reconstructed from the LR image and the true HR image, thereby improving the Peak Signal-to-Noise Ratio (PSNR). PSNR is a widely used image quality metric quantified by MSE. By minimizing the MSE, PSNR-driven methods effectively reduce the pixel-wise difference between the reconstructed image and the ground truth image, thus enhancing the visual quality of the image. SRCNN [9], the first model to achieve an end-to-end mapping from LR to HR images, serves as a representative PSNR-driven method. Building upon SRCNN, several subsequent methods [10,11,25] have effectively improved SR performance by fine-tuning network architectures and loss functions. However, PSNR-driven methods are limited to pixel-level optimization and often generate overly smooth images, deviating from human visual perception. This is because, in high-dimensional prediction, these methods capture the median or mean of the entire solution space, which may be insufficient for estimating any specific solution [26].

GAN-based methods [14,15,27], by combining content loss with adversarial loss, can generate high-quality images featuring sharp edges and fine textures, effectively mitigating the detail loss problem caused by the over-smoothing in traditional PSNR-driven methods. As a landmark study, SRGAN [13], built upon the SRResNet architecture, integrates perceptual loss and adversarial loss within a Generative Adversarial Network framework, significantly enhancing the visual realism of reconstructed images. Another groundbreaking work, ESRGAN [14], further optimizes the high-frequency details and texture authenticity of SR images by introducing residual-in-residual dense blocks and a relative discriminator. Nevertheless, GANs suffer from the inherent flaw of mode collapse during training, which significantly affects the diversity of SR-generated samples; simultaneously, these models are highly unstable to train and often produce unexpected structural artifacts in the reconstructed images, thereby compromising visual quality [28].

Flow-based methods [17,18] utilize reversible neural networks to construct a mapping from LR images to HR images, enabling diversified reconstruction results while preserving natural details. SRFlow [17], a representative method based on normalizing flows, employs a reversible neural network architecture to encode the HR image into a latent space representation, which is then decoded and reconstructed via an inverse transformation process. By constructing a bidirectional mapping between its encoder and decoder, this model effectively alleviates the instability issues encountered in the training process of traditional models. Although flow-based image SR techniques excel in diversity and physical modeling, the high computational complexity and memory requirements introduced by reversible networks remain a challenge. In terms of reconstruction results, while they achieve a balance between diversity and physical plausibility, detail generation can be either excessive or insufficient, and their adaptability to different types of images is limited, especially in complex scenes where distortion is likely to occur, necessitating optimization with prior knowledge [29].

2.2. Diffusion-Based Methods

DPMs are a class of generative modeling methods that transform Gaussian noise into a target data distribution through a stepwise denoising Markov chain process, which is a symmetric process. This endows diffusion models with the inherent advantages of high-quality sampling, broad mode coverage, and strong sample diversity, but it also makes fast and low-cost sampling challenging to achieve. In the field of SISR, SR3 [20] and SRDiff [21] are pioneering works based on diffusion models; both utilize conditional diffusion. However, these two methods differ in their modeling approach: SR3 directly models the natural image using the LR image as the condition, while SRDiff utilizes a pre-trained encoder to extract features from the LR image as the condition for modeling the residual image. To achieve fast and low-cost sampling, researchers have further explored residual-based diffusion model methods. For instance, DVSR [22] employs a structurally simple CNN to first perform deterministic estimation, recovering the majority of the image information, followed by refinement using a diffusion model, thereby constructing a prediction-refinement conditional diffusion model framework. Subsequently, ResDiff further optimized this framework. The ResDiff [7] architecture combines a pre-trained CNN with a conditional diffusion model and decomposes the residual information into high-frequency and low-frequency components, thus precisely guiding the prediction of the noise predictor and further enhancing perceptual evaluation metrics.

Specifically, ResDiff utilizes a CNN-based pre-trained pixel predictor to recover the main low-frequency content of the image and introduces a frequency-domain-based loss function to enhance the recovery effect. Building upon this, the conditional diffusion model focuses on predicting high-frequency residual information. Frequency-domain guidance enables the noise predictor to more effectively learn high-frequency detail generation. This design not only accelerates the model’s convergence speed but also significantly improves the quality of the generated images.

However, the sampling efficiency of diffusion models is relatively low because they require traversing the entire high-dimensional network multiple times in both the forward and reverse directions, resulting in immense computational overhead. The core contribution of this framework lies in its ability to utilize the model’s computational resources more efficiently while avoiding the difficulties associated with the direct modeling of complex images. Furthermore, the deterministic estimation provided by the CNN-based pixel predictor offers a good initial condition for the diffusion model, thus reducing the burden on the diffusion model, decreasing the required denoising steps, and consequently improving its efficiency and effectiveness.

2.3. Entropy Subtraction Supported Residual-Diffusion Framework

This paper proposes an entropy subtraction-supported residual-diffusion framework, which aims to integrate the efficiency of residual-driven diffusion models with entropy-based detail optimization strategies to address two core issues faced by diffusion models in image SR tasks, namely over-smoothing and low sampling efficiency.

Currently, to improve the sampling efficiency of diffusion models, the academic community has proposed various SR reconstruction methods based on residual diffusion models [7,30]. Traditional diffusion models directly model complete HR images, and the information space of the images they need to restore contains not only substantial redundant background information but also high-entropy high-frequency details. In contrast, such residual diffusion methods only perform diffusion modeling on the residual between LR images and HR images (i.e., the high-frequency detail component that LR images lack). This design essentially reduces the information entropy that the model needs to process during the image restoration stage.

In addition, to enhance the quality of image reconstruction, in related research on entropy-oriented optimization, Xu et al. [24] analyzed the essence of the over-smoothing phenomenon in SR tasks from the perspective of data characteristics. They pointed out that the higher the information entropy of HR images, the more significant the deviation between the center of the model’s optimization objective and the true distribution of clean images, which ultimately leads to the lack of texture details in the generated images and the occurrence of over-smoothing problems. To address this issue, the team designed DECLoss. Through this loss function, the entropy value of the data distribution is reduced, thereby effectively suppressing the over-smoothing phenomenon and further improving the quality of image reconstruction.

From what has been discussed above, through statistical experiments, we further find that the lower the information entropy of the residual image processed by the diffusion model, the smaller the corresponding Wasserstein distance, the more stable the fluctuation of the training loss, and the better the performance of the image reconstruction metric PSNR, as shown in Figure 3. The above experimental results fully verify the effectiveness and rationality of the proposed method.

3. Diffusion Model

Denoising Diffusion Probabilistic Models (DDPMs [19]) are generative models based on the theory of Markov chains, consisting of a sequence of Markov chains

(x_{0}, x_{1}, \dots, x_{T})

. The core idea is to gradually add Gaussian noise

ϵ \sim N (0, I)

to the data, transforming it into pure noise

x_{T} \sim N (0, I)

, and then progressively recover the original data through a reverse process until the desired data matching the source data distribution is generated. Therefore, the DDPM is composed of two phases: the forward diffusion process and the reverse denoising process [31].

Forward diffusion process: Given an image

x_{0}

, the forward diffusion process is defined as a Markov chain

x_{0 : t}

. In this process, Gaussian noise

ϵ

controlled by the coefficient

β_{1 : t} \in (0, 1)

is continuously added to

x_{0} \sim q (x_{0})

. Specifically, each step of the diffusion process can be expressed as:

q (x_{t} | x_{t - 1}) = N (x_{t}; \sqrt{1 - β_{t}} x_{t - 1}, β_{t} I)

(1)

where the data

x_{T} \sim N (0, I)

approaches pure noise at time step

t = 1, . . ., T

; through the Markov property, we can obtain the entire forward noising process from

x_{0}

to

x_{t}

:

q (x_{t} | x_{0}) = N (x_{t}; \sqrt{{\bar{α}}_{t}} x_{0}, (1 - {\bar{α}}_{t}) I)

(2)

where

α_{t} = 1 - β_{t}

and

\bar{α_{t}} = \prod_{i = 0}^{t} α_{i}

, which can be obtained through reparameterization:

x_{t} = \sqrt{{\bar{α}}_{t}} x_{0} + \sqrt{1 - {\bar{α}}_{t}} ϵ, ϵ \sim N (0, I)

(3)

Reverse denoising process:

x_{T} \sim N (0, I)

has been obtained through the forward diffusion process. The reverse denoising process is defined as a reverse Markov chain

x_{t : 0}

. In this process, starting from the noise

x_{T}

and progressively removing noise to recover the original data

x_{0}

, it involves learning the conditional distribution:

p_{θ} (x_{t - 1} ∣ x_{t}) = N (x_{t - 1}; μ_{θ} (x_{t}, t), σ_{θ}^{2} (x_{t}, t) I)

(4)

where

x_{t}

is interpreted as the mean

μ_{θ}

and the standard deviation

σ_{θ}^{2}

with the Gaussian distribution, and

θ

denotes the learnable parameters of the noise prediction model

ϵ_{θ} (x_{t}, t)

, which is trained using neural networks, such as U-Net, to progressively perform denoising operations.

During the training process, to optimize the model parameters, we are minimizing the variant of ELBO, which can be simply written as follows:

min_{θ} L_{t - 1} (θ) = E_{x_{0}, ϵ, t} [∥ ϵ - ϵ_{θ} (\sqrt{{\bar{α}}_{t}} x_{0} + \sqrt{1 - {\bar{α}}_{t}} ϵ, t) ∥_{1}]

(5)

where

ϵ \sim N (0, I)

is the Gaussian noise sampled randomly, and

ϵ_{θ}

is generated by the noise prediction model, which is required to approximate the noise

ϵ

at the current time step as closely as possible.

When the noise prediction model is trained, we can commence generating data starting from

x_{T} \sim N (0, I)

. It is evident that the variance and mean of the data distribution

p_{θ} (x_{t - 1} ∣ x_{t})

in Equation (4) can be calculated at each time step until image

x_{0}

is obtained.

μ_{θ} (x_{t}, t) = \frac{1}{\sqrt{α_{t}}} (x_{t} - \frac{1 - α_{t}}{\sqrt{1 - {\bar{α}}_{t}}} ϵ_{θ} (x_{t}, t))

(6)

σ_{θ}^{2} (x_{t}, t) = \frac{1 - {\bar{α}}_{t - 1}}{1 - {\bar{α}}_{t}} β_{t}

(7)

There is a primary distinction between diffusion models and earlier generative models in that diffusion models involve dynamic execution across iterative time steps, covering both forward and reverse processes. This dynamic nature endows diffusion models with greater flexibility and adaptability when generating complex data distributions.

4. Methodology

4.1. Residual Framework in SISR

DPMs have demonstrated exceptional capabilities in image synthesis [32] and image restoration tasks [33,34], and have shown immense potential in the SISR task. For instance, existing diffusion-based methods (e.g., SR3) directly generate the HR image from random noise, guided by the LR image as a condition. The specific formulation of this conventional diffusion model framework is as follows:

I_{SR} = {Unet}_{t \in [T, 1]} (x_{t}, t, Iu (I_{LR}))

(8)

where Unet [35] indicates the network architecture of the diffusion model,

x_{t}

represents a natural image progressively corrupted by Gaussian noise until it approximates pure Gaussian noise and t represents the corresponding time step.

Iu (I_{LR})

is the upsampled image (e.g., bilinear and bicubic), which is used as a conditional guide.

However, natural images contain a vast amount of information and have high requirements for precise restoration. In diffusion models with traditional architectures, each step requires the processing of a large amount of information. To address this issue, residual-based diffusion models provide a solution.

Regarding Figure 4, this section discusses that in the research of conditional diffusion models, the residual-based diffusion models have more advantages compared to traditional architectures based on natural images.

The goal of both is consistent, which is to obtain an SR output image of large size. But the difference lies in the architecture.

The residual-based diffusion models are developed from the traditional architecture, as shown in Equation (8), and employ a dual-branch strategy. The main path is responsible for the preliminary restoration of the image to generate an intermediate result close to the target, while the denoising network of the branch path only needs to model the residuals. Finally, by adding the output results of the two paths together, the

I_{S R}

image can be obtained. This architecture can be briefly described as:

I_{SR} = {Unet}_{t \in [T, 1]} ({Res}_{t}, t, Iu (I_{LR})) + Iu (I_{LR})

(9)

where

{Res}_{t}

represents a residual image that has been added with Gaussian noise until it reaches saturation.

However, some studies [7,22] have shown that by introducing a CNN-based predictor in the main path, the model can complete most of the computational tasks in a single run, which further reduces the computational burden of residual modeling in the denoising process. This enables the denoising network to achieve good reconstruction results even when the number of denoising steps is limited.

Overall, the residual-based diffusion model has higher sampling efficiency because residual modeling can reduce the computational cost of the denoising network during sampling.

4.2. Motivation

Revisiting the residual-based diffusion models from the perspective of entropy, we can understand why those methods can perform sampling more efficiently.

Entropy in image distribution: Entropy is a metric of the uncertainty and randomness of a system. In natural images, the pixels are evenly distributed, while the residual pixel values follow a sharper and concentrated distribution, resulting in a substantial decrease in entropy.

Entropy’s effect on reconstruction: Relevant research [23,24] indicates that lower entropy typically corresponds to higher self-similarity and more regular textural features, implying a simpler distribution. Conversely, higher entropy values signify greater image complexity or irregularity, characterized by increased randomness, as shown in Figure 2. It can be concluded that

Lower-entropy data distributions are more easily captured and learned by generative models.
Compared with traditional interpolation methods, a CNN-based pixel predictor yields residual images with lower entropy.

Motivation: In the residual-based diffusion models, we designed a CNN-based predictor and introduced an entropy-based loss function. This design yields lower entropy of the residual that requires modeling.

As shown in the lavender part in Figure 4, by enhancing the entropy recovery ability of the initial CNN, residual images with lower entropy can be obtained, and it means the diffusion model on the branch path only needs to bear an extremely low burden, thus achieving efficient sampling. Formally, the optimization objective is expressed as:

m i n L_{Entropy} (HR, Iu (LR))

(10)

Unlike SR3, which uses natural images, ESRDF employs residuals processed by a CNN, and the image entropy of the latter is lower than that of natural images. The FFHQ (

4 \times

) validate this conclusion, as shown in Figure 3. This statistical experiment reflects the differences between residual images and natural images. As can be seen from the left subfigure, the residual images have lower entropy values and smaller Wasserstein distances than natural images. As can be seen from the right subfigure, the PSNR curves have a higher starting point at the beginning of optimization and maintain a performance advantage over the comparison models throughout the entire iteration process. It can thus be concluded that the use of residual features with lower entropy values helps to achieve smaller Wasserstein distances, thereby effectively improving the image reconstruction performance of the model.

4.3. ESRDF

ESRDF is derived from the residual-based diffusion models. In its structural design, the main path employs an initial CNN, while the denoising network of the branch path only needs to model the residual part.

To reduce the entropy of the residual for modeling before it is fed into the denoising network of the ESRDF branch, this paper proposes an entropy-matching loss, denoted as

L_{E m}

, and combines it with a perceptual loss

L_{F e a t}

[12] and a pixel-wise loss function

L_{P i x e l}

.

Entropy matching Loss

L_{E m}

. Inspired by [36,37,38], we designed the histogram loss to measure the similarity between the histograms of the generated image and the target image, and to minimize the KL divergence between their patches. Specifically, we employed a differentiable histogram generation method to calculate the histogram. This method uses Gaussian kernel weights to map the intensity of each pixel continuously to all histogram bins, rather than simply assigning it to the nearest bin. This continuous allocation approach not only makes the pixel intensity distribution of the generated image closer to that of the target image, but also makes the uncertainty (i.e., entropy) of the pixel intensity distribution of the generated image closer to that of the target image. In this way, it not only optimizes the distribution of pixel intensities, but also optimizes the uncertainty of the distribution, thereby achieving tighter information entropy alignment between the generated image and the target image:

\begin{matrix} L_{E m} & = K L (H (y) | | H (\hat{y})) \\ = E [H (y_{i}) \cdot log \frac{H (y_{i}) + ϵ_{1}}{H (\hat{y_{i}}) + ϵ_{2}}] \end{matrix}

(11)

where

{\hat{y}}_{i}

and

y_{i}

denote the SR image patches and ground-truth image patches, respectively.

ϵ_{1}

and

ϵ_{2}

are numerical stability constants. Given the asymmetry of the KL divergence, the calculation direction is defined as using the histogram

H (y)

of the ground-truth image as the reference distribution to measure the discrepancy between the histogram

H (\hat{y})

of the SR image and this reference, so as to constrain the reconstructed distribution to align with the ground-truth distribution.

H (x)

is a function that maps the input image x to its corresponding histogram distribution, with the detailed procedure given below:

1. Dynamic generation of bin centers: The bin centers c are

b i n s

points uniformly distributed within the range

{c ∣ 0 \leq c \leq 1}

, that is:

c = \{c_{i} ∣ c_{i} = \frac{i}{b i n s - 1}, i = 0, 1, \dots, b i n s - 1\}

(12)

where c denotes the set of bin centers, and

b i n s

represents the total number of bins.

2. Gaussian Kernel Weight Calculation: For each pixel value x, the Gaussian kernel weight with respect to each bin center is calculated as:

w = \frac{1}{σ \sqrt{2 π}} exp (- \frac{{(x - c)}^{2}}{2 σ^{2}})

(13)

where w denotes the Gaussian kernel weight between the pixel value x and the bin center c,

σ

is the standard deviation of the Gaussian kernel, and c is the corresponding bin center.

3. For each bin in each channel, accumulate the weights of all pixels, then normalize the histogram for each channel:

h_{j b} = \sum_{k = 1}^{H \times W} w_{k}, H_{j b} = \frac{h_{j b}}{\sum_{b = 0}^{bins - 1} h_{j b}}

(14)

where

h_{j b}

denotes the accumulated weight for bin b in channel j,

H_{j b}

is the normalized histogram value for bin b in channel j, H and W are the height and width of the image, and bins is the total number of bins. In conclusion,

H_{j b}

serves as the specific computational implementation of the global histogram mapping function

H (x)

, and the two have an overall-to-local correspondence.

Feature reconstruction Loss

L_{F e a t}

is represented as the squared Euclidean distance between the feature representations of two images:

L_{F e a t} = E [\frac{1}{C_{j} H_{j} W_{j}} {∥ϕ_{j} (\hat{y}) - ϕ_{j} (y)∥}_{2}^{2}]

(15)

where

φ (x)

denotes the third-party evaluation network (e.g., VGG16), and

φ_{j} (x)

indicates the activation of the j-th layer. If the j-th layer is a convolutional layer,

φ_{j} (x)

is a feature map of shape

C_{j} \times H_{j} \times W_{j}

, corresponding to the number of channels, height, and width, respectively.

Pixel Loss

L_{P i x e l}

is the loss in the image spatial domain, defined as the absolute difference between the pixel values of the true image and the predicted image:

L_{P i x e l} = E [∥ y - \hat{y} ∥_{1}]

(16)

where

| \cdot |

denotes the absolute value.

Therefore, the overall loss consists of the following components:

L = L_{E m} + L_{F e a t} + L_{P i x e l}

(17)

4.4. Structure and Training

The main path of ESRDF employs a pre-trained CNN to process LR images, reconstructing the primary image information and laying the groundwork for the subsequent operation of the branch denoising network. Inspired by [4,39,40], a simple design is carried out for the structure of the main path predictor, as shown in Figure 5.

4.5. Discussion

Wasserstein-Angle: Shorter Is Better

The Wasserstein distance (W-distance) quantifies the discrepancy between two probability distributions by the minimum “transport cost” required to reshape one into the other [41]. From this vantage point, turning a residual distribution into Gaussian noise is far cheaper than forcing a natural-image distribution to become Gaussian: the residual distribution lies geometrically much closer to the Gaussian target [42]. Consequently, the W-distance between residuals and Gaussian noise is markedly smaller, as shown in Figure 3.

5. Experiments

In this section, to comprehensively verify the effectiveness of the proposed ESRDF model, we first introduce the experimental setup, including the dataset, model configuration, and details of training and inference. Then, we report the experimental results and analyze them based on the results of the qualitative analysis.

5.1. Experimental Settings

Datasets & Metrics. This study conducts a systematic evaluation of the ESRDF model using three types of benchmark datasets. For face SR, we build an evaluation benchmark on the FFHQ dataset [43]. Following ResDiff, HR images are down-sampled with a bicubic kernel; 5000 of them are held out for testing and the rest are used for training.

For General SR, we adopt two standard datasets (Div2K [44] and Urban100 [45]). Following ResDiff, training patches of 160 × 160 are first cropped from the HR images and then sequentially down-sampled with a bicubic kernel. At test time, the original HR images are down-sampled and fed to the model to produce SR results. We evaluate on the official 100-image DIV2K test split and randomly reserve 20 images from Urban100 as the test set.

This study conducts a multi-dimensional performance analysis by integrating distortion metrics (PSNR and SSIM [46]) and perceptual quality metrics (FID [47,48]).

Baselines. As a baseline model, DDPM learns the reverse mapping of noise through a U-Net. After modification, it uses only the input image as conditional guidance to gradually recover data from pure noise for generation tasks.

Training & Evaluation Settings. The lightweight CNN is pre-trained for only 1 k iterations with batch size 96. The branch denoiser is a DDPM-style U-Net: 64 initial channels, two residual convolutions per block, dropout 0.0, the multipliers for U-Net depth were set to 1, 2, 4, 8, 8. For diffusion training, we use AdamW (lr =

1 \times 10^{- 4}

, weight-decay

5 \times 10^{- 4}

), batch size 80,

T = 200

, linear

β

-schedule from

1 \times 10^{- 6}

to

1 \times 10^{- 2}

, and L1 loss. The whole procedure finishes in 100 k steps on a single NVIDIA GeForce RTX 3090 graphics card (NVIDIA Corporation, Santa Clara, CA, USA). Specific content can be accessed via the public link provided at the end of the paper.

5.2. Performance

In this subsection, we evaluate ESRDF by comparing it with several advanced SR methods on face SR and general SR tasks. The specific results are as follows, where the bolded values represent the optimal values for each evaluation metric, and the “-” symbol indicates that the performance of the corresponding method on some of the used datasets was not explicitly mentioned.

Face SR. We verify the performance of ESRDF in the face restoration task through the

32 \times 32 \to 128 \times 128 (4 \times)

task of the FFHQ, as shown in Table 1. It displays partial results from ESRDF and other reconstruction approaches, shown in Figure 6(left).

General SR. To verify the performance of ESRDF in handling high-difficulty image restoration tasks on general-purpose datasets, we select DIV2K and Urban100, and respectively conduct the image SR task tests of

40 \times 40 \to 160 \times 160 (4 \times)

, as shown in Table 2. It displays partial results from ESRDF and other reconstruction approaches, shown in Figure 6(right) and Figure 7.

More Efficient sampling. The curves of pixel-level metric PSNR and perception-level metric FID for ESRDF and ResDiff on the FFHQ dataset are shown in Figure 8. As can be seen from the curves in the figure, ESRDF’s PSNR curve has a higher starting point at the beginning of optimization and maintains a performance advantage over the comparison models throughout the entire iteration process. This phenomenon can be attributed to the core design of our proposed ESRDF: its main path CNN can provide a sufficient and accurate low-frequency information base for the diffusion model at the initial stage of image restoration, effectively avoiding the problem of a low optimization starting point caused by insufficient initial information in traditional diffusion models, thereby laying a solid foundation for subsequent high-frequency detail reconstruction and metric improvement. Furthermore, further verification can be obtained from the variation trend of the FID curve. Relying on the synergistic effect of entropy matching loss and low-entropy residual modeling, ESRDF can achieve significantly better perceptual metric values than the comparison models at the same iteration stage with fewer denoising steps. This result directly reflects ESRDF’s balanced advantage between sampling efficiency and perceptual quality where it does not need to rely on a large number of iteration steps to obtain performance improvement while ensuring the visual effect of the generated images. Compared with other diffusion methods, ESRDF has a higher sampling efficiency, including fewer denoising steps, better perceptual metrics, and a convergence speed comparable to ResDiff [7].

5.3. Ablation Study

In this section, we conduct ablation experiments based on the FFHQ dataset to evaluate the effectiveness of each component in ESRDF, including verifying whether the use of residual modeling can achieve superior reconstruction performance, and the specific experimental results are presented in the Table 3. Among the results, “

R e s

” indicates whether the residual model is used, and the “Methods” column indicates that, within the CNN block of ESRDF, training pairs are produced via CNN or Bicubic upsampling and the residuals are then computed from these pairs, as shown in Figure 4, and the checkmark (✓) in the table indicates that the corresponding experimental condition was adopted.

5.4. Experimental Performance Analysis

To complete the performance evaluation, this paper comprehensively adopts two pixel-level distortion metrics (PSNR and SSIM) and one perceptual quality metric (FID). Specifically, PSNR and SSIM have complementary advantages in quantifying pixel-level errors and evaluating image structural consistency, respectively, while FID can accurately measure the overall distribution similarity between generated content and real samples. The combination of these three metrics ensures that the final evaluation results not only accurately quantify the overall generation quality of the model, but also align with human subjective visual perception.

For both Face SR and general SR reconstruction tasks, our proposed method ESRDF outperforms the second-best comparative model in all core metrics including PSNR, SSIM, and FID. ESRDF relies on a conditional diffusion model with a residual architecture, which can efficiently achieve accurate reconstruction and detail enhancement of LR images to HR counterparts, effectively breaking through the technical bottlenecks of low sampling efficiency and high computational cost faced by traditional diffusion models in SR tasks. The model innovatively constructs a multi-loss fusion optimization framework in the CNN-based predictor of the main path, where the total loss function is composed of

L_{F e a t}

,

L_{P i x e l}

and

L_{Em}

. Specifically, the feature loss

L_{F e a t}

focuses on improving the perceptual consistency of reconstructed images to ensure that the generated results conform to human visual cognition rules; the pixel-wise loss

L_{P i x e l}

accurately constrains the reconstruction accuracy at the pixel level by calculating the expected value of pixel-wise errors between the ground-truth HR images and the predicted images. However, these two types of losses can only optimize the model performance from the perceptual and pixel levels, respectively, failing to effectively constrain the overall structural alignment of information within regions or drive the rapid convergence of the information distribution from LR images to HR images. The introduction of the entropy matching loss

L_{Em}

exactly compensates for this critical limitation: it imposes strict and precise constraints on the overall structural alignment of information within regions from the entropy dimension, forcing the learned features of LR images to align with the structural and information distribution characteristics of HR images during training.

The synergistic effect of the three loss functions not only enables the output of high-quality residual information with lower entropy, less noise, and higher pixel accuracy but also comprehensively guarantees the reconstruction performance from three core dimensions: structural alignment, perceptual quality, and pixel precision. This core design fundamentally reduces the training difficulty, training cost, and subsequent computational burden of the branch denoising network, and more importantly, clearly guides the branch model to focus on the accurate capture and fine restoration of key details such as high-frequency textures and edge contours of images. Stable and superior reconstruction performance can be achieved without relying on excessive denoising iteration steps. Finally, validated on two typical SR tasks using the face image dataset (FFHQ) and general natural image datasets (Div2k and Urban100), the proposed method stably achieves excellent performance with both high sampling speed and outstanding generation quality, fully verifying its generality and effectiveness.

Ablation experiments were conducted on the FFHQ dataset for the

4 \times

SR task. By precisely controlling the activation or inactivation states of the entropy matching loss (

L_{E m}

) and residual structure (Res), multiple groups of control experiments were designed to systematically verify the independent contributions and synergistic effects of these two core components. The experimental results demonstrate that the performance of each group exhibits significant gradient differences, and the specific comparative analysis is as follows: First, a comparison between the optimal experimental group (ESRDF-CNN) with both

L_{E m}

and Res activated and the control group with only

L_{E m}

activated but Res inactivated shows that the former outperforms the latter in all three core metrics (PSNR, SSIM, and FID), achieving only a modest performance improvement. This indicates that the introduction of the residual structure can further optimize the feature transfer efficiency of the model, thereby marginally enhancing the accuracy of image reconstruction and the realism of generated images. Second, when comparing the optimal experimental group with the control group where Res is activated but

L_{E m}

is inactivated, the latter is significantly inferior to the optimal experimental group in all metrics, exhibiting a noticeable performance degradation. This confirms that

L_{E m}

plays a critical role in improving reconstruction accuracy and optimizing perceptual quality. Third, a comparison between the control group with neither

L_{E m}

nor Res activated and the two control groups lacking only a single component reveals that the former has further degraded performance and is significantly inferior to the latter two. This indicates that the synergistic effect of the two core components can form a superimposed performance gain; the absence of either component will lead to performance degradation, and their combined effect can achieve a superimposed optimization effect. In addition, a comparison between the control group with only

L_{E m}

activated but Res inactivated and the control group with Res activated but

L_{E m}

inactivated shows that the former outperforms the latter, further confirming that

L_{E m}

plays a more critical role in enhancing model performance. Finally, all ESRDF-CNN-related experimental groups significantly outperform the bicubic interpolation method in all metrics, fully verifying the rationality of the model design proposed in this method.

In summary, the entropy matching loss can strengthen the overall structural alignment of information within regions through entropy-dimensional constraints, ensuring the structural consistency and precision of reconstructed images; the residual structure can effectively alleviate the gradient vanishing problem in deep network training, optimize feature transfer efficiency, and facilitate the complete preservation and transmission of effective features; the synergistic effect of these two components can achieve the superposition of performance gains, effectively improve the comprehensive reconstruction performance of the model, and enable the generated images to possess both higher objective accuracy and superior subjective realism.

6. Conclusions

This paper focuses on diffusion models based on residual structures and proposes the Entropy Subtraction-Supported Residual-Diffusion Framework (ESRDF), aiming to address the core challenges of low sampling efficiency and over-smoothing in reconstructed images of traditional diffusion models in image SR tasks. Unlike traditional methods that only use LR images as conditions to guide the generation of HR images, ESRDF constructs a collaborative architecture consisting of a CNN-based main path predictor and a diffusion branch: the main path CNN is responsible for efficiently extracting LR image features and initially reconstructing the basic structure, while the diffusion branch focuses on the accurate modeling of low-entropy residual information. Through the entropy matching loss function (

L_{E m}

), it imposes strict constraints on the overall structural alignment of intra-regional information from the entropy dimension, forcing the feature distribution of LR images to align with that of HR images. Meanwhile, it cooperates with the feature loss (

L_{F e a t}

) and pixel loss (

L_{P i x e l}

) to form a multi-loss fusion optimization, ensuring the perceptual consistency and pixel accuracy of reconstructed images. By constraining the entropy of residual information and reducing redundant noise, this design not only effectively reduces the computational burden of the diffusion branch but also achieves a low-cost and high-speed sampling process, fundamentally alleviating the over-smoothing problem in reconstruction. Statistical analysis and systematic experimental verification on multiple public datasets (FFHQ, Div2k, Urban100) show that low-entropy data can significantly improve sampling efficiency. ESRDF outperforms comparative models in three core metrics (PSNR, SSIM, FID), which not only greatly shortens the training convergence period but also achieves breakthrough improvements in generation quality for both face and general SR tasks. Ablation experiments further confirm that the synergistic effect of the entropy matching loss and residual structure can produce a superimposed performance gain, which is the key for the model to overcome the problems of low sampling efficiency and over-smoothing. Currently, the feasibility of ESRDF has been verified on simple models. Future research is expected to further improve performance and generalization ability by deeply integrating it with larger-scale models.

Author Contributions

Conceptualization, H.H.; Methodology, H.H.; Validation, H.H.; Formal analysis, C.S.; Investigation, H.H.; Resources, H.H., C.H., X.S. and H.Y.; Writing—original draft, H.H.; Writing—review & editing, H.H.; Visualization, H.H.; Supervision, C.S., C.H., X.S. and H.Y.; Project administration, C.S.; Funding acquisition, H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China grant number 62176107 and the National Natural Science Foundation of China grant number 62376109.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author. The code is available at https://github.com/hriple/ESRDF.git, accessed on 1 January 2026.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Bashir, S.M.A.; Wang, Y.; Khan, M.; Niu, Y. A comprehensive review of deep learning-based single image super-resolution. PeerJ Comput. Sci. 2021, 7, e621. [Google Scholar] [CrossRef]
Chen, H.; He, X.; Qing, L.; Wu, Y.; Ren, C.; Sheriff, R.E.; Zhu, C. Real-World Single Image Super-Resolution: A Brief Review. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 124–145. [Google Scholar] [CrossRef]
Li, J.; Pei, Z.; Li, W.; Gao, G.; Wang, L.; Wang, Y.; Zeng, T. A Systematic Survey of Deep Learning-Based Single-Image Super-Resolution. ACM Comput. Surv. 2024, 56, 1–38. [Google Scholar] [CrossRef]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
Li, M.; Liu, W.; Shao, C.; Qin, B.; Tian, A.; Yu, H. Multi-Scale Feature Enhancement Method for Underwater Object Detection. Symmetry 2025, 17, 63. [Google Scholar] [CrossRef]
Shao, C.; Li, W.; Huo, J.; Feng, Z.; Gao, Y. Attention-based Investigation and Solution to the Trade-off Issue of Adversarial Training. Neural Netw. 2024, 174, 106224. [Google Scholar] [CrossRef]
Shang, S.; Shan, Z.; Liu, G.; Wang, L.; Wang, X.; Zhang, Z.; Zhang, J. ResDiff: Combining CNN and Diffusion Model for Image Super-Resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 8975–8983. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar] [CrossRef]
Dong, C.; Loy, C.C.; He, K.; Tang, X.O. Learning a Deep Convolutional Network for Image Super-Resolution. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 184–199. [Google Scholar] [CrossRef]
Kim, J.; Lee, J.K.; Lee, K.M. Accurate Image Super-Resolution Using Very Deep Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar] [CrossRef]
Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image Super-Resolution Using Very Deep Residual Channel Attention Networks. In Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland; pp. 286–301. [Google Scholar] [CrossRef]
Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–18 October 2016; Springer: Cham, Switzerland; pp. 694–711. [Google Scholar] [CrossRef]
Ledig, C.; Theis, L.; Huszar, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690. [Google Scholar] [CrossRef]
Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; Loy, C.C. ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks. In Proceedings of the 15th European Conference on Computer Vision Workshops, Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland; pp. 98–113. [Google Scholar] [CrossRef]
Zhang, Y.; Liu, S.; Dong, C.; Zhang, X.; Yuan, Y. Unsupervised Image Super-Resolution Using Cycle-in-Cycle Generative Adversarial Networks. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 814–819. [Google Scholar] [CrossRef]
Moser, B.B.; Shanbhag, A.S.; Raue, F.; Frolov, S.; Palacio, S.; Dengel, A. Diffusion Models, Image Super-Resolution and Everything: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 11793–11813. [Google Scholar] [CrossRef] [PubMed]
Lugmayr, A.; Danelljan, M.; Van Gool, L.; Timofte, R. SRFlow: Learning the Super-Resolution Space with Normalizing Flow. In Proceedings of the European Conference on Computer Vision, Cham, Switzerland, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 251–268. [Google Scholar]
Song, C.; Lang, Z.; Feng, W.; Zhang, L. E2FIF: Push the Limit of Binarized Deep Imagery Super-resolution Using End-to-end Full-precision Information Flow. IEEE Trans. Image Process. 2023, 32, 5379–5393. [Google Scholar] [CrossRef]
Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar] [CrossRef]
Saharia, C.; Ho, J.; Chan, W.; Salimans, T.; Fleet, D.J.; Norouzi, M. Image Super-Resolution Via Iterative Refinement. Adv. Neural Inf. Process. Syst. 2022, 35, 30789–30801. [Google Scholar] [CrossRef] [PubMed]
Li, H.; Yang, Y.; Chang, M.; Chen, S.; Feng, H.; Xu, Z.; Li, Q.; Chen, Y. SRDiff: Single Image Super-Resolution with Diffusion Probabilistic Models. Neurocomputing 2022, 479, 47–59. [Google Scholar] [CrossRef]
Whang, J.; Delbracio, M.; Talebi, H.; Saharia, C.; Dimakis, A.G.; Milanfar, P. Deblurring via Stochastic Refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11874–11884. [Google Scholar] [CrossRef]
Lhermitte, E.; Hilal, M.; Furlong, R.; O’Brien, V.; Humeau-Heurtier, A. Deep Learning and Entropy-Based Texture Features for Color Image Classification. Entropy 2022, 24, 1577. [Google Scholar] [CrossRef]
Xu, T.; Li, L.; Mi, P.; Zheng, X.; Chao, F.; Ji, R.; Tian, Y.; Shen, Q. Uncovering the Over-smoothing Challenge in Image Super-Resolution: Entropy-based Quantification and Contrastive Optimization. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 6199–6215. [Google Scholar] [CrossRef]
Dong, C.; Loy, C.C.; Tang, X. Accelerating the Super-Resolution Convolutional Neural Network. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–18 October 2016; Springer: Cham, Switzerland; pp. 391–407. [Google Scholar] [CrossRef]
He, X.; Cheng, J. Revisiting L1 Loss in Super-Resolution: A Probabilistic View and Beyond. arXiv 2022, arXiv:2201.10084. [Google Scholar] [CrossRef]
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680. [Google Scholar] [CrossRef]
Grover, A.; Dhar, M.; Ermon, S. Flow-GAN: Combining Maximum Likelihood and Adversarial Learning in Generative Models. Proc. AAAI Conf. Artif. Intell. 2018, 32, 2901–2908. [Google Scholar] [CrossRef]
Zhang, Q.; Chen, Y. Diffusion Normalizing Flow. Adv. Neural Inf. Process. Syst. 2021, 34, 16280–16291. [Google Scholar] [CrossRef]
Tang, Z.; Jiang, C.; Cui, Z.; Shen, D. HF-ResDiff: High-Frequency-Guided Residual Diffusion for Multi-dose PET Reconstruction. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Marrakech, Morocco, 8–12 October 2024; Springer: Cham, Switzerland, 2024; pp. 372–381. [Google Scholar] [CrossRef]
Croitoru, F.A.; Hondru, V.; Ionescu, R.T.; Shah, M. Diffusion Models in Vision: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10850–10869. [Google Scholar] [CrossRef] [PubMed]
Saharia, C.; Saxena, S.; Li, L.L.; Whang, J.; Denton, E.; Ghasemipour, S.K.S.; Gontijo-Lopes, R.; Ayan, B.K.; Mahdavi, S.S.; Salimans, T.; et al. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In Proceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022; pp. 36479–36494. [Google Scholar] [CrossRef]
Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar] [CrossRef]
Wang, Y.; Yu, J.; Zhang, J. Zero-Shot Image Restoration Using Denoising Diffusion Null-Space Model. arXiv 2022, arXiv:2212.00490. [Google Scholar] [CrossRef]
Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar] [CrossRef]
Ustinova, E.; Lempitsky, V. Learning Deep Embeddings with Histogram Loss. Adv. Neural Inf. Process. Syst. 2016, 29, 4173–4181. [Google Scholar] [CrossRef]
Luan, F.; Paris, S.; Shechtman, E.; Bala, K. Deep Painterly Harmonization. Comput. Graph. Forum 2018, 37, 95–106. [Google Scholar] [CrossRef]
Wang, G.; Zhang, Z.J. An Improved Method for Histogram Equalization with Controllable Image Entropy. J. Nantong Vocat. Univ. 2020, 34, 67–71. Available online: http://dianda.cqvip.com/Qikan/Article/Detail?id=7103842966 (accessed on 1 January 2026).
Kim, J.H.; Lee, J.S. Deep Residual Network with Enhanced Upscaling Module for Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–23 June 2018; pp. 2072–2080. [Google Scholar]
Cheon, M.; Kim, J.H.; Choi, J.H.; Lee, J.S. Generative Adversarial Network-Based Image Super-Resolution Using Perceptual Content Losses. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 471–486. [Google Scholar] [CrossRef]
Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein Generative Adversarial Networks. In Proceedings of the International Conference on Machine Learning (ICML), Sydney, Australia, 6–11 August 2017; pp. 214–223. [Google Scholar] [CrossRef]
Asokan, S.; Seelamantula, C.S. Spider GAN: Leveraging Friendly Neighbors to Accelerate GAN Training. IEEE Trans. Image Process. 2023, 32, 3883–3893. [Google Scholar] [CrossRef]
Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. A Style-Based Generator Architecture for Generative Adversarial Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4401–4410. [Google Scholar] [CrossRef]
Agustsson, E.; Timofte, R. NTIRE 2017 Challenge on Single Image Super-Resolution: Dataset and Study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 1122–1131. Available online: https://openaccess.thecvf.com/content_cvpr_2017_workshops/w12/html/Agustsson_NTIRE_2017_Challenge_CVPR_2017_paper.html (accessed on 1 January 2026).
Huang, J.B.; Singh, A.; Ahuja, N. Single Image Super-Resolution from Transformed Self-Exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 5197–5206. [Google Scholar] [CrossRef]
Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. Adv. Neural Inf. Process. Syst. 2017, 30, 6626–6637. [Google Scholar] [CrossRef]
Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features As a Perceptual Metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar] [CrossRef]

Figure 1. Diffusion model denoising illustration. (a) Traditional diffusion framework for natural image modeling; (b) Residual diffusion framework for residual image modeling.

Figure 2. Reconstruction stats visualization across entropy levels. Downsample 8×, then reconstruct via bilinear, bicubic, or our predictor; lower-entropy images yield better LPIPS.

Figure 3. Subplot-left: Residual versus natural-image entropy and W-distance to Gaussian noise. Subplot-right: ESRDF exploits residual for better reconstruction.

Figure 4. The visualization of ESRDF architecture. A residual diffusion net: The main-path CNN maintains low residual entropy of the branch, corresponding to the cloud region in the figure.

Figure 5. The CNN architecture of ESRDF.

Figure 6. Reconstruction Result Visualization on FFHQ (left) and DIV2K (right). The boxes indicate the key region for observing the recovery effect.

Figure 7. Reconstruction Results Visualization on Urban100 Dataset. The areas marked by red boxes denote the reconstructed image patches for detailed visual comparison.

Figure 8. The visualization of FFHQ SR reuslt. Subplot-left: ESRDF has a slightly faster convergence speed than ResDiff, and its PSNR is also slightly better. Subplot-right: ESRDF demonstrates superior FID metrics on the validation set compared to ResDiff. Moreover, by employing fewer sampling steps, it achieves higher sampling efficiency.

Table 1. A quantitative comparison was conducted in the FFHQ.

Methods	PSNR	SSIM	FID
SRGAN	17.57	0.688	156.07
ESRGAN	15.43	0.267	166.36
BRGM	24.16	0.70	-
PULSE	15.74	0.37	-
SRDiff	26.07	0.794	72.36
SR3	25.37	0.778	75.29
ResDiff	26.73	0.818	70.54
ESRDF	26.80	0.821	62.69

Table 2. A quantitative comparison was conducted in the DIV2K and Urban100.

DataSets	DIV2K $4 \times$			Urban100 $4 \times$
Methods	PSNR	SSIM	FID	PSNR	SSIM	FID
SRDiff	26.87	0.69	110.32	26.49	0.79	51.37
SR3	26.17	0.65	111.45	25.18	0.62	61.14
ResDiff	27.94	0.72	106.71	27.43	0.82	42.35
ESRDF	28.55	0.76	91.31	27.55	0.85	39.19

Table 3. Ablation experiments on different model components were conducted on the FFHQ

(4 \times)

.

Table 3. Ablation experiments on different model components were conducted on the FFHQ

(4 \times)

.

Methods	Model Components				Metrics
Methods	$L_{pixel}$	$L_{feat}$	$L_{Em}$	$Res$	PSNR	SSIM	FID
ESRDF-CNN	✓	✓	✓	✓	26.80	0.821	62.69
ESRDF-CNN	✓	✓	✓		26.77	0.819	63.27
ESRDF-CNN	✓	✓		✓	26.52	0.811	70.96
ESRDF-CNN	✓	✓			26.46	0.809	71.43
Bicubic				✓	25.71	0.783	82.04
Bicubic					25.64	0.781	82.11

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Huang, H.; Shao, C.; Hu, C.; Shu, X.; Yu, H. Entropy Subtraction-Supported Residual-Diffusion Framework for Image Super-Resolution. Symmetry 2026, 18, 193. https://doi.org/10.3390/sym18010193

AMA Style

Huang H, Shao C, Hu C, Shu X, Yu H. Entropy Subtraction-Supported Residual-Diffusion Framework for Image Super-Resolution. Symmetry. 2026; 18(1):193. https://doi.org/10.3390/sym18010193

Chicago/Turabian Style

Huang, Honghe, Changbin Shao, Chunlong Hu, Xin Shu, and Hualong Yu. 2026. "Entropy Subtraction-Supported Residual-Diffusion Framework for Image Super-Resolution" Symmetry 18, no. 1: 193. https://doi.org/10.3390/sym18010193

APA Style

Huang, H., Shao, C., Hu, C., Shu, X., & Yu, H. (2026). Entropy Subtraction-Supported Residual-Diffusion Framework for Image Super-Resolution. Symmetry, 18(1), 193. https://doi.org/10.3390/sym18010193

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Entropy Subtraction-Supported Residual-Diffusion Framework for Image Super-Resolution

Abstract

1. Introduction

2. Related Works

2.1. Non-Diffusion-Based Methods

2.2. Diffusion-Based Methods

2.3. Entropy Subtraction Supported Residual-Diffusion Framework

3. Diffusion Model

4. Methodology

4.1. Residual Framework in SISR

4.2. Motivation

4.3. ESRDF

4.4. Structure and Training

4.5. Discussion

Wasserstein-Angle: Shorter Is Better

5. Experiments

5.1. Experimental Settings

5.2. Performance

5.3. Ablation Study

5.4. Experimental Performance Analysis

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI