Review

A Comprehensive Review of Image Restoration Research Based on Diffusion Models

1 School of Management Science and Information Engineering, Jilin University of Finance and Economics, Changchun 130117, China
2 Center for Artificial Intelligence, Jilin University of Finance and Economics, Changchun 130117, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Mathematics 2025, 13(13), 2079; https://doi.org/10.3390/math13132079
Submission received: 13 May 2025 / Revised: 21 June 2025 / Accepted: 22 June 2025 / Published: 24 June 2025

Abstract

Image restoration is an indispensable and challenging task in computer vision, aiming to enhance the quality of images degraded by various forms of degradation. Diffusion models have achieved remarkable progress in AIGC (Artificial Intelligence Generated Content) image generation, and numerous studies have explored their application in image restoration, achieving performance surpassing that of other methods. This paper provides a comprehensive overview of diffusion models for image restoration, starting with an introduction to the background of diffusion models. It summarizes relevant theories and research in utilizing diffusion models for image restoration in recent years, elaborating on six commonly used methods and their unified paradigm. Based on these six categories, this paper classifies restoration tasks into two main areas: image super-resolution reconstruction and frequency-selective image restoration. The frequency-selective image restoration category includes image deblurring, image inpainting, image deraining, image desnowing, image dehazing, image denoising, and low-light enhancement. For each area, this paper delves into the technical principles and modeling strategies. Furthermore, it analyzes the specific characteristics and contributions of the diffusion models employed in each application category. This paper summarizes commonly used datasets and evaluation metrics for these six applications to facilitate comprehensive evaluation of existing methods. Finally, it concludes by identifying the limitations of current research, outlining challenges, and offering perspectives on future applications.

1. Introduction

In the field of computer vision, image restoration refers to reconstructing a low-quality image by targeting its damaged regions and using the surrounding pixels to fill in or remove their content. Image restoration plays an irreplaceable role in improving the subjective quality of images. Furthermore, in the field of medical imaging, the results of image restoration also play a significant role in medical diagnostics.
Depending on the type of image damage, image restoration tasks can be divided into six main categories: image super-resolution, image deblurring, image inpainting, image deraining/desnowing/dehazing, image denoising, and low-light enhancement. Traditional image restoration is mainly based on patch methods, whose principle is to cut the damaged image into a number of regions, find other patches in the image whose content is similar to that of the damaged region, and copy them into that region as filling. Such methods are easy to implement, but patchwork traces appear easily and the restoration quality is often unsatisfactory. With the development of deep learning, generative models such as the autoregressive model (AR), variational autoencoder (VAE), flow model (FLOW), and generative adversarial network (GAN) have effectively improved the quality of restored images; in particular, with the emergence of generative adversarial networks, seminal works on image restoration have been proposed one after another.
For instance, Ouyang et al. [1] proposed a new framework that combines a U-shaped Transformer with a generative adversarial network (GAN) for the efficient handling of complex image degradation problems. Their method significantly reduces computational complexity through a locally enhanced window Transformer block while effectively improving the realism of image detail restoration using the adversarial training mechanism of GANs. It also introduces a multi-scale modulator to achieve the targeted restoration of different degradation regions. This approach has achieved breakthrough performance on multiple tasks, including image denoising, deraining, and deblurring, driving the development of GAN-based image restoration techniques. Furthermore, Rama et al. [2] proposed a GAN-based image restoration method that effectively addresses the issues of insufficient detail recovery and artifact generation in complex degradation scenarios that arise with traditional methods through an improved generator and discriminator architecture. This method employs a multi-scale feature fusion mechanism and an adaptive attention module to significantly enhance the realism and structural consistency of image restoration, having a profound impact on image enhancement tasks in specialized fields such as medical imaging and remote sensing.
In the field of medical image analysis, Masmoudi et al. [3] addressed the challenges of detecting gastrointestinal ulcers in Wireless Capsule Endoscopy (WCE) images by proposing a deep learning-based optimization solution. Through multi-scale deep feature fusion and a dual-branch classification architecture, their approach effectively extracted optimal features of ulcers and suppressed background interference, overcoming the limitations of traditional methods in terms of insufficient feature discrimination and low classification accuracy in complex backgrounds. Similarly, Habib et al. [4], addressing the limitations of traditional medical image analysis methods for COVID-19 detection, introduced an intelligent hybrid detection technique combining deep learning and hand-crafted features. Their method integrated the automatic feature extraction capabilities of the deep convolutional network ResNet with domain-specific prior knowledge from hand-designed features, effectively addressing the issues of insufficient feature representation and poor generalization ability of single methods in X-ray/CT images. Furthermore, in the area of image enhancement and denoising, Muslim et al. [5] proposed a domain knowledge-based intelligent image processing method, combining a variational optimization framework based on a physical degradation model with the non-linear mapping capabilities of deep convolutional networks. Their approach utilized a knowledge-guided attention mechanism to adaptively balance denoising and detail enhancement, resolving the problems of insufficient detail preservation and poor adaptability in complex scenarios.
Self-supervised methods for image restoration are also a common approach in this field. These methods eliminate the need for large amounts of paired labeled data, greatly reducing the cost of data acquisition, and the pre-trained self-supervised features can be directly transferred to different restoration tasks, significantly improving performance in small-sample scenarios. For example, Li et al. [6] proposed a hybrid supervised-self-supervised learning framework that effectively addresses the challenges of difficult data acquisition and high annotation costs in hyperspectral image restoration by combining a small amount of labeled data with a large amount of unlabeled data. This method designs a dual-branch network architecture that utilizes limited labeled data in the supervised branch to learn accurate spectral–spatial feature mappings, while leveraging the unlabeled data in the self-supervised branch to mine the intrinsic physical laws of hyperspectral images through the design of spectral continuity preservation and spatial context reconstruction tasks. This provides a more practical solution for hyperspectral image processing in fields such as remote sensing and agricultural monitoring. Additionally, Monroy et al. [7] proposed a self-supervised learning framework that addresses the limitations of existing methods that are confined to Gaussian noise assumptions by extending the traditional “noise-to-noise” learning paradigm. This method innovatively designs a generalized corruption mechanism that constructs a controllable non-Gaussian noise degradation process, enabling the model to learn more complex real-world noise distribution characteristics and achieving restoration performance comparable to fully supervised methods in scenarios dominated by non-Gaussian noise, such as medical imaging and low-light photography. This opens up new avenues for the application of self-supervised learning in real-world complex degradation scenarios.
In recent years, diffusion models, as a new branch of generative models, have made a series of breakthroughs in generative tasks. Compared with GANs, diffusion models produce high-fidelity and diverse results, and have thus replaced GANs in a range of applications. With the development of visual language modeling, diffusion models have been extended to cross-modal generation, which greatly facilitates the development of Artificial Intelligence Generated Content (AIGC). With their powerful generative ability and solid theoretical basis, diffusion models are able to recover images damaged for various reasons, which has greatly promoted research on image restoration. Figure 1 illustrates the timeline of recent developments in diffusion modeling.
In recent years, many scholars have reviewed the task of image restoration based on diffusion models. For example, Xin et al. [8] reviewed the field from the representatives of supervised-based methods and pre-training-based methods, but did not categorize the tasks in the field of image restoration. Ziwei et al. [9] mainly introduced the conditional diffusion model, untrained diffusion model, and degradation-oriented image diffusion model to illustrate the image restoration tasks from three perspectives. Currently, there is a lack of comprehensive surveys summarizing recent advancements in diffusion model-based image restoration. Therefore, this paper synthesizes a substantial body of work from the past 2 years, building upon previous research to provide a detailed overview of structural improvements to diffusion models, restoration method principles, and various restoration tasks. It comprehensively introduces the application of diffusion models to six types of image restoration and the development status of various models over the past 2 years, analyzing their respective characteristics and advantages/disadvantages. In terms of future development directions, we highlight the progress in knowledge distillation, model lightweighting, and conditional diffusion model sparsification, providing new perspectives on the future development of diffusion models.

2. Definition of the Diffusion Model and Its Improvement

In recent years, diffusion models have sparked a revolution in the field of generative modeling; they use Markov chain modeling to decompose a complex and unstable generative process into multiple simple and stable reverse denoising steps. Diffusion models have made a series of breakthroughs in image restoration tasks.

2.1. Definition of Diffusion Models

The diffusion model consists of a forward diffusion process and a reverse process. The forward diffusion process gradually adds noise to the real data samples until they degenerate into almost isotropic Gaussian noise, and the reverse process learns to denoise from the Gaussian noise by training a neural network, thus recovering the original data. It is worth noting that the input and output dimensions of the forward and reverse processes of the diffusion model should be consistent; the U-Net network structure meets this requirement with small computational overhead, so it is widely used in the reverse denoising process of diffusion models. By definition, diffusion models can be divided into denoising diffusion probabilistic models [10,11] (DDPMs), score-based generative models [12,13] (SGMs), and stochastic differential equations [14,15] (SDEs).

2.1.1. Denoising Diffusion Probabilistic Models

The denoising diffusion probabilistic model (DDPM) is inspired by nonequilibrium thermodynamics [16]; Ho et al. [10] further developed DDPM and successfully accomplished a series of image generation tasks with good results. A schematic of the denoising diffusion probabilistic model is shown in Figure 2.
According to the time step t, the forward process of DDPM gradually adds Gaussian noise to samples from the given initial data distribution $x_0 \sim q(x_0)$ to obtain noisy data, as follows:
$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\big), \quad t \in \{1, \dots, T\},$  (1)
where T is the total number of diffusion steps, the noise variance $\beta_t$ at each step is a schedule hyperparameter, and $\mathbf{I}$ is the identity matrix with the same dimensions as the input sample $x_0$; the sample at each step depends only on the sample at the previous moment. As t keeps increasing, the noisy data gradually approach a Gaussian distribution. In order to simplify the calculation, introduce $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=0}^{t}\alpha_s$, and transform Equation (1) into
$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big).$  (2)
When the time step t tends towards infinity, $x_T$ becomes a standard Gaussian distribution, and the noisy data $x_t$ at any moment t can be obtained directly from Equation (2).
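As a concrete illustration of Equation (2), the following minimal sketch draws $x_t$ from $q(x_t \mid x_0)$ in a single step. The linear $\beta_t$ schedule, the step count T, and the tensor shapes are illustrative assumptions, not settings prescribed by the cited works.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)         # beta_t schedule (hyperparameters)
alphas = 1.0 - betas                          # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)     # bar(alpha)_t = prod_{s<=t} alpha_s

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) in closed form, as in Equation (2)."""
    ab = alpha_bars[t].view(-1, 1, 1, 1)      # broadcast over (B, C, H, W)
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise

x0 = torch.randn(4, 3, 64, 64)                # toy batch standing in for clean images
t = torch.randint(0, T, (4,))
x_t = q_sample(x0, t, torch.randn_like(x0))
```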
In the inverse process, data samples $x_0$ are generated by iteratively sampling noise vectors $x_{t-1} \sim p(x_{t-1} \mid x_t)$ until $t = 1$. Since the true $p(x_{t-1} \mid x_t)$ depends on the entire data distribution and is intractable, a neural network is trained to approximate it, as in the following Equation (3):
$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big),$  (3)
where θ is a model parameter.
To enable the trained inverse Markov process to accurately match the forward process, the parameter θ needs to be adjusted so that the joint distribution of the inverse process is gradually close to that of the forward process. The DDPM achieves this by optimizing the variational lower bound on the negative log-likelihood, as shown in Equation (4).
$\mathrm{KL}\big(q(x_0, x_1, \dots, x_T)\ \|\ p_\theta(x_0, x_1, \dots, x_T)\big),$  (4)
where KL is the relative entropy between two probability distributions, also known as the KL divergence. To simplify the training process, the model sets the variance to a constant, and the trainable mean $\mu_\theta(x_t, t)$ is given by the following Equation (5):
$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, z_\theta(x_t, t)\right),$  (5)
where $z_\theta(x_t, t)$ is the noise to be predicted by the network.
In practical training, the loss function typically considers two aspects: the noise prediction loss and the variational lower bound. The following Equation (6) demonstrates the noise prediction loss:
$L(\theta) = \mathbb{E}_{t, x_t}\left[\left\| z_t - z_\theta(x_t, t) \right\|^2\right],$  (6)
where $z_t$ represents standard Gaussian noise, and $z_\theta$ represents a neural network that, given input $x_t$ and t, predicts the noise added to $x_t$.
Equation (7) presents the variational lower bound loss function, with the following specific form:
$L_{\mathrm{VLB}} = \mathbb{E}_{t, x_0}\left[ D_{\mathrm{KL}}\big( q(x_{t-1} \mid x_t, x_0)\ \|\ p_\theta(x_{t-1} \mid x_t) \big) \right],$  (7)
where $D_{\mathrm{KL}}$ represents the KL divergence, which measures the similarity between the forward and reverse distributions. By maximizing the lower bound of the data likelihood, we ensure that the generative process learned by the model asymptotically approaches the true data distribution. Here, q and $p_\theta$ represent the forward and reverse data distributions, respectively.
In theoretical studies, diffusion models are trained by jointly minimizing Equations (6) and (7). However, in practical applications, the variational lower bound has a high computational complexity. Furthermore, the noise prediction loss (Equation (6)) alone can achieve excellent experimental results and stable training. Therefore, in practice, researchers often use Equation (6) as the final loss function for training DDPMs.
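The following minimal sketch shows one training step with the simplified noise-prediction loss of Equation (6). It reuses `q_sample` and `alpha_bars` from the earlier snippet, and `model` stands for any noise-prediction network such as a U-Net; these are assumptions for illustration rather than a specific published implementation.

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(model, x0, optimizer, T=1000):
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # random time steps
    noise = torch.randn_like(x0)                                # z_t in Equation (6)
    x_t = q_sample(x0, t, noise)                                # forward diffusion to step t
    pred = model(x_t, t)                                        # z_theta(x_t, t)
    loss = F.mse_loss(pred, noise)                              # || z_t - z_theta(x_t, t) ||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```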

2.1.2. Score-Based Generative Models

Two key elements of the score-based generative model are the score and the Langevin dynamics [17]. The score is the gradient of the log data density: for independent and identically distributed samples $\{x_i \in \mathbb{R}^D\}_{i=1}^{N}$ from the data distribution $p_{\mathrm{data}}(x)$, the score of the probability density $p(x)$ is $\nabla_x \log p(x)$. In order to better match the scores, it is necessary to train a score network $s_\theta(x): \mathbb{R}^D \to \mathbb{R}^D$, where θ is the network parameter. Samples are then generated from the learned score of the (noisy) data distribution by Langevin dynamics. The score network is trained by minimizing the following objective function:
$\mathbb{E}_{p(x)}\left[\left\| s_\theta(x) - \nabla_x \log p(x) \right\|_2^2\right].$  (8)
However, this method is not very accurate in estimating scores in regions of low data density, and thus cannot obtain high-quality samples. Therefore, the Noise Conditional Score Networks (NCSNs) [13] approach adds multi-scale noise to perturb the low-density regions, with noise levels decreasing by standard deviation $\sigma_1 > \sigma_2 > \dots > \sigma_N$, giving regions with low data density information about how to reach the high-density regions. The perturbed distribution is $q_\sigma(\tilde{x}) \triangleq \int p(x)\, \mathcal{N}(\tilde{x} \mid x, \sigma^2 \mathbf{I})\, \mathrm{d}x$. NCSNs are optimized by matching the noise-conditional score, with the objective shown as follows:
$L(\theta, \sigma) = \frac{1}{2}\, \mathbb{E}_{p(x)}\, \mathbb{E}_{\tilde{x} \sim \mathcal{N}(x, \sigma^2 \mathbf{I})}\left[\left\| s_\theta(\tilde{x}, \sigma) + \frac{\tilde{x} - x}{\sigma^2} \right\|_2^2\right].$  (9)
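The objective above can be implemented directly, since for Gaussian perturbation the target score equals $-(\tilde{x}-x)/\sigma^2$. The sketch below computes the loss for one noise level; `score_net` is a placeholder network and the shapes are illustrative assumptions.

```python
import torch

def ncsn_loss(score_net, x, sigma: float):
    noise = torch.randn_like(x) * sigma
    x_tilde = x + noise                          # x_tilde ~ N(x, sigma^2 I)
    target = -noise / (sigma ** 2)               # -(x_tilde - x) / sigma^2
    pred = score_net(x_tilde, sigma)             # s_theta(x_tilde, sigma)
    # 1/2 * || pred - target ||_2^2, averaged over the batch
    return 0.5 * ((pred - target) ** 2).sum(dim=(1, 2, 3)).mean()
```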

2.1.3. Stochastic Differential Equations

The SDE formulation [14,15], a diffusion model expressed in terms of stochastic differential equations, provides a unified form of denoising diffusion probabilistic models and score-based models. Both DDPMs and score-based generative models (SGMs) are discretizations of stochastic differential equations (SDEs): DDPM corresponds to a discrete-time SDE, while the Langevin dynamics of SGMs are a discrete sampling method for the reverse-time SDE. SDEs generalize both DDPMs and SGMs to continuous diffusion time, enabling more flexible and diverse noise sampling methods. SDEs use a continuous stochastic differential equation to describe the continuous diffusion process, of the following form:
$\mathrm{d}x = f(x, t)\, \mathrm{d}t + g(t)\, \mathrm{d}w,$  (10)
where $f(x, t)$ and $g(t)$ are the drift and diffusion coefficients of the stochastic differential equation and w denotes the standard Wiener process, also known as Brownian motion. The drift coefficient controls the trend of the process and ensures that the fully diffused samples follow a Gaussian distribution, while the diffusion coefficient controls the magnitude of the random noise perturbation. The above continuous diffusion process can be inverted by the corresponding reverse-time SDE, which is given as follows:
$\mathrm{d}x = \left[f(x, t) - g(t)^2\, \nabla_x \log p_t(x)\right] \mathrm{d}t + g(t)\, \mathrm{d}\hat{w},$  (11)
where $\hat{w}$ is the standard Wiener process with time flowing backwards from T to 0, and $\nabla_x \log p_t(x)$ is the score function, which plays the same role as the score model in score-based generative models (SGMs). The reverse SDE can be solved once $g(t)^2\, \nabla_x \log p_t(x)$ is accurately estimated. Similar to score-based generative models, SDEs require training a score network $s_\theta(x_t, t)$ for this estimation. The training objective function is given by the following equation:
$\min_\theta\, \mathbb{E}_{x_0, x_t}\left[\left\| s_\theta(x_t, t) - \nabla_{x_t} \log p(x_t \mid x_0) \right\|^2\right].$  (12)
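Once a score network is trained, samples can be drawn by numerically integrating the reverse-time SDE of Equation (11). The sketch below uses a simple Euler–Maruyama discretization; `f`, `g`, and `score_net` are placeholder callables and the step count is an illustrative assumption.

```python
import torch

def reverse_sde_sample(score_net, f, g, shape, N=1000, T=1.0, device="cpu"):
    """Integrate dx = [f(x,t) - g(t)^2 * score] dt + g(t) dw_hat from t=T down to 0."""
    dt = -T / N                                   # negative step: time runs backwards
    x = torch.randn(shape, device=device)         # start from the Gaussian prior
    for i in range(N):
        t = T + i * dt                            # current (scalar) time
        t_batch = torch.full((shape[0],), t, device=device)
        drift = f(x, t) - (g(t) ** 2) * score_net(x, t_batch)   # reverse-time drift
        x = x + drift * dt + g(t) * (abs(dt) ** 0.5) * torch.randn_like(x)
    return x
```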

2.2. Common Improvements to Diffusion Models

2.2.1. Restructuring the Network

The inputs and outputs of the diffusion model need to ensure the consistency of dimensionality, and the U-Net network architecture satisfies this requirement and reduces the computational resource consumption by downsampling the feature space at multiple granularities. A series of works have been devoted to improving the U-Net network architecture, such as coupling the cross-attention module [18], the multi-head attention module [13,14,19], and the position encoding [10] in the U-Net architecture.
In recent years, Transformer has achieved great success in the field of natural language processing and has gradually been applied to areas such as image processing. Since it can effectively capture long-range dependencies and integrate multimodal information, the Transformer structure is often used when text serves as conditional information to guide image restoration. For example, Yang et al. [20] proposed using a trained ViT (Vision Transformer) model in the reverse process to recover the original image from noise, which showed excellent performance. Another example is Guided Language to Image Diffusion for Generation and Editing (GLIDE) [21], which uses a pre-trained Transformer model to encode text into text vectors that guide the reverse process of the diffusion model, progressively recovering a textually described image from the noise. Meanwhile, to address the problem of insufficient matching between text content and generated images, K. Lee et al. [22] proposed using human feedback to train a reward function that predicts human ratings of how well image–text pairs match, fine-tuning the model to improve the text alignment of the text-to-image generation model.

2.2.2. Accelerated Sampling Process

Although diffusion models achieve good results in image generation tasks, the quality of generation depends heavily on the number of sampling steps, which presents significant challenges in terms of resource consumption and sampling efficiency. For this reason, J. Song et al. [23] proposed the DDIM model, which introduces a non-Markovian chain in the forward diffusion process and models the process as a continuous time-varying Ordinary Differential Equation (ODE) describing the changes of the image during forward diffusion, thus allowing DDIM to sample at arbitrary time points, improving sampling efficiency and generating images that better match the target. When introducing the non-Markovian chain, DDIM performs sampling by skipping a fixed number of steps, s, along a uniform subsequence, executing reverse updates only at key steps. DDIM directly utilizes the noise prediction model to estimate the initial data, $x_0$, thereby reducing dependence on intermediate steps and eliminating redundant steps that are inconsequential to restoration quality. To ensure restoration quality, the number of sampling steps, T, needs to balance restoration quality against sampling efficiency: if T is too small, the restored result may suffer from detail loss; if T is too large, the inference time becomes comparable to that of DDPM, offering no significant improvement. Therefore, in practical applications, researchers often set T = 50, performing 50 sampling steps to strike a balance between restoration quality and inference time, achieving acceleration without compromising the restoration results. Ordinary Differential Equation (ODE) solvers and distillation are two prevalent methods for accelerating sampling in diffusion models; however, both introduce error accumulation during acceleration, thereby affecting restoration quality. When using ODE solvers for accelerated sampling, the continuous time must be discretized into a finite number of steps; a small number of steps introduces errors due to the truncation of higher-order terms in the Taylor expansion, and these errors are cumulatively amplified in each iteration. When using distillation for accelerated sampling, a lightweight student model is trained to mimic the behavior of a multi-step teacher model, directly compressing the sampling steps; as a result, high-frequency details are omitted due to insufficient mode coverage, and prediction errors accumulate during iterative sampling, causing the generated results to deviate from the target distribution. To address these issues, C. Lu et al. [24] proposed DPM-Solver, a fast higher-order ODE solver that computes the analytical solution of the linear part of the diffusion ODE, achieving a higher order of convergence while mitigating truncation errors and obtaining a more accurate solution in fewer steps, thus generating high-quality samples in 10 to 20 iterations. In addition, Lou et al. [25] proposed an early stopping mechanism for truncating the diffusion process and sampling from non-Gaussian distributions generated by pre-trained VAE/GAN models. The diffusion model can also be guided to efficient sampling by generating a conditional prior [26].
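The following minimal sketch illustrates the skip-step DDIM update described above in its deterministic form (η = 0): the noise prediction is used to estimate $x_0$ and then to jump directly to an earlier step on a strided subsequence. It reuses `alpha_bars` and a generic noise predictor `model` from the earlier DDPM sketches, which are assumptions for illustration.

```python
import torch

@torch.no_grad()
def ddim_step(model, x_t, t: int, t_prev: int):
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t_prev]
    t_batch = torch.full((x_t.shape[0],), t, device=x_t.device)
    eps = model(x_t, t_batch)                                   # predicted noise
    x0_hat = (x_t - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()      # estimate of x_0
    return ab_prev.sqrt() * x0_hat + (1 - ab_prev).sqrt() * eps # jump to step t_prev

# Example schedule: 50 evenly spaced steps instead of the full 1000-step chain.
# timesteps = list(range(0, 1000, 20))[::-1]
# for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
#     x = ddim_step(model, x, t, t_prev)
```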
The previous section primarily presented the fundamental principles underlying various formulations of diffusion models, along with techniques to improve their performance, clearly outlining the background and development trajectory of diffusion models. In the following section, we will classify and describe the various image restoration methods that leverage diffusion models within this field.

3. Diffusion Model-Based Image Restoration Method

In recent years, diffusion models have been rapidly developing and occupying an important position in the field of generative artificial intelligence, and the work on image restoration utilizing the powerful generative ability of diffusion models has also attracted extensive attention from scholars.

3.1. Conditional Guided Image Restoration

In recent years, conditionally guided generative models based on diffusion models have achieved impressive results. For example, GLIDE [21] encodes text into text vectors using the Contrastive Language-Image Pre-Training (CLIP) [27] model; these are mapped to the latent space using a VAE [28] and serve as conditional information to guide the diffusion model for image generation. A similar work is Stable Diffusion [18], which accelerates the sampling process while using text vectors to guide the diffusion model's image generation.
Unlike the pure image generation task based on the diffusion model, the image restoration task aims to generate restored high-quality images from degraded images, i.e., learning the corresponding posterior distribution $p_\theta(x_{t-1} \mid y, x_t)$ at time step t, where $x_{t-1}$ denotes the restored image at moment $t-1$. In order to effectively utilize the conditional information in the degraded image to complete the image restoration task, C. Saharia et al. [29] proposed the SR3 model, as shown in Figure 3, which uses the U-Net network structure to predict the noise at moment $t-1$ by taking the degraded image y as the conditional information and the denoised output $x_t$ at moment t as the joint input to the diffusion model. When $t = 0$, the super-resolution output of the image is complete. Additionally, using the degraded image as conditional information, C. Saharia et al. [30] proposed the Palette method for image-to-image translation tasks, which controls the style and content of the output image by introducing a conditional mechanism that provides the diffusion model with a description of the target-style image or a text description. Excellent performance was achieved in tasks such as image colorization, image restoration, JPEG artifact removal, and uncropping.
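The joint input used by SR3-style conditioning can be realized by channel-wise concatenation of the (upsampled) degraded image with the current noisy estimate before each denoising call. The sketch below illustrates this idea; `denoiser` is a placeholder U-Net expecting six input channels, and the bicubic upsampling is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def conditional_noise_prediction(denoiser, x_t, y_lr, t):
    """Predict noise at step t conditioned on the degraded image y (SR3-style)."""
    y_up = F.interpolate(y_lr, size=x_t.shape[-2:], mode="bicubic", align_corners=False)
    net_in = torch.cat([x_t, y_up], dim=1)   # joint input [x_t, y]
    return denoiser(net_in, t)               # noise estimate guided by the condition
```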
In addition, Y. Zhang et al. [31] proposed a unified conditional framework that integrates information from multiple sources. It first inputs the degraded image into an initial predictor with a U-Net structure to produce a rough output, then combines this initial output with auxiliary conditions (e.g., the time step and the noise level) and feeds them into a diffusion model as conditional information, guiding the diffusion model to generate high-quality restoration results. To avoid artifacts around the restored image, an inter-step patch-splitting strategy is proposed, in which the input is divided into small overlapping patches and the overlapping regions are discarded when predicting the final output image, which greatly reduces artifacts around the generated image.
In order to adapt to different image types and image restoration tasks, B. Fei et al. [32] proposed a unified Generative Diffusion Prior (GDP) framework, which learns the data distribution of images by training a diffusion model, uses the model to generate image samples that conform to this distribution, and incorporates this model as prior information to guide image restoration. A similar approach was used by B. T. Feng et al. [33], who incorporated a score-based diffusion model as prior information into the inverse problem and quantified the uncertainty of the generated images using the Deep Probabilistic Imaging (DPI) [34] method, while defining a family of distributions $q_\theta$ using the RealNVP [35] normalizing flow with parameters θ, which is optimized by minimizing the KL divergence between the true and estimated posterior distributions.

3.2. Pre-Training-Based Image Restoration

Although the use of conditional guidance for image restoration tasks has achieved better results, problems such as the presence of artifacts at the edges of the restored image and slow inference speed still occur in the process of inverse generation of the diffusion model. In order to solve this problem, before utilizing the diffusion model for image processing, it can be considered to first pre-train the image, and then the pre-trained image can be used as the input of the diffusion model. The overall process of image restoration based on pre-training is shown in Figure 4. A. Niu et al. [36] proposed a single-image super-resolution model, conditional diffusion probabilistic model for single-image super-resolution (CDPMSR), based on conditional diffusion probabilistic models, which first uses the existing pre-trained super-resolution models (e.g., RCAN [37], SwinIR [38], EDSR [39]) to enhance the preprocessing of low-resolution images as the initial input images for the diffusion model. To solve the problem of continuous image super-resolution, S. Gao et al. [40] proposed the Implicit Diffusion Model (IDM) using pre-trained enhanced deep residual networks for single-image super-resolution (EDSR). Deep residual networks are used to extract features from low-resolution images, and then the features are downsampled, and the obtained features at different scales are used as inputs for the upsampling conditions of each layer of the diffusion model. In addition, L. Guo et al. [41] proposed the ShadowDiffusion model for the problem of image shadow removal, which uses a pre-trained Transformer to extract the degradation prior, which is used as auxiliary information to refine the shadow mask and to guide the diffusion model to generate shadow-free images. In order to improve the inference speed of the diffusion model for the image restoration task, B. Xia et al. [26] proposed the efficient DiffIR model for the extraction of a priori information for degraded images, using the DIRformer coupled with the U-Net and Transformer structure for pre-training the real and degraded image pairs, and in the final stage of the inverse process of the diffusion model, the recovery image generation using DIRformer.
In summary, the pre-training-based diffusion model image restoration method mainly enhances or extracts features from low-quality images, and uses the obtained images or features as conditions to guide the high-quality image restoration task.

3.3. Estimation-Based Image Restoration

3.3.1. Image Restoration Based on Posterior Estimation

To ensure the consistency between the restored image and the measured data of the original image, many scholars have considered direct a posteriori estimation of the diffusion model inverse process, which means estimating the data distribution of the restored image given the original degraded image as a way to reduce the bias of the restored image and ensure the reliability of the restoration results.
H. Chung et al. [42] proposed a generalized framework, Diffusion Posterior Sampling (DPS), based on the diffusion model to solve various image restoration inverse problems with known degradation types by incorporating the likelihood function of observed data from the degraded image into the reverse sampling process of a diffusion model. During inference, the reverse sampling process is modeled as a Bayesian inference problem, utilizing Bayes's theorem to compute the posterior distribution of the target data. DPS uses $p(y \mid \hat{x}_0)$ to approximate $p(y \mid x_t)$, where $\hat{x}_0 = \mathbb{E}[x_0 \mid x_t]$; based on this, the posterior gradient estimated by DPS is shown in Equation (13), as follows:
$\nabla_{x_t} \log p_t(y \mid x_t) \simeq \nabla_{x_t} \log p(y \mid \hat{x}_0) = -\frac{1}{\sigma^2}\, \nabla_{x_t} \left\| y - \mathcal{H}\big(\hat{x}_0(x_t)\big) \right\|_2^2.$  (13)
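In practice, the guidance term of Equation (13) can be applied by differentiating the data-fit residual with respect to $x_t$ and correcting the unconditional reverse update. The sketch below is a minimal, generic illustration; the degradation operator `H` and the step size `zeta` are assumptions, and `x0_hat` must have been computed from `x_t` with gradient tracking enabled.

```python
import torch

def dps_guidance(x_t, x0_hat, x_prev_uncond, y, H, zeta=1.0):
    """One DPS-style correction: subtract the gradient of ||y - H(x0_hat(x_t))||^2."""
    # x0_hat is the posterior-mean estimate E[x_0 | x_t] computed from x_t,
    # so autograd can trace the chain rule through x0_hat back to x_t.
    data_fit = ((y - H(x0_hat)) ** 2).sum()            # || y - H(x0_hat) ||^2
    grad = torch.autograd.grad(data_fit, x_t)[0]       # gradient w.r.t. x_t
    return x_prev_uncond - zeta * grad                 # guided x_{t-1}
```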
Building on this work, and because of the error accumulation problem in DPS, J. Song et al. [43] proposed the Pseudoinverse-Guided Diffusion Models (ΠGDM) framework, which introduces the concept of the pseudoinverse into the diffusion model and designs two variants, ΠGDM-S and ΠGDM-D, used to adjust the noise schedule and incorporate a regularization term. These guide the diffusion model to learn the solution of the posterior distribution and generalize the estimated posterior to linear, nonlinear, and differentiable inverse problems. The posterior gradient estimated by ΠGDM is shown in Equation (14), as follows:
$\nabla_{x_t} \log p_t(y \mid x_t) \approx \frac{1}{r_t^2} \left( \left( h^{\dagger}(y) - h^{\dagger}\big(h(\hat{x}_0)\big) \right)^{T} \frac{\partial \hat{x}_0}{\partial x_t} \right)^{T},$  (14)
where $r_t^2$ is set to $\frac{\sigma_t^2}{\sigma_t^2 + 1}$, $h^{\dagger}$ denotes the pseudoinverse of h, and h is the mapping function between the original signal $x_0$ and the observed image y. The ΠGDM model structure is shown in Figure 5.

3.3.2. Blind Image Kernel Estimation

The idea of kernel estimation originates from the problem of blind image restoration, namely, how to recover a clear image under an unknown degradation kernel. The accuracy of blur kernel estimation is therefore key to recovering high-quality images. The traditional approach uses convolutional neural networks for degradation kernel estimation. The degradation process of an image is modeled as shown in Equation (15), as follows:
$y = x * k + n,$  (15)
where k denotes the degradation kernel, n denotes random noise, x denotes the original image, and $*$ denotes the convolution operation. Inspired by this, scholars have used diffusion models for blur kernel estimation. H. Chung et al. [44] proposed the Blind Diffusion Posterior Sampling (BlindDPS) model, which is theoretically based on the DPS framework [42]. However, unlike DPS, BlindDPS addresses real-world blind image restoration problems, which involve restoring degraded images without prior knowledge of the degradation type or process. The model therefore pre-trains two parallel diffusion models on synthesized degradation kernels and runs them concurrently during restoration: one diffusion model estimates the degradation kernel, while the other restores the original image, and the two are optimized jointly. A similar work is the Gibbs Denoising Diffusion Restoration Models (GibbsDDRM) model proposed by N. Murata et al. [45], where a partially collapsed Gibbs sampler [46] is used in the inverse diffusion process to sample both the degradation kernel and the original image from the joint posterior distribution, effectively solving the blind image restoration problem.
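The degradation model of Equation (15) can be simulated directly, which is how synthetic training pairs for kernel-estimation methods are typically built. The sketch below is a minimal illustration; the depthwise convolution, kernel shape, and noise level are illustrative assumptions rather than the cited methods' settings.

```python
import torch
import torch.nn.functional as F

def degrade(x, kernel, noise_sigma=0.01):
    """Apply y = x * k + n to a batch of images x of shape (B, C, H, W)."""
    c = x.shape[1]
    k = kernel.expand(c, 1, *kernel.shape)              # one copy of k per channel
    pad = kernel.shape[-1] // 2
    blurred = F.conv2d(x, k, padding=pad, groups=c)     # y = x * k (depthwise conv)
    return blurred + noise_sigma * torch.randn_like(blurred)  # + n
```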

3.4. Image Restoration Based on Image Domain Transformation

3.4.1. Image Restoration Based on Latent Space and Decomposition Space

In order to alleviate the long training time and resource consumption of diffusion models, many scholars have employed diffusion models in the latent space for image restoration tasks. Stable Diffusion, proposed by R. Rombach et al. [18], is the first model based on a diffusion model to accomplish image generation tasks in the latent space. It utilizes a pre-trained autoencoder to transfer the image from the pixel space to the latent space, where the perceptual quality of the image is preserved, and performs diffusion in the latent space to effectively reduce the training cost. Building on this innovation, and because of the high training overhead of Stable Diffusion, Z. Luo et al. [47] proposed the Refusion model, which uses a U-Net network to compress the image into the latent space to accelerate training and adopts the Nonlinear Activation Free Network (NAFNet), an improvement on U-Net, as the noise prediction network to improve noise prediction accuracy. In addition, B. Kawar et al. [48] proposed the Denoising Diffusion Restoration Models (DDRMs), which perform a singular value decomposition of the degradation matrix H. The information in each dimension of the spectral space of H corresponds to a singular value, which effectively simplifies the solution of the linear inverse problem by diffusing in the spectral space, greatly reducing computational complexity and improving sampling efficiency. On this basis, B. Kawar et al. [49] studied the nonlinear inverse problem based on the noise-free special case of DDRM, and introduced the concept of the pseudoinverse to complete JPEG artifact correction. With a different goal, Y. Wang et al. [50] proposed the Denoising Diffusion Null-Space Model (DDNM) to accomplish a variety of zero-shot image restoration tasks; i.e., the DDNM model does not target a specific type of degradation and can recover high-quality images across tasks. Y. Wang et al. [50] proposed another decomposition strategy, the range-null space decomposition; given a noise-free degradation model $y = Hx$, it can be decomposed as shown in Equation (16) below:
$y = H H^{\dagger} H x + H\big(I - H^{\dagger} H\big) x,$  (16)
where $H^{\dagger}$ is the pseudoinverse of H; the range-space term satisfies $H H^{\dagger} H x = H x = y$, ensuring data consistency, while the null-space term satisfies $H(I - H^{\dagger} H) x = 0$ and mainly contains information irrelevant to the degradation process. The method eliminates the need to train multiple models and achieves excellent performance in various image restoration tasks.
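For intuition, the sketch below applies a DDNM-style range-/null-space correction in the simplest noise-free inpainting case, where H is a binary mask and the pseudoinverse $H^{\dagger}$ coincides with H itself; that simplification is an assumption made only to keep the example short.

```python
import torch

def ddnm_correct(x0_hat, y, mask):
    """Combine observed content (range space) with generated content (null space)."""
    # Range space: copy the observed pixels H^+ y exactly (data consistency).
    # Null space: keep the generated content (I - H^+ H) x0_hat, which the
    # measurement y cannot constrain.
    return mask * y + (1.0 - mask) * x0_hat
```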

3.4.2. Image Restoration Based on Data Domain Synthesis

The problem of real-world degraded image restoration faces various challenges due to the complexity of the causes of image degradation in reality and the difficulty in obtaining degraded-clear image pairs for training. Many scholars have proposed the use of domain transformation methods for the restoration of degraded images. The domain transformation method can be roughly divided into two methods: the transformation from the synthesized degraded image domain to the real-world degraded image domain, and the transformation from the real-world degraded image domain to the high-quality clear image domain. The overall steps of data domain-based synthesis are shown in Figure 6.
In order to solve the problem that real-world degraded–clear image pairs are difficult to obtain, T. Yang et al. [51] proposed to first transform high-quality images into synthetic low-quality images, perform a forward diffusion process on the synthetic low-quality images to obtain their prior distribution, and then perform an inverse diffusion process with a diffusion model pre-trained on a large number of real-world images to generate synthetic low-quality images that are closer to real-world low-quality images, finally obtaining real-world-style degraded–clear image pairs. Using the obtained image pairs for the image restoration task can better simulate diverse real-world scenes and improve the generalization of the model. In addition, M. Wei et al. [52] proposed a cycle-consistency approach for the image deraining task. This method utilizes two pre-trained generators: one generates a restored image from a real-world degraded image, and the other generates a degraded image from the restored image again; the generated restored image is then used as an input to the diffusion model to further remove rain from the image. This domain transformation method alleviates the poor restoration results caused by the lack of real-world image pairs in unsupervised learning. To address the issues of semantic misjudgment and insufficient adaptability caused by fixed feature fusion in complex scenes that arise with traditional dehazing methods, Zhang et al. [53] proposed an innovative framework combining semantic guidance and dynamic feature fusion. This method identifies scene categories through a lightweight semantic segmentation branch, differentially regulates the dehazing intensity, and designs an adaptive feature fusion (AFF) module to dynamically weight multi-scale features, enabling precise processing of fog-dense regions. By adding semantic constraints during the pixel-level domain transformation, the method completes the mapping from the hazy image domain to the clear image domain, providing a novel approach to image dehazing.

3.5. Projection-Based Image Restoration

In addition to the methods mentioned above, many studies have proposed improving the quality of restored images with projection-based methods, i.e., extracting information on the image's intrinsic structure, texture, etc., from the low-quality image as a complement to the generated image at each step of the diffusion model, so as to ensure data consistency between the original low-quality image and the generated image. In the image completion task, A. Lugmayr et al. [54] used the RePaint model to establish a projection from the low-quality image to the denoised image, replacing the region of the denoised image at step $t-1$ with the unmasked region of the defective image, as shown in Equation (17):
$x_{t-1} = m \odot x_{t-1}^{\mathrm{known}} + (1 - m) \odot x_{t-1}^{\mathrm{unknown}},$  (17)
where m denotes the mask of the known region, $x_{t-1}^{\mathrm{known}}$ is the result of forward-diffusing (adding noise to) the known region of the defective image to step $t-1$, and $x_{t-1}^{\mathrm{unknown}}$ is the result of sampling from the reverse denoising process.
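The per-step blending of Equation (17) can be written compactly as below. The sketch reuses `q_sample` from the earlier DDPM snippet for the known region, and `reverse_step` stands for one reverse denoising update; both are assumptions for illustration.

```python
import torch

def repaint_step(reverse_step, x_t, x0_known, mask, t):
    """One RePaint-style step: re-noise the known region, sample the unknown region, blend."""
    t_batch = torch.full((x0_known.shape[0],), t - 1)
    x_known = q_sample(x0_known, t_batch, torch.randn_like(x0_known))  # forward-diffused known part
    x_unknown = reverse_step(x_t, t)                                   # one reverse denoising step
    return mask * x_known + (1.0 - mask) * x_unknown                   # Equation (17)
```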
Another typical work is the Iterative Latent Variable Refinement (ILVR) method proposed by J. Choi et al. [55] for image super-resolution reconstruction using low-resolution projections. At moment $t-1$, the latent variable $x_{t-1}$ predicted by reverse sampling has the same low-resolution component as the result $y_{t-1}$ obtained by adding noise to the low-resolution image for $t-1$ steps of the diffusion process. Therefore, the corresponding part of $y_{t-1}$ is used to replace the low-frequency part of $x_{t-1}$. This projection ensures data consistency between the generated image and the original image, making small-sample image restoration based on the diffusion model possible, thus improving the quality of the generated image and the performance of the model.
The above content mainly classifies and summarizes commonly used methods for image restoration based on diffusion models, distilling and introducing various techniques. It clearly articulates the main ideas and commonly used solutions in this field. In the next section, we will group and categorize degraded images according to different tasks and degradation types, providing a detailed summary of each degradation type and clearly describing the characteristics that differentiate them.

4. Applications of Diffusion Models in Super-Resolution Reconstruction and Frequency-Selective Image Restoration

According to the different processing tasks required, the applications of diffusion models in image restoration can be divided into two main areas: image super-resolution reconstruction and frequency-selective image restoration. Frequency-selective image restoration includes image deblurring, image inpainting, image deraining, image desnowing, image dehazing, image denoising, and low-light enhancement. We introduce each of the six tasks and list the current SOTA models for these tasks.

4.1. Super-Resolution Reconstruction

Image super-resolution reconstruction (SR) is a technique to construct a high-resolution (HR) image by enhancing high-frequency components from one or more existing low-resolution (LR) images in the same scene. Traditional image super-resolution reconstruction algorithms include two categories of reconstruction-based and sample-based methods, but both suffer from the problems of blurring or loss of details in the reconstructed image and poor generalization ability. In recent years, with the rise of diffusion models, many researchers have utilized diffusion models for the task of super-resolution reconstruction.
Among the works on super-resolution reconstruction using diffusion models, the SR3 model proposed by C. Saharia et al. [29] is a pioneering work that implements the SR task using denoising diffusion probabilistic models. SR3 follows the same workflow as DDPM, iteratively refining the image step by step in the reverse process, conditioned on the low-resolution input, to generate a high-resolution image. Gaussian noise is first added to the target image until it degenerates into $\tilde{y} \sim \mathcal{N}(0, \mathbf{I})$; the noise $\epsilon$ is then predicted in the T denoising steps of the reverse process and removed iteratively to generate the HR image $y_0$. The optimization function is shown in Equation (18) below:
$\mathbb{E}_{(x,y)}\, \mathbb{E}_{\epsilon,\gamma} \left\| f_\theta\!\left(x,\ \underbrace{\sqrt{\gamma}\, y_0 + \sqrt{1-\gamma}\,\epsilon}_{\tilde{y}},\ \gamma\right) - \epsilon \right\|_p^p,$  (18)
where $\gamma \sim p(\gamma)$ is the noise level, ϵ is the noise to be predicted, and $\tilde{y}$ is the noisy target image. Since SR3 generates HR images by removing predicted noise, it is slightly worse than regression-based methods on the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) metrics, but better on the Fréchet Inception Distance (FID) and consistency metrics, as well as in human perceptual (fool-rate) evaluations. In order to address the limitations of this model, J. Ho et al. [56] proposed a three-layer cascaded diffusion model based on an improved SR3 backbone; except for the first diffusion model, each subsequent model takes the image output by the previous model as input and progressively generates a finer, higher-resolution image.
Since the HR image contains a larger amount of pixel information, the diffusion model needs to deal with higher-dimensional data, and it must run the sampling inference many times to generate the HR image, so it often suffers from inefficiency. In order to solve these problems, H. Li et al. [57] proposed the single-image Super-Resolution Diffusion probabilistic model (SRDiff) for single-image super-resolution reconstruction; the network improves the training speed of the model by using a U-Net structure and introducing residual connections. Due to the use of a CNN encoder, this model is commonly used to address image super-resolution problems, but its performance is suboptimal for motion blur. Furthermore, SRDiff is sensitive to the image content input into the model. For extremely low-resolution input images, SRDiff's restoration results are often distorted, and the training process is unstable. Building upon this understanding, in order to address the issues with SRDiff, R. Rombach et al. [18] proposed the Latent Diffusion Models (LDMs) architecture, which first uses a conditional autoencoder to decompose the LR feature information of the same class condition in the latent space into small subspaces, which are used to generate different parts of the image, such as texture, color, and shape. In addition, a cross-attention mechanism is introduced in the latent space, which enables the model to exchange information between features in different layers to better capture the global information of the image. By employing these two methods, model sampling efficiency can be effectively improved, and through information decomposition the training process can focus more on the corresponding spaces, ensuring training stability while also guaranteeing the quality of the generated HR images, representing a significant contribution.
In addition, real-world LR images may still suffer from noise, blurring, and artifacts, training image pairs are scarce, and the distribution of training data inevitably differs from that of real-world images. To better address the difficulty of real-world SR reconstruction, J. Wang et al. [58] proposed the Stable-SR model. The model is based on Stable Diffusion and proposes a time-aware encoder and a Spatial Feature Transform (SFT) module. The time-aware encoder extracts multi-scale features $\{F^n\}_{n=1}^{N}$ from the input LR image and adaptively adjusts the intermediate features $\{F_{\mathrm{dif}}^n\}_{n=1}^{N}$ of Stable Diffusion in the SFT module, as shown in Equation (19), as follows:
$\hat{F}_{\mathrm{dif}}^{n} = (1 + \alpha^{n}) \odot F_{\mathrm{dif}}^{n} + \beta^{n}; \quad \alpha^{n}, \beta^{n} = \mathcal{M}_{\theta}^{n}(F^{n}),$  (19)
where $\alpha^n$ and $\beta^n$ denote the affine parameters in SFT, $\mathcal{M}_\theta^n$ denotes a small network consisting of a series of convolutional layers, and n denotes the spatial scale of the U-Net structure in Stable Diffusion. In order to balance the fidelity requirement of SR tasks against the stochasticity of diffusion model generation, the Controllable Feature Wrapping (CFW) module is proposed, which takes the LR image encoder features $F_e$ and the Stable Diffusion decoder features $F_d$ as inputs and balances the quality and fidelity of the generated image, as shown in Equation (20), as follows:
$F_m = F_d + \mathcal{C}(F_e, F_d; \theta) \times w,$  (20)
where $\mathcal{C}(\cdot\,; \theta)$ denotes convolutional layers with trainable parameters θ, and $w \in [0, 1]$ is an adjustable coefficient. A progressive aggregation sampling strategy is also proposed, which enables the pre-trained diffusion model to adapt to any resolution and to handle real-world SR reconstruction more flexibly. With the above introduction, the diffusion model-based image super-resolution reconstruction methods can be summarized as shown in Table 1.
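The two modulations above can be sketched as follows: Equation (19) is an SFT-style affine modulation of a diffusion feature map by LR-derived conditions, and Equation (20) blends decoder and encoder features with an adjustable weight w. The `SFTModulation` module and `fuse_conv` below are placeholder components for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class SFTModulation(nn.Module):
    """Predict (alpha, beta) from condition features and modulate diffusion features."""
    def __init__(self, feat_ch: int, cond_ch: int):
        super().__init__()
        self.to_alpha = nn.Conv2d(cond_ch, feat_ch, 3, padding=1)
        self.to_beta = nn.Conv2d(cond_ch, feat_ch, 3, padding=1)

    def forward(self, f_dif, f_cond):
        alpha, beta = self.to_alpha(f_cond), self.to_beta(f_cond)
        return (1 + alpha) * f_dif + beta              # Equation (19)

def cfw_fuse(f_d, f_e, fuse_conv, w: float = 0.5):
    """Blend decoder features f_d with encoder features f_e, weighted by w (Equation (20))."""
    return f_d + fuse_conv(torch.cat([f_e, f_d], dim=1)) * w
```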

4.2. Image Restoration Based on Frequency Selection

Frequency-selective image restoration tasks are commonly divided into two main categories: high-frequency-selective image restoration and mixed high- and low-frequency image restoration. In the following sections, we will focus on restoration through different frequency selections, providing a detailed overview of various degradation tasks.

4.2.1. Image Deblurring

Depending on the type of blur, image blur can be categorized into three main types: motion blur, out-of-focus blur, and hybrid blur. Among these, motion blur is caused by camera shake and object motion. Overall, blur arises from the smoothing of high-frequency information; therefore, the key to this task is generating accurate and detailed high-frequency content. Traditional image deblurring methods view this task as an inverse filtering problem and use the Lucy–Richardson algorithm to perform deconvolution when the blur kernel is known; when the blur kernel is unknown, due to the ill-posed nature of the problem, various regularization conditions need to be imposed to solve it. Considering the powerful generative ability of diffusion models, many scholars have used them to solve the deblurring problem in recent years.
J. Whang et al. [62] first applied the diffusion model to the field of image deblurring and proposed the Denoising Probabilistic Model for Deblurring (DeblurDPM), which employs two channels to accomplish the deblurring task. The first channel inputs the blurred image into a CNN-based deblurring model to obtain an initial clear image. Next, residual features are computed between the initial clear image and the blurred input, and these residual features, along with sampled noise, are used as inputs to the second-channel diffusion denoiser to obtain a more realistic refined residual image; finally, the refined residual image is added to the initial clear image to obtain the recovered clear image. This method pioneers the use of a diffusion model for image deblurring, significantly improving the perceptual quality of the results compared with traditional methods or CNN- and GAN-based approaches, and laying a solid foundation for subsequent deblurring work.
How to improve the generalization ability of the model for various complex causes of blurring is the key element to improve the quality of deblurred images. Therefore, M. Ren et al. [63] proposed a multi-scale structural conditional guidance mechanism, which first extracts multi-scale structural information for the blurred image and inputs the structural information of each scale correspondingly into the intermediate layer of each scale of the diffusion model to help the model learn to recover the structure of the image in a better way. In addition, in order to improve the generalization ability of the model, the model is trained using blurred image data from multiple domains, which improves the robustness of the deblurring model. The model solves the limitation that only one specific model can be used to solve the problem for a specific blurring cause in the field of deblurring, and provides a new idea to solve the problem in this field.
While good experimental results are often achieved for the non-blind image deblurring problem, performance on the blind inverse problem still needs to be explored. For this reason, H. Chung et al. [44] proposed Blind Diffusion Posterior Sampling (BlindDPS) to deal with the blind inverse problem and applied it to the image deblurring task. The model constructs prior diffusion models for both the blurred image and the blur kernel that causes the blurring, and performs parallel reverse diffusion on both simultaneously to generate the corresponding deblurred image and a more accurate blur kernel. The process guides the joint optimization of the two through an intermediate-stage gradient (the gradient of the residual between the measurements and the deblurred image), which ultimately yields an accurate estimate of both the clear image and the blur kernel. The method breaks through the bottleneck of the blind inverse problem by applying the diffusion model to its solution, providing a new method and solution for the field of image restoration. More diffusion model-based image deblurring methods are shown in Table 2.

4.2.2. Image Inpainting

Image inpainting involves completing and filling in the missing parts of a damaged image so that the overall image achieves consistency in high-frequency texture and unity in structure. The morphology of the missing region is mainly categorized into random occlusion and regular occlusion. Traditional image inpainting methods are mainly based on texture synthesis or sparse representation, but they often suffer from poor inpainting results, hard and obvious boundaries, and dependence on the shape of the missing regions. With the development of generative models, and especially the rise of diffusion models, image inpainting tasks have begun to utilize diffusion models to understand and learn the semantic features of the image. This has led to a significant improvement in inpainting results, greatly reducing sensitivity to the shape of the missing region and enabling various types of inpainting tasks to be handled with high fidelity.
A. Lugmayr et al. [54], observing the excellent performance of diffusion models in various fields, proposed the RePaint model, the first attempt to use a diffusion model for inpainting randomly occluded regions, and achieved better restoration results. RePaint divides the image into two parts, $x^{\mathrm{known}}$ and $x^{\mathrm{unknown}}$, and processes them separately. First, a pre-trained unconditionally guided diffusion model is used to sample the missing region $x_{t-1}^{\mathrm{unknown}}$ from the noisy image $x_t$; then, $t-1$ steps of noise are added to the known part $x^{\mathrm{known}}$ of the incomplete image to obtain $x_{t-1}^{\mathrm{known}}$. Combining $x_{t-1}^{\mathrm{unknown}}$ and $x_{t-1}^{\mathrm{known}}$ yields the output of step $t-1$, and reverse diffusion sampling is then performed on this output, iterating the above steps until the missing image is finally completed. This work treats the known and missing parts of the image separately, so that the model focuses on the missing regions, better learns the features of the known and unknown regions, and can more accurately inpaint the image content, giving the restored image a high degree of fidelity. The RePaint model structure is shown in Figure 7.
However, when diffusion-based models are used for image inpainting, the generated content is often inconsistent with the observed image, for example, mismatched shapes and incongruous structures. For this reason, G. Zhang et al. [66] proposed the Coherent Inpainting (CoPaint) model to address the consistency of the image completion problem. To achieve this, an approximate posterior probability distribution of the noisy image $\tilde{X}_T$ under the constraint of the observation area is defined as shown in Equation (21), as follows:
$$\log p_\theta(\tilde{X}_T \mid C) = \log p_\theta(\tilde{X}_T) + \log p_\theta\big(r(\tilde{X}_0) = s_0 \mid \tilde{X}_T\big) + C. \tag{21}$$
The first term on the right-hand side represents the prior probability distribution of the noisy image $\tilde{X}_T$, and the second term represents the approximate conditional probability that the observation region $r(\tilde{X}_0)$ of the final generated image $\tilde{X}_0$ equals the known observation region $s_0$, given the noisy image $\tilde{X}_T$. By maximizing Equation (21), a complete image $\tilde{X}_0$ consistent with the observation region can be generated. This method effectively ensures consistency between the generated content and the observed part, and improves the quality and realism of the generated image.
In addition, many works deal with image inpainting by using the diffusion model to solve the inverse problem. H. Chung et al. [67] proposed the Come-Closer-Diffuse-Faster (CCDF) model to accelerate the solution of the inverse problem with a conditional diffusion model. CCDF compresses the image into a mapping space by applying a non-expansive mapping and performs conditional diffusion in that space. Through these two steps, the length of the diffusion process can be effectively shortened, improving the efficiency of solving the problem while ensuring the quality of image restoration. On this basis, because CCDF is susceptible to mode collapse, J. Song et al. [43] proposed the pseudoinverse-guided diffusion model (ΠGDM) to further improve the efficiency of diffusion models for image restoration and enhance model stability through multimodal contrastive learning and diverse noise scheduling. The model uses the first-order term of a Taylor expansion as the pseudoinverse approximation of the degradation operation, avoiding the complexity of computing the pseudoinverse. The amplitude of denoising in the next step is adaptively adjusted by calculating the noise error between the noisy image of the current step and the generated image, which greatly reduces the time required to solve the problem and improves the stability of model training.
Table 3 provides a comprehensive overview of diffusion model-based image inpainting approaches. In the image inpainting task, the diffusion model plays a great role in effectively improving the realism of the recovered image and its consistency with the observed data. In addition, researchers have used various methods to improve generation efficiency, enabling diffusion models to play an even greater role in image inpainting.

4.2.3. Image Deraining, Desnowing, and Dehazing

Image deraining, desnowing, and dehazing are usually regarded as preprocessing steps in target recognition, semantic segmentation, and other related fields, so the effectiveness of this type of image restoration task is crucial. In general, this type of degraded image is caused by periodic high-frequency interference in the frequency domain, resulting in degradation artifacts. Conventional image dehazing commonly relies on dark channel prior algorithms and atmospheric scattering models, but both depend on weather conditions and suffer from severe loss of detail. Deraining and desnowing methods often decompose a degraded image into a clean image and a raindrop or snowflake layer, and then estimate the distribution of that layer to restore the clean image; however, they require an accurate estimation of the raindrop or snowflake features, which results in poor generalization and slow computation. In addition, real-world deraining, desnowing, and defogging remain difficult due to the lack of paired datasets for training. With the rapid development of diffusion models in recent years, many scholars have utilized them for this task, greatly improving the quality of image restoration.
O. Özdenizci et al. [70] found that many image deraining, desnowing, and defogging methods can only handle fixed-size image inputs, whereas real restoration problems involve images of various sizes; they therefore proposed patch-based overlapping-block noise estimation to guide the diffusion model toward restoring images of different sizes. The model first splits the input degraded image into multiple overlapping blocks and uses each block to train the DDPM. Inference is then performed with the DDPM on each block to recover clean image blocks, and finally the recovered blocks are stitched together to generate the final restored image. The method offers a generalized solution for degraded images of different sizes and, compared with processing a complete large-size input image, processes the multiple overlapping rain, snow, and fog input blocks faster. However, the method still does not offer a new solution to the challenge posed by the lack of paired training sets in the real world.
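As a rough sketch of the overlapping-block idea (our illustration under assumed names, not the code of [70]), at each reverse step the per-block noise predictions of a patch-trained denoiser can be averaged wherever blocks overlap, yielding a smooth full-resolution noise estimate:

```python
# Rough sketch of overlapping-block noise estimation (illustrative names, not the
# implementation of [70]). `eps_model` is a denoiser trained on fixed-size patches;
# the image height and width are assumed to be at least `patch` pixels.
import torch

def patch_noise_estimate(x_t, t, eps_model, patch=64, stride=48):
    _, _, h, w = x_t.shape
    eps_sum = torch.zeros_like(x_t)
    count = torch.zeros_like(x_t)
    # Overlapping top-left corners, always including the right/bottom borders.
    ys = sorted(set(list(range(0, h - patch + 1, stride)) + [h - patch]))
    xs = sorted(set(list(range(0, w - patch + 1, stride)) + [w - patch]))
    for y in ys:
        for x in xs:
            block = x_t[:, :, y:y + patch, x:x + patch]
            eps_sum[:, :, y:y + patch, x:x + patch] += eps_model(block, t)
            count[:, :, y:y + patch, x:x + patch] += 1.0
    # Average the predictions wherever blocks overlap.
    return eps_sum / count
```

Using such a smoothed estimate in each reverse step aims to let a patch-trained model restore images of arbitrary size without visible seams between blocks.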
M. Wei et al. [52] proposed the RainDiffusion model to solve the real-world image deraining problem. The model takes a rainy dataset and a clean dataset as inputs, utilizes two generators to convert, respectively, clean images into rainy images and rainy images into clean images, and uses the image pairs generated by this cycle-consistency architecture to train the Degradation-Conditioned Diffusion Model (DCDM). DCDM adapts to different types of rainy-day degradation, such as rain streaks, raindrops, and rain fog, by introducing degradation-conditioned images, and learns the different degradation problems through the reverse process of the diffusion model, thus effectively removing all kinds of rain marks. RainDiffusion employs a cycle-consistency loss so that the model learns a stable mapping between the generators, which greatly mitigates the problem of false or inconsistent restoration caused by the lack of paired data. In the testing phase, a rainy-day image is input, and the DCDM extracts the degradation features through a Degradation-Guided Hypernetwork (DG-Hyper), generates a vector containing the degradation information, and parameterizes this vector to adapt to different rainy-weather scenarios. A multi-scale noise estimator is also introduced in the DCDM, which takes the input image and the noisy image of the current denoising step as inputs to predict noise estimates at different scales, so that the model restores raindrops of different scales better. Finally, the noise estimates at different scales are fused into the final noise map, which is used in the reverse diffusion step to obtain the denoised image of the next step; these steps are iterated until a clean image is generated. This work effectively addresses the difficulties caused by the lack of a training set and the complexity of real-scene images through the cycle-consistency architecture, providing a new solution for subsequent work. The RainDiffusion model structure is shown in Figure 8.
In addition, Z. Luo et al. [71] proposed the IR-SDE (Image Restoration Stochastic Differential Equation) model to address the poor generalization in real-world rain-degraded image restoration by using a multi-stage iterative approach based on stochastic differential equations. IR-SDE uses mean-reverting stochastic differential equations (SDEs) to model the image degradation process. Since the parameters of the SDE can be learned automatically during training, the model does not need any a priori knowledge, which improves its generalization ability and brings it closer to real-world application scenarios. In addition, the authors found that training instability often occurs when training with complex degraded images; the model solves this by maximizing the likelihood of the optimal reverse trajectory, ensuring more stable training and improving the restoration results. The method verifies its generality and effectiveness on a variety of tasks, and reduces the complexity of model training as well as the consumption of computational resources by simulating the forward and reverse diffusion processes with SDEs, without requiring additional a priori guidance, providing new ideas for future work.
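For reference, a mean-reverting SDE of the kind used in [71] can be written in the following general form (notation simplified here), where $\mu$ denotes the degraded image toward which the state drifts, $\theta_t$ and $\sigma_t$ are time-dependent coefficients, and the second equation is the corresponding reverse-time SDE whose score term $\nabla_x \log p_t(x)$ is approximated by the trained network:
$$\mathrm{d}x = \theta_t(\mu - x)\,\mathrm{d}t + \sigma_t\,\mathrm{d}w, \qquad \mathrm{d}x = \big[\theta_t(\mu - x) - \sigma_t^2\,\nabla_x \log p_t(x)\big]\,\mathrm{d}t + \sigma_t\,\mathrm{d}\bar{w}.$$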
Following the above review, Table 4 comprehensively summarizes diffusion model-based image deraining, desnowing, and defogging methods. The use of diffusion models has brought great progress to these tasks and lays a solid foundation for subsequent tasks such as target detection and image semantics.

4.2.4. Image Denoising

According to the relationship between the image and the noise, image noise is generally categorized into two types: additive noise and multiplicative noise. (1) Additive noise: the noise is independent of the original image and generally arises from channel interference during transmission, leading to image degradation. (2) Multiplicative noise: the noise is correlated with the original image. According to the probability distribution of the noise, there are Gaussian noise, Poisson noise, Rayleigh noise, and other forms. Generally, this type of degraded image exhibits degradation artifacts due to periodic high-frequency interference in the frequency domain.
Since the sampling process of a diffusion model is itself a progressive denoising process, diffusion models are naturally suited to the image denoising task.
Y. Xie et al. [73] observed that the reverse diffusion process of the diffusion model resembles the removal of Gaussian noise and, therefore, proposed a method for applying the diffusion model to different noise types by replacing the sampling process of reverse diffusion with posterior distributions derived for each task, so that the noise sampled in reverse diffusion is consistent with the task noise. Extensive experiments with Gaussian, Gamma, and Poisson noise show that the results retain comparable image detail while greatly reducing the number of model iterations. The method provides a new solution in the field of image denoising without significant resource consumption, but no specific solution is given for processing complex mixed blur types. For this reason, B. T. Feng et al. [33] used a score-based diffusion model to deal with the inverse problem and applied it to image denoising. The model uses score matching to train the diffusion model so that it learns prior knowledge of the data, applies the trained model to the solution of the inverse problem, and guides the optimization process through the score function, thereby achieving image denoising.
Both of the above methods demonstrate the excellent performance of the diffusion model in dealing with the noisy image problem and provide new thinking for the development of this field.
Furthermore, Yue et al. [74] proposed a joint conditional diffusion framework that effectively addresses the limited restoration performance of traditional methods in mixed degradation scenarios by uniformly modeling the complex interactions of various image degradation types, such as noise, blur, and low light. This method designs a hierarchical conditional mechanism that dynamically integrates prior information of different degradation types during the diffusion process, enabling synergistic optimization of composite degradations and demonstrating significant advantages, especially under extreme conditions of high noise and strong blur. To further address the persistent issues of artifact generation and detail loss in image denoising, C. Yue et al. [75] proposed a diffusion bridge control method that significantly improves the accuracy and stability of diffusion models in image restoration tasks by introducing an adaptive regulation mechanism. This method effectively resolves the problems of artifact generation and detail loss, which are common for traditional diffusion models in complex degradation scenarios, through the design of a dynamic weighting strategy and a gradient constraint module. It achieves more refined structure recovery while maintaining the naturalness of the image, providing a more reliable technical approach for the application of diffusion models in specialized fields such as medical imaging and remote sensing.

4.2.5. Low-Light Enhancement

Low-light enhancement aims to improve the visual quality of images captured in poorly lit environments, addressing issues such as low brightness, poor contrast, significant noise, and blurred details to improve human visual perception or the accuracy of computer vision tasks. In general, this task involves a lack of global brightness and low contrast in the overall image due to insufficient low-frequency components. However, some low-light images are often accompanied by degradations caused by high-frequency components, such as blurred edges or noise. Traditional methods mainly include histogram equalization and frequency domain transformation, but these methods generally face the trade-off between complex manual parameter tuning, noise amplification, and detail loss. The emergence of diffusion models has led to significant progress in this task, with methods such as progressive denoising and multi-domain information fusion effectively improving the ability to preserve image details, providing new avenues for this task.
Due to the common issues of low computational efficiency and insufficient restoration of high-frequency details in low-light image enhancement, Jiang et al. [76] proposed the WaveDiff framework. This framework combines wavelet transforms, using Haar wavelets to decompose the image and performing diffusion in the low-frequency domain to reduce computation, and employs a high-frequency restoration module (HFRM) to specifically restore the vertical, horizontal, and diagonal high-frequency sub-bands. It generates normally illuminated low-frequency components with a conditional diffusion strategy and merges them with the high-frequency components for reconstruction. This method increases inference speed by roughly a factor of three, while the HFRM provides excellent edge and texture restoration. Since low-light images are often accompanied by blur, Lv et al. [77] proposed a zero-shot diffusion model guided by Fourier priors. This method separates amplitude and phase using the Fourier transform, utilizes a pre-trained diffusion model to adjust the amplitude so that the brightness becomes natural, and optimizes the phase to restore clear details. The approach can jointly enhance and deblur low-light images without paired data, resolving the poor generalization of traditional methods and producing restored images with more natural brightness and sharper textures. Furthermore, Nguyen et al. [78] observed that high-frequency details in low-light images, such as text, are severely lost, and therefore proposed a multi-scale block diffusion model (DiD) that processes image blocks of different resolutions in stages, combining iterative latent variable refinement (ILVR) to maintain cross-scale exposure consistency and prioritize the preservation of high-order semantic information such as text. The method can recover the high-frequency details needed for text recognition even in extreme low-light conditions, avoiding over-smoothing and making a significant contribution to the field of low-light enhancement. In addition, low-light enhancement suffers from diverse degradation patterns, making it difficult for traditional diffusion models to adaptively learn different degradation types. To address this, Wang et al. [79] proposed a degradation representation learning framework that explicitly models low-light degradations, such as noise and color deviation, during the diffusion process. It separates high- and low-frequency information using wavelet transforms, selectively restoring low-frequency illumination and high-frequency details. This method can dynamically adapt to the degradation characteristics of different low-light scenes, avoiding the limitations of manually designed priors and providing a new solution for existing low-light enhancement tasks.
Finally, we summarize and list the SOTA models for these six tasks, along with their performance on PSNR, SSIM, LPIPS, and FID metrics, highlighting their crucial role in the field of image restoration. See Table 5 for details.
The preceding content categorized and introduced image restoration according to different tasks and degradation types, comprehensively presenting the current state of development for each task. In the next section, we will summarize and compile the commonly used datasets and evaluation metrics for each task.

5. Dataset and Evaluation Metrics

5.1. Datasets

Datasets play a crucial role in model training and testing, and the generalization and robustness of a model rely heavily on the quality of the dataset. In the field of image restoration, the datasets for different restoration tasks differ significantly in content and degradation patterns, so this section summarizes the commonly used datasets for the different restoration tasks, as shown in Table 6. These datasets provide rich content, features, and textures in different application scenarios, providing a strong experimental basis for training diffusion models.
For traditional image super-resolution, the standard training data are usually DIV2K [82] and Flickr2K [83]. However, the performance of diffusion models is inherently limited by the dataset size. Therefore, SR3 uses ImageNet to train the diffusion model for natural image super-resolution and FFHQ [84] to train the model for face super-resolution. During testing, SR3 evaluated natural image super-resolution on ImageNet 1K [85] and face super-resolution on CelebA-HQ [86]. In addition, some works adopt commonly used super-resolution test datasets for evaluation, such as Set5 [87], Set14 [88], Manga109 [89], and Urban100 [90]. For real-world super-resolution, SR3+ [61] provides two versions of training data, where the first version consists of DIV2K, Flickr2K, and OST [91] (OST300), and the second version additionally contains 61 million internal images and DF2K + OST. For evaluation, the test data consist of RealSR [92] and DRealSR [93], which were captured with two DSLR cameras with different lenses.
For image deblurring, the diffusion model-based approach is typically trained using the GoPro [94] training dataset and validated on the GoPro, RealBlur-J [95], REDS [96], and HIDE [97] test datasets.
In the shadow removal task, ISTD [98] and SRD [99] are used for training and evaluation, where ISTD contains 135 scenes with shadow masks.
For the image dehazing task, three typical datasets are used for evaluation, including Haze-4K [100], Dense-Haze [101], and RESIDE [102]. Among them, Haze-4K contains 4000 hazy images; Dense-Haze consists of 33 pairs of outdoor hazy and haze-free images; and RESIDE collects 443,950 real-world training images and 5342 test images. Image desnowing uses three datasets, namely, CSD [103], Snow100K [104], and SRRS [105]. The image deraining datasets cover multiple rain types. For example, Rain800 [106] and DDN-Data [107] contain a large number of synthetic rain streaks. RainDrop [108] collects 1119 pairs of rainy/sunny images with different backgrounds and raindrops. Outdoor-Rain [109] takes rain accumulation into account and provides a more reasonable model for heavy-rain images. SPA-Data [110] constructs large-scale real-world rain streaks by generating clean reference images from multiple consecutive rainy frames.
For more detailed information about the above datasets, refer to Table 6.

5.2. Evaluation Metrics

Two evaluation systems, objective and subjective, have been established in the field of image restoration to measure the performance of algorithms. However, subjective evaluation relies on human perception, which is easily affected by individual differences among evaluators and may be inconsistent with objective results, making it difficult to build a fair subjective evaluation system. Objective metrics, on the other hand, quantitatively evaluate the restoration results and can distinguish pixel-level differences that are imperceptible to the human eye. Therefore, this section elaborates on the commonly used objective evaluation metrics in image restoration, including PSNR [110], SSIM [117], LPIPS [118], DISTS [119], FID [120], KID [121], NIQE [122], and PI [123].
Peak Signal-to-Noise Ratio [110] (PSNR): as the most commonly used metric in image restoration, PSNR aims to measure the pixel-level distance between a noisy image and its corresponding clean image by calculating its Mean Square Error (MSE). The MSE is defined as shown in Equation (22), as follows:
$$MSE = \frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\big[I(i,j) - K(i,j)\big]^2, \tag{22}$$
where $m \times n$ denotes the size of the image, $I$ denotes the original image, and $K$ denotes the noisy image. PSNR can then be defined as shown in Equation (23), as follows:
$$PSNR = 10 \cdot \log_{10}\!\left(\frac{MAX_I^2}{MSE}\right), \tag{23}$$
where $MAX_I$ denotes the maximum possible value of the image pixels.
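As a minimal illustration, Equations (22) and (23) translate directly into a few lines of NumPy (assuming 8-bit images, so $MAX_I = 255$):

```python
# A minimal NumPy implementation of Equations (22) and (23) for 8-bit images.
import numpy as np

def psnr(original: np.ndarray, restored: np.ndarray, max_val: float = 255.0) -> float:
    mse = np.mean((original.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:                      # identical images: PSNR is infinite
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```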
SSIM [117] (Structural Similarity Index) is also a traditional image quality assessment (IQA) metric, designed to better match the human visual perception system. In contrast with PSNR, it compares the similarity between the distorted and clean images in three aspects: luminance, contrast, and structure. As a further improvement, multi-scale information was introduced into SSIM, giving MS-SSIM. SSIM is fast to compute compared with learning-based IQA metrics, but is still far from the level of human perception. The SSIM calculation formula is shown below:
$$SSIM(x, y) = [l(x, y)]^{\alpha}\,[c(x, y)]^{\beta}\,[s(x, y)]^{\gamma}, \tag{24}$$
where $l(x,y) = \frac{2\mu_x\mu_y + c_1}{\mu_x^2 + \mu_y^2 + c_1}$ represents the luminance comparison of the two images, $c(x,y) = \frac{2\sigma_x\sigma_y + c_2}{\sigma_x^2 + \sigma_y^2 + c_2}$ is the contrast comparison, and $s(x,y) = \frac{\sigma_{xy} + c_3}{\sigma_x\sigma_y + c_3}$ is the structural comparison. $\mu_x$ and $\mu_y$ represent the means of $x$ and $y$, respectively; $\sigma_x$ and $\sigma_y$ represent the standard deviations of $x$ and $y$, respectively; $\sigma_{xy}$ represents the covariance of $x$ and $y$; and $c_1$, $c_2$, and $c_3$ are constants.
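A simplified, single-window implementation of Equation (24) with $\alpha = \beta = \gamma = 1$ is sketched below (global statistics only; practical SSIM implementations use a sliding Gaussian window, and $c_3 = c_2/2$ is the conventional choice):

```python
# Simplified single-window SSIM (alpha = beta = gamma = 1, global statistics,
# typical default constants); practical implementations use a sliding Gaussian window.
import numpy as np

def ssim_global(x: np.ndarray, y: np.ndarray, max_val: float = 255.0) -> float:
    x, y = x.astype(np.float64), y.astype(np.float64)
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    c3 = c2 / 2.0                      # conventional simplification
    mu_x, mu_y = x.mean(), y.mean()
    sig_x, sig_y = x.std(), y.std()
    sig_xy = ((x - mu_x) * (y - mu_y)).mean()
    l = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)      # luminance
    c = (2 * sig_x * sig_y + c2) / (sig_x ** 2 + sig_y ** 2 + c2)  # contrast
    s = (sig_xy + c3) / (sig_x * sig_y + c3)                       # structure
    return l * c * s
```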
To measure perceptual similarity, researchers proposed the LPIPS [118] (Learned Perceptual Image Patch Similarity) metric, which simulates human subjective judgment. LPIPS is a widely used learning-based IQA metric for perceptually oriented image restoration tasks. Instead of relying on image-level statistics, LPIPS uses a pre-trained AlexNet as a feature extractor to extract high-level semantic features from the images, calculates the distance between the generated and real images in the feature space, and learns linear weighting layers calibrated to human perception, thereby simulating perceptual differences in the human visual system. A lower LPIPS value indicates that the two images are more similar in perceptual space.
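In practice, LPIPS is usually computed with the open-source `lpips` package; the snippet below is a usage sketch assuming that package is installed and that inputs are RGB tensors scaled to $[-1, 1]$:

```python
# Usage sketch for the open-source `lpips` package (pip install lpips); inputs are
# RGB tensors in the range [-1, 1] with shape (N, 3, H, W). The tensors below are
# random placeholders for illustration.
import torch
import lpips

loss_fn = lpips.LPIPS(net="alex")                 # AlexNet backbone, as described above
img_restored = torch.rand(1, 3, 256, 256) * 2 - 1
img_reference = torch.rand(1, 3, 256, 256) * 2 - 1
distance = loss_fn(img_restored, img_reference)   # lower = perceptually more similar
print(distance.item())
```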
DISTS [119] (Deep Image Structure and Texture Similarity) is built on the observation that the texture similarity and structural similarity of two images can be measured, respectively, by the means and correlations of their VGG features. Based on this finding, the work applies an SSIM-like distance metric to texture and structural similarity in the feature space.
To measure the quality of generated images from the perspective of data distribution matching, researchers proposed FID [120] (Fréchet Inception Distance). FID is widely used to measure the fidelity and diversity of generated images and is an improved version of the Inception Score (IS). In contrast with IS, which uses no real-world reference images, FID fits multivariate Gaussian distributions to the features of an Inception coding layer for both the generated and the reference images. It evaluates the global statistical similarity between generated and ground-truth images by comparing the distance between their distributions in the feature space.
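Once Inception features have been extracted for the two image sets, FID reduces to the Fréchet distance between two fitted Gaussians, $\|\mu_1-\mu_2\|_2^2 + \mathrm{Tr}\big(\Sigma_1+\Sigma_2-2(\Sigma_1\Sigma_2)^{1/2}\big)$; a sketch of this final step is shown below (feature extraction omitted; `feats_real` and `feats_fake` are assumed $N \times D$ arrays):

```python
# Fréchet distance between Gaussians fitted to Inception features (feature
# extraction omitted; feats_real and feats_fake are assumed N x D arrays).
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real        # drop tiny imaginary parts from sqrtm
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```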
Because PSNR calculates numerical differences between images using mean squared error, and SSIM improves upon PSNR by considering luminance, contrast, and structural similarity but remains limited to local window calculations, they fail to adequately reflect human perception. The human visual system exhibits non-linear sensitivity to pixel changes; thus these metrics are unable to perform semantic understanding from a global image perspective. In contrast, LPIPS utilizes a pre-trained AlexNet to extract deep semantic features from images and calculates the distance in the feature space, thereby more closely approximating human image perception. FID, by using the Inception-v3 network to extract features and calculate the distribution distance between generated and real images in the feature space, assesses the similarity of the overall data distribution. It reflects whether generated images fall on the real data manifold while also evaluating the overall performance of the generative model. Therefore, LPIPS and FID are more advantageous than PSNR and SSIM for tasks involving measuring perceived image quality.
KID [121] (Kernel Inception Distance) uses the same Inception features as FID for quality assessment but adopts a different distance metric, the Maximum Mean Discrepancy (MMD) computed with a polynomial kernel. KID is more stable than FID even with small sample sizes.
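The quantity underlying KID is an unbiased estimate of the squared MMD with a polynomial kernel; a sketch is given below (feature extraction and the subset averaging used in practice are omitted, and the kernel $k(a,b) = (a^\top b/D + 1)^3$ is the commonly used default):

```python
# Unbiased polynomial-kernel MMD^2 between two Inception feature sets, the core of
# KID (feature extraction and subset averaging omitted; feats_real and feats_fake
# are assumed N x D arrays).
import numpy as np

def poly_kernel(a: np.ndarray, b: np.ndarray, degree: int = 3, coef0: float = 1.0) -> np.ndarray:
    d = a.shape[1]
    return (a @ b.T / d + coef0) ** degree

def kid_mmd2(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    k_rr = poly_kernel(feats_real, feats_real)
    k_ff = poly_kernel(feats_fake, feats_fake)
    k_rf = poly_kernel(feats_real, feats_fake)
    m, n = len(feats_real), len(feats_fake)
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))   # exclude the diagonal
    term_ff = (k_ff.sum() - np.trace(k_ff)) / (n * (n - 1))
    return float(term_rr + term_ff - 2.0 * k_rf.mean())
```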
NIQE [122] (Natural Image Quality Evaluator) is an early reference-free/blind image quality assessment metric; its quality score is computed as the distance between the natural scene statistics of the distorted image and those of natural images, modeled with a multivariate Gaussian model (MGM).
PI [123] (Perceptual Index) is a metric proposed in the PIRM super-resolution perceptual challenge to assess the perceptual quality of super-resolution images. It is calculated as shown in Equation (25), as follows:
$$PI = \frac{1}{2}\big((10 - \mathrm{Ma}) + NIQE\big), \tag{25}$$
where $\mathrm{Ma}$ is a reference-free IQA metric for super-resolution.
In diffusion model-based image restoration tasks, the above evaluation metrics can accurately and comprehensively assess the quality of image restoration and clearly quantify model performance, providing standardized quantitative references for subsequent work and facilitating the optimization and tuning of future methods.

6. Conclusions

A diffusion model, as a generative model, is centered on gradually adding noise to the data by simulating a diffusion process and subsequently learning to invert this process to construct the desired data samples from noise. Relying on powerful learning capabilities and a data-driven paradigm, the diffusion model has made significant progress in the field of image restoration: (1) Powerful generative capability: even images with particularly severe distortion can be restored well. (2) Strong generalization ability: diffusion models achieve good results on multi-type and hybrid image restoration problems and are widely used in many fields. (3) Clear details and high fidelity: for a variety of low-quality images, the diffusion model can restore fine texture details, realizing natural and realistic image restoration. (4) Diversity of results: the stochastic sampling process allows the generation of multiple plausible restoration results.
Although the diffusion model has been applied to a wide variety of image restoration tasks, there are still many problems to be solved through further research, and these problems will become the focus of future work. Based on the current problems and difficulties in image restoration using diffusion models, the following outlook is offered for the future of this field:
(1)
Efficient sampling
Due to the theoretical design of the diffusion model, the number of sampling steps is a key factor determining restoration quality. Taking fewer steps degrades the fidelity of the generated image, while increasing the number of sampling steps causes a sharp increase in computation and restoration time; therefore, improving sampling efficiency is a key issue for diffusion models.
Existing work models the diffusion process through non-Markovian chains such as DDIM, designs efficient ODE solvers, or uses knowledge distillation to reduce the number of sampling steps. These improvements have drastically reduced the sampling steps of diffusion models to 10–20 steps, but there is still room for improvement.
How to improve the structure of the diffusion model, reduce the number of parameters and iterations without significantly degrading the quality of the generated image, and thereby improve sampling efficiency are potential research directions.
(2)
Model Compression
Model size is another important factor that affects the computational cost and the practicality of diffusion models in real applications. DDPM and SR3 have 113.7 M and 155.3 M parameters, respectively, far exceeding the parameter counts of CNN- or Transformer-based image restoration models. To address this problem, existing work focuses on model compression in the following ways.
  • Model pruning: removing unimportant parameters by estimating the importance score of each parameter.
  • Knowledge distillation: transferring the knowledge of a complex teacher model to a compact student model. For example, Salimans et al. [124] proposed a strategy of progressive distillation, which iteratively compresses the knowledge of the original model into a student model with fewer sampling steps through multiple rounds of teacher–student training. Each round of distillation halves the sampling steps by aligning the outputs and dynamically adjusts the noise schedule to maintain generation quality, effectively addressing the issue of slow sampling speeds caused by multi-step iteration. Addressing the limitation that traditional diffusion model knowledge distillation methods rely on the original training data, Xiang et al. [125] proposed a data-free knowledge distillation framework (DKDM). This method utilizes the pre-trained diffusion model itself to generate synthetic training samples, designs a noise-conditional generation strategy, and incorporates adversarial feature matching with multi-scale feature consistency constraints, achieving efficient distillation for student models of arbitrary architectures.
  • Low-rank decomposition: decomposing a tensor with a huge number of parameters into multiple low-rank tensors. All of the above methods approach model compression from the structure of the diffusion model itself, but very little work has been conducted on compression strategies tailored to diffusion-based image restoration, so this direction urgently needs further study.
  • Model lightweighting: two common techniques are used in the lightweighting process: half-precision floating-point quantization (FP16) and 8-bit integer quantization (INT8). FP16 converts the model weights and activation values from 32-bit single-precision floating point (FP32) to 16-bit half-precision floating point with almost no loss of precision; it is suitable for most network architectures and effectively increases the model's running speed by approximately 2×. INT8 maps FP32 weight values to the INT8 range (−128 to 127), significantly reducing the memory footprint and increasing computation speed by 3–4×. However, it often exhibits a significant loss of precision for sensitive tasks such as image super-resolution, so it cannot fully guarantee image restoration quality.
  • Conditional Diffusion Model Sparsification: This method significantly reduces the model size and improves inference speed while maintaining conditional control capabilities by removing redundant weights, entire neurons, or attention heads, while striving to maintain generation quality.
(3)
Model Structure
Most diffusion model-based image restoration methods are designed around the U-Net network structure, which is generally improved from the perspective of image generation, covering the image pixel space, residual space, and latent space. In the image pixel space, the U-Net architecture is improved by introducing cross-attention modules, group normalization, multi-head attention, and position encoding. Image restoration in latent space is performed by designing encoder and decoder structures to reduce the computational cost while ensuring the quality of the restored image. Given that Transformer has demonstrated its ability to model long-range dependencies in recent years, U-ViT [126] attempted to build a denoising network based on Transformer. Similar to ViT, the model uses both the time condition and noisy image blocks as tokens and feeds them into the Transformer for processing. In addition, U-ViT removes the downsampling and upsampling operations from U-Net and adds long skip connections to learn low-level features and improve the quality of the generated images. Furthermore, combining the diffusion model with generative adversarial networks and autoregressive models can be considered, so as to merge the advantages of different networks and generate high-quality images. The above methods show that there is room for improvement in the denoising network of the diffusion model, and better network structures can continue to be explored in the future.
(4)
Evaluation Index
The evaluation of image restoration is generally divided into objective evaluation and subjective evaluation, and objective evaluation can be further divided into qualitative and quantitative evaluation. The advantage of subjective evaluation is that it is completely oriented toward the purpose of evaluation, and restoration methods with high subjective ratings tend to be more practical; however, it is difficult to quantify and compare. Objective evaluation is based on specific metrics and datasets, which makes it precise and quantifiable, but it is not directly oriented toward the evaluation purpose, and inconsistencies between metrics and the evaluation goal easily arise. Metrics such as Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM), commonly used by current researchers, often yield high scores while the actual restoration results remain poor, so designing metrics that better align with perceived restoration quality remains an open problem.
(5)
Multi-Modal Fusion
With the emergence of multimodal technology, many works have used multimodal content as conditional information to guide image restoration tasks. Beyond this, future work is expected to fuse multimodal information with image information inside the diffusion model to assist restoration. For example, text and other modalities can be combined with image information through attention mechanisms, multimodal fusion, and multimodal alignment operations, and the fused features can then be fed into the diffusion model, allowing it to understand the image content comprehensively and complete higher-quality image restoration.
(6)
Specific scenes
Although diffusion models have made significant breakthroughs in the field of image restoration, they mainly show strong advantages in general image restoration tasks; there are fewer studies on scene-specific image restoration, and existing methods may not be able to handle restoration problems in special scenes effectively. For example, in the restoration of cultural heritage images, relics have unique texture structures and colors, and factors such as historical background also need to be considered during restoration; existing models struggle to learn such background information, so the restoration effect is very limited. Therefore, subsequent work can carry out model training and development for specific scenes, effectively addressing the pain points of scene-specific image restoration.
In summary, with the continuous innovation and development of diffusion models, these methods are expected to solve the remaining problems in the field of image restoration and to provide highly generalizable, robust, and accurate restoration results for the various tasks in this field.

Author Contributions

Conceptualization: H.W. and J.L.; Methodology: H.W.; Validation: Y.L. and H.Z.; Formal Analysis: Y.L.; Data Curation: Y.L. and H.Z.; Resources: J.L.; Writing—Original Draft Preparation: H.W.; Writing—Review and Editing: H.W.; Visualization: H.W.; Supervision: J.L.; Project Administration: J.L.; Funding Acquisition: J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Research Project of Jilin University of Finance and Economics (NO. 2023YB002).

Data Availability Statement

The original contributions presented in this study are included in the article, and further inquiries can be directed to the corresponding author.

Conflicts of Interest

All authors declare no conflicts of interest.

References

  1. Ouyang, X.; Chen, Y.; Zhu, K.; Agam, G. Image restoration refinement with Uformer GAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 5919–5928. [Google Scholar] [CrossRef]
  2. Rama, P.; Pandiaraj, A.; Angayarkanni, V.; Prakash, Y.; Jagadeesh, S. Advancement in Image Restoration Through GAN-based Approach. In Proceedings of the 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kamand, India, 24–28 June 2024; pp. 1–7. [Google Scholar] [CrossRef]
  3. Masmoudi, Y.; Ramzan, M.; Khan, S.A.; Habib, M. Optimal feature extraction and ulcer classification from WCE image data using deep learning. Soft Comput. 2022, 26, 7979–7992. [Google Scholar] [CrossRef]
  4. Habib, M.; Ramzan, M.; Khan, S.A. A deep learning and handcrafted based computationally intelligent technique for effective COVID-19 detection from X-ray/CT-scan imaging. J. Grid Comput. 2022, 20, 23. [Google Scholar] [CrossRef]
  5. Muslim, H.S.M.; Khan, S.A.; Hussain, S.; Jamal, A.; Qasim, H.S.A. A knowledge-based image enhancement and denoising approach. Comput. Math. Organ. Theory 2019, 25, 108–121. [Google Scholar] [CrossRef]
  6. Li, M.; Fu, Y.; Zhang, T.; Wen, G. Supervise-assisted self-supervised deep-learning method for hyperspectral image restoration. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 7331–7344. [Google Scholar] [CrossRef]
  7. Monroy, B.; Bacca, J.; Tachella, J. Generalized recorrupted-to-recorrupted: Self-supervised learning beyond gaussian noise. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 28155–28164. [Google Scholar]
  8. Li, X.; Ren, Y.; Jin, X.; Lan, C.; Wang, X.; Zeng, W.; Wang, X.; Chen, Z. Diffusion Models for Image Restoration and Enhancement–A Comprehensive Survey. arXiv 2023, arXiv:2308.09388. [Google Scholar] [CrossRef]
  9. Luo, Z.; Gustafsson, F.K.; Zhao, Z.; Sjölund, J.; Schön, T.B. Taming diffusion models for image restoration: A review. arXiv 2024, arXiv:2409.10353. [Google Scholar] [CrossRef]
  10. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  11. Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 7–9 July 2015; pp. 2256–2265. Available online: https://proceedings.mlr.press/v37/sohl-dickstein15.html (accessed on 10 May 2025).
  12. Song, Y.; Ermon, S. Generative modeling by estimating gradients of the data distribution. Adv. Neural Inf. Process. Syst. 2019, 32, 6392. Available online: https://papers.nips.cc/paper/2019/file/3001ef257407d5a371a96dcd947c7d93-MetaReview.html (accessed on 10 May 2025).
  13. Song, Y.; Ermon, S. Improved techniques for training score-based generative models. Adv. Neural Inf. Process. Syst. 2020, 33, 12438–12448. [Google Scholar]
  14. Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. arXiv 2020, arXiv:2011.13456. [Google Scholar] [CrossRef]
  15. Song, Y.; Durkan, C.; Murray, I.; Ermon, S. Maximum likelihood training of score-based diffusion models. Adv. Neural Inf. Process. Syst. 2021, 34, 1415–1428. [Google Scholar]
  16. Rezende, D.; Mohamed, S. Variational inference with normalizing flows. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 7–9 July 2015; pp. 1530–1538. Available online: https://proceedings.mlr.press/v37/rezende15.html (accessed on 10 May 2025).
  17. Welling, M.; Teh, Y.W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), Citeseer, Bellevue, WA, USA, 28 June–2 July 2011; pp. 681–688. Available online: https://icml.cc/virtual/2021/test-of-time/11808 (accessed on 10 May 2025).
  18. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar] [CrossRef]
  19. Nichol, A.Q.; Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8162–8171. Available online: https://proceedings.mlr.press/v139/nichol21a.html (accessed on 10 May 2025).
  20. Yang, X.; Shih, S.M.; Fu, Y.; Zhao, X.; Ji, S. Your vit is secretly a hybrid discriminative-generative diffusion model. arXiv 2022, arXiv:2208.07791. [Google Scholar] [CrossRef]
  21. Nichol, A.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; McGrew, B.; Sutskever, I.; Chen, M. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv 2021, arXiv:2112.10741. [Google Scholar] [CrossRef]
  22. Lee, K.; Liu, H.; Ryu, M.; Watkins, O.; Du, Y.; Boutilier, C.; Abbeel, P.; Ghavamzadeh, M.; Gu, S.S. Aligning text-to-image models using human feedback. arXiv 2023, arXiv:2302.12192. [Google Scholar] [CrossRef]
  23. Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. arXiv 2020, arXiv:2010.02502. [Google Scholar] [CrossRef]
  24. Lu, C.; Zhou, Y.; Bao, F.; Chen, J.; Li, C.; Zhu, J. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Adv. Neural Inf. Process. Syst. 2022, 35, 5775–5787. [Google Scholar]
  25. Lyu, Z.; Xu, X.; Yang, C.; Lin, D.; Dai, B. Accelerating diffusion models via early stop of the diffusion process. arXiv 2022, arXiv:2205.12524. [Google Scholar] [CrossRef]
  26. Xia, B.; Zhang, Y.; Wang, S.; Wang, Y.; Wu, X.; Tian, Y.; Yang, W.; Van Gool, L. Diffir: Efficient diffusion model for image restoration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 13095–13105. [Google Scholar] [CrossRef]
  27. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. Available online: https://icml.cc/virtual/2021/oral/9194 (accessed on 10 May 2025).
  28. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. 2013. Available online: https://iclr.cc/virtual/2024/test-of-time/21444 (accessed on 10 May 2025).
  29. Saharia, C.; Ho, J.; Chan, W.; Salimans, T.; Fleet, D.J.; Norouzi, M. Image super-resolution via iterative refinement. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4713–4726. [Google Scholar] [CrossRef]
  30. Saharia, C.; Chan, W.; Chang, H.; Lee, C.; Ho, J.; Salimans, T.; Fleet, D.; Norouzi, M. Palette: Image-to-image diffusion models. In Proceedings of the ACM SIGGRAPH 2022 Conference Proceedings, Vancouver, BC, Canada, 7–11 August 2022; pp. 1–10. [Google Scholar] [CrossRef]
  31. Zhang, Y.; Shi, X.; Li, D.; Wang, X.; Wang, J.; Li, H. A unified conditional framework for diffusion-based image restoration. Adv. Neural Inf. Process. Syst. 2023, 36, 49703–49714. [Google Scholar]
  32. Fei, B.; Lyu, Z.; Pan, L.; Zhang, J.; Yang, W.; Luo, T.; Zhang, B.; Dai, B. Generative diffusion prior for unified image restoration and enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 9935–9946. [Google Scholar] [CrossRef]
  33. Feng, B.T.; Smith, J.; Rubinstein, M.; Chang, H.; Bouman, K.L.; Freeman, W.T. Score-based diffusion models as principled priors for inverse imaging. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 10520–10531. [Google Scholar] [CrossRef]
  34. Sun, H.; Bouman, K.L. Deep probabilistic imaging: Uncertainty quantification and multi-modal solution characterization for computational imaging. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 2628–2637. [Google Scholar] [CrossRef]
  35. Dinh, L.; Sohl-Dickstein, J.; Bengio, S. Density estimation using real nvp. arXiv 2016, arXiv:1605.08803. [Google Scholar] [CrossRef]
  36. Niu, A.; Zhang, K.; Pham, T.X.; Sun, J.; Zhu, Y.; Kweon, I.S.; Zhang, Y. Cdpmsr: Conditional diffusion probabilistic models for single image super-resolution. arXiv 2023, arXiv:2302.12831. [Google Scholar] [CrossRef]
  37. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 286–301. [Google Scholar] [CrossRef]
  38. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar] [CrossRef]
  39. Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144. [Google Scholar] [CrossRef]
  40. Gao, S.; Liu, X.; Zeng, B.; Xu, S.; Li, Y.; Luo, X.; Liu, J.; Zhen, X.; Zhang, B. Implicit diffusion models for continuous super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10021–10030. [Google Scholar] [CrossRef]
  41. Guo, L.; Wang, C.; Yang, W.; Huang, S.; Wang, Y.; Pfister, H.; Wen, B. Shadowdiffusion: When degradation prior meets diffusion model for shadow removal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14049–14058. [Google Scholar] [CrossRef]
  42. Chung, H.; Kim, J.; Mccann, M.T.; Klasky, M.L.; Ye, J.C. Diffusion posterior sampling for general noisy inverse problems. arXiv 2022, arXiv:2209.14687. [Google Scholar] [CrossRef]
  43. Song, J.; Vahdat, A.; Mardani, M.; Kautz, J. Pseudoinverse-guided diffusion models for inverse problems. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023; Available online: https://iclr.cc/virtual/2023/poster/11030 (accessed on 10 May 2025).
  44. Chung, H.; Kim, J.; Kim, S.; Ye, J.C. Parallel diffusion models of operator and image for blind inverse problems. In Proceedings of the CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; Volume 1, p. 3. [Google Scholar] [CrossRef]
  45. Murata, N.; Saito, K.; Lai, C.H.; Takida, Y.; Uesaka, T.; Mitsufuji, Y.; Ermon, S. Gibbsddrm: A partially collapsed gibbs sampler for solving blind inverse problems with denoising diffusion restoration. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 25501–25522. Available online: https://icml.cc/virtual/2023/oral/25492 (accessed on 10 May 2025).
  46. Van Dyk, D.A.; Park, T. Partially collapsed Gibbs samplers: Theory and methods. J. Am. Stat. Assoc. 2008, 103, 790–796. [Google Scholar] [CrossRef]
  47. Luo, Z.; Gustafsson, F.K.; Zhao, Z.; Sjölund, J.; Schön, T.B. Refusion: Enabling large-size realistic image restoration with latent-space diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1680–1691. [Google Scholar] [CrossRef]
  48. Kawar, B.; Elad, M.; Ermon, S.; Song, J. Denoising diffusion restoration models. Adv. Neural Inf. Process. Syst. 2022, 35, 23593–23606. [Google Scholar]
  49. Kawar, B.; Song, J.; Ermon, S.; Elad, M. Jpeg artifact correction using denoising diffusion restoration models. arXiv 2022, arXiv:2209.11888. [Google Scholar] [CrossRef]
  50. Wang, Y.; Yu, J.; Zhang, J. Zero-shot image restoration using denoising diffusion null-space model. arXiv 2022, arXiv:2212.00490. [Google Scholar] [CrossRef]
  51. Yang, T.; Ren, P.; Xie, X.; Zhang, L. Synthesizing realistic image restoration training pairs: A diffusion approach. arXiv 2023, arXiv:2303.06994. [Google Scholar] [CrossRef]
  52. Wei, M.; Shen, Y.; Wang, Y.; Xie, H.; Qin, J.; Wang, F.L. Raindiffusion: When unsupervised learning meets diffusion models for real-world image deraining. arXiv 2023, arXiv:2301.09430. [Google Scholar] [CrossRef]
  53. Zhang, S.; Ren, W.; Tan, X.; Wang, Z.J.; Liu, Y.; Zhang, J.; Zhang, X.; Cao, X. Semantic-aware dehazing network with adaptive feature fusion. IEEE Trans. Cybern. 2021, 53, 454–467. [Google Scholar] [CrossRef]
  54. Lugmayr, A.; Danelljan, M.; Romero, A.; Yu, F.; Timofte, R.; Van Gool, L. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11461–11471. [Google Scholar] [CrossRef]
  55. Choi, J.; Kim, S.; Jeong, Y.; Gwon, Y.; Yoon, S. Ilvr: Conditioning method for denoising diffusion probabilistic models. arXiv 2021, arXiv:2108.02938. [Google Scholar] [CrossRef]
  56. Ho, J.; Saharia, C.; Chan, W.; Fleet, D.J.; Norouzi, M.; Salimans, T. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res. 2022, 23, 1–33. [Google Scholar]
  57. Li, H.; Yang, Y.; Chang, M.; Chen, S.; Feng, H.; Xu, Z.; Li, Q.; Chen, Y. Srdiff: Single image super-resolution with diffusion probabilistic models. Neurocomputing 2022, 479, 47–59. [Google Scholar] [CrossRef]
  58. Wang, J.; Yue, Z.; Zhou, S.; Chan, K.C.; Loy, C.C. Exploiting diffusion prior for real-world image super-resolution. Int. J. Comput. Vis. 2024, 132, 5929–5949. [Google Scholar] [CrossRef]
  59. Shang, S.; Shan, Z.; Liu, G.; Wang, L.; Wang, X.; Zhang, Z.; Zhang, J. Resdiff: Combining cnn and diffusion model for image super-resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 8975–8983. [Google Scholar] [CrossRef]
  60. Dos Santos, M.; Laroca, R.; Ribeiro, R.O.; Neves, J.; Proença, H.; Menotti, D. Face super-resolution using stochastic differential equations. In Proceedings of the 2022 35th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Natal, Brazil, 24–27 October 2022; Volume 1, pp. 216–221. [Google Scholar] [CrossRef]
  61. Sahak, H.; Watson, D.; Saharia, C.; Fleet, D. Denoising diffusion probabilistic models for robust image super-resolution in the wild. arXiv 2023, arXiv:2302.07864. [Google Scholar] [CrossRef]
  62. Whang, J.; Delbracio, M.; Talebi, H.; Saharia, C.; Dimakis, A.G.; Milanfar, P. Deblurring via stochastic refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16293–16303. [Google Scholar] [CrossRef]
  63. Ren, M.; Delbracio, M.; Talebi, H.; Gerig, G.; Milanfar, P. Multiscale structure guided diffusion for image deblurring. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 10721–10733. [Google Scholar] [CrossRef]
  64. Delbracio, M.; Milanfar, P. Inversion by direct iteration: An alternative to denoising diffusion for image restoration. arXiv 2023, arXiv:2303.11435. [Google Scholar] [CrossRef]
  65. Kawar, B.; Vaksman, G.; Elad, M. Snips: Solving noisy inverse problems stochastically. Adv. Neural Inf. Process. Syst. 2021, 34, 21757–21769. [Google Scholar]
  66. Zhang, G.; Ji, J.; Zhang, Y.; Yu, M.; Jaakkola, T.S.; Chang, S. Towards Coherent Image Inpainting Using Denoising Diffusion Implicit Models. 2023. Available online: https://icml.cc/virtual/2023/poster/24127 (accessed on 10 May 2025).
  67. Chung, H.; Sim, B.; Ye, J.C. Come-closer-diffuse-faster: Accelerating conditional diffusion models for inverse problems through stochastic contraction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12413–12422. [Google Scholar] [CrossRef]
  68. Fabian, Z.; Tinaz, B.; Soltanolkotabi, M. Diracdiffusion: Denoising and incremental reconstruction with assured data-consistency. Proc. Mach. Learn. Res. 2024, 235, 12754. [Google Scholar]
  69. Mardani, M.; Song, J.; Kautz, J.; Vahdat, A. A variational perspective on solving inverse problems with diffusion models. arXiv 2023, arXiv:2305.04391. [Google Scholar] [CrossRef]
  70. Özdenizci, O.; Legenstein, R. Restoring vision in adverse weather conditions with patch-based denoising diffusion models. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10346–10357. [Google Scholar] [CrossRef]
  71. Luo, Z.; Gustafsson, F.K.; Zhao, Z.; Sjölund, J.; Schön, T.B. Image restoration with mean-reverting stochastic differential equations. arXiv 2023, arXiv:2301.11699. [Google Scholar] [CrossRef]
  72. Chan, M.A.; Young, S.I.; Metzler, C.A. SUD2: Supervision by Denoising Diffusion Models for Image Reconstruction. arXiv 2023, arXiv:2303.09642. [Google Scholar] [CrossRef]
  73. Xie, Y.; Yuan, M.; Dong, B.; Li, Q. Diffusion model for generative image denoising. arXiv 2023, arXiv:2302.02398. [Google Scholar] [CrossRef]
  74. Yue, Y.; Yu, M.; Yang, L.; Liu, T. Joint Conditional Diffusion Model for image restoration with mixed degradations. Neurocomputing 2025, 626, 129512. [Google Scholar] [CrossRef]
  75. Yue, C.; Peng, Z.; Ma, J.; Zhang, D. Enhanced control for diffusion bridge in image restoration. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar] [CrossRef]
  76. Jiang, H.; Luo, A.; Fan, H.; Han, S.; Liu, S. Low-light image enhancement with wavelet-based diffusion models. ACM Trans. Graph. (TOG) 2023, 42, 1–14. [Google Scholar] [CrossRef]
  77. Lv, X.; Zhang, S.; Wang, C.; Zheng, Y.; Zhong, B.; Li, C.; Nie, L. Fourier priors-guided diffusion for zero-shot joint low-light enhancement and deblurring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 25378–25388. [Google Scholar] [CrossRef]
  78. Nguyen, C.M.; Chan, E.R.; Bergman, A.W.; Wetzstein, G. Diffusion in the dark: A diffusion model for low-light text recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 4146–4157. [Google Scholar] [CrossRef]
  79. Wang, T.; Zhang, K.; Zhang, Y.; Luo, W.; Stenger, B.; Lu, T.; Kim, T.K.; Liu, W. LLDiffusion: Learning degradation representations in diffusion models for low-light image enhancement. Pattern Recognit. 2025, 166, 111628. [Google Scholar] [CrossRef]
  80. Luo, Z.; Gustafsson, F.K.; Zhao, Z.; Sjölund, J.; Schön, T.B. Controlling vision-language models for multi-task image restoration. arXiv 2023, arXiv:2310.01018. [Google Scholar] [CrossRef]
  81. Zheng, D.; Wu, X.M.; Yang, S.; Zhang, J.; Hu, J.F.; Zheng, W.S. Selective hourglass mapping for universal image restoration based on diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 25445–25455. [Google Scholar] [CrossRef]
  82. Agustsson, E.; Timofte, R. Ntire 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 126–135. [Google Scholar] [CrossRef]
  83. Wang, Y.; Wang, L.; Yang, J.; An, W.; Guo, Y. Flickr1024: A large-scale dataset for stereo image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar] [CrossRef]
  84. Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 4217–4228. [Google Scholar] [CrossRef]
  85. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  86. Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. Progressive growing of gans for improved quality, stability, and variation. arXiv 2017, arXiv:1710.10196. [Google Scholar] [CrossRef]
  87. Bevilacqua, M.; Roumy, A.; Guillemot, C.; Alberi-Morel, M.L. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proceedings of the British Machine Vision Conference 2012, Surrey, UK, 3–7 September 2012. [Google Scholar] [CrossRef]
  88. Zeyde, R.; Elad, M.; Protter, M. On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces; Springer: Berlin/Heidelberg, Germany, 2010; pp. 711–730. [Google Scholar] [CrossRef]
  89. Matsui, Y.; Ito, K.; Aramaki, Y.; Fujimoto, A.; Ogawa, T.; Yamasaki, T.; Aizawa, K. Sketch-based manga retrieval using manga109 dataset. Multimed. Tools Appl. 2017, 76, 21811–21838. [Google Scholar] [CrossRef]
  90. Huang, J.B.; Singh, A.; Ahuja, N. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5197–5206. [Google Scholar] [CrossRef]
  91. Wang, X.; Yu, K.; Dong, C.; Loy, C.C. Recovering realistic texture in image super-resolution by deep spatial feature transform. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 606–615. [Google Scholar] [CrossRef]
  92. Cai, J.; Zeng, H.; Yong, H.; Cao, Z.; Zhang, L. Toward real-world single image super-resolution: A new benchmark and a new model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 3086–3095. [Google Scholar] [CrossRef]
  93. Wei, P.; Xie, Z.; Lu, H.; Zhan, Z.; Ye, Q.; Zuo, W.; Lin, L. Component divide-and-conquer for real-world image super-resolution. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 101–117. [Google Scholar] [CrossRef]
  94. Nah, S.; Hyun Kim, T.; Mu Lee, K. Deep multi-scale convolutional neural network for dynamic scene deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3883–3891. [Google Scholar] [CrossRef]
  95. Rim, J.; Lee, H.; Won, J.; Cho, S. Real-world blur dataset for learning and benchmarking deblurring algorithms. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 184–201. [Google Scholar] [CrossRef]
  96. Nah, S.; Baik, S.; Hong, S.; Moon, G.; Son, S.; Timofte, R.; Mu Lee, K. Ntire 2019 challenge on video deblurring and super-resolution: Dataset and study. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019. [Google Scholar] [CrossRef]
  97. Shen, Z.; Wang, W.; Lu, X.; Shen, J.; Ling, H.; Xu, T.; Shao, L. Human-aware motion deblurring. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 5572–5581. [Google Scholar] [CrossRef]
  98. Wang, J.; Li, X.; Yang, J. Stacked conditional generative adversarial networks for jointly learning shadow detection and shadow removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1788–1797. [Google Scholar] [CrossRef]
  99. Qu, L.; Tian, J.; He, S.; Tang, Y.; Lau, R.W. Deshadownet: A multi-context embedding deep network for shadow removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4067–4075. [Google Scholar] [CrossRef]
  100. Liu, Y.; Zhu, L.; Pei, S.; Fu, H.; Qin, J.; Zhang, Q.; Wan, L.; Feng, W. From synthetic to real: Image dehazing collaborating with unlabeled real data. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 50–58. [Google Scholar] [CrossRef]
  101. Ancuti, C.O.; Ancuti, C.; Sbert, M.; Timofte, R. Dense-haze: A benchmark for image dehazing with dense-haze and haze-free images. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 1014–1018. [Google Scholar] [CrossRef]
  102. Li, B.; Ren, W.; Fu, D.; Tao, D.; Feng, D.; Zeng, W.; Wang, Z. Benchmarking single-image dehazing and beyond. IEEE Trans. Image Process. 2018, 28, 492–505. [Google Scholar] [CrossRef] [PubMed]
  103. Golodetz, S.; Cavallari, T.; Lord, N.A.; Prisacariu, V.A.; Murray, D.W.; Torr, P.H. Collaborative large-scale dense 3d reconstruction with online inter-agent pose optimisation. IEEE Trans. Vis. Comput. Graph. 2018, 24, 2895–2905. [Google Scholar] [CrossRef]
  104. Liu, Y.F.; Jaw, D.W.; Huang, S.C.; Hwang, J.N. Desnownet: Context-aware deep network for snow removal. IEEE Trans. Image Process. 2018, 27, 3064–3073. [Google Scholar] [CrossRef] [PubMed]
  105. Chen, W.T.; Fang, H.Y.; Ding, J.J.; Tsai, C.C.; Kuo, S.Y. JSTASR: Joint size and transparency-aware snow removal algorithm based on modified partial convolution and veiling effect removal. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 754–770. [Google Scholar] [CrossRef]
  106. Zhang, H.; Sindagi, V.; Patel, V.M. Image de-raining using a conditional generative adversarial network. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 3943–3956. [Google Scholar] [CrossRef]
  107. Fu, X.; Huang, J.; Zeng, D.; Huang, Y.; Ding, X.; Paisley, J. Removing rain from single images via a deep detail network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3855–3863. [Google Scholar] [CrossRef]
  108. Qian, R.; Tan, R.T.; Yang, W.; Su, J.; Liu, J. Attentive generative adversarial network for raindrop removal from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2482–2491. [Google Scholar] [CrossRef]
  109. Li, R.; Cheong, L.F.; Tan, R.T. Heavy rain image restoration: Integrating physics model and conditional adversarial learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 1633–1642. [Google Scholar] [CrossRef]
  110. Wang, T.; Yang, X.; Xu, K.; Chen, S.; Zhang, Q.; Lau, R.W. Spatial attentive single-image deraining with a high quality real rain dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 12270–12279. [Google Scholar] [CrossRef]
  111. Gu, S.; Lugmayr, A.; Danelljan, M.; Fritsche, M.; Lamour, J.; Timofte, R. Div8k: Diverse 8k resolution image dataset. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 3512–3516. [Google Scholar] [CrossRef]
  112. López-Cifuentes, A.; Escudero-Vinolo, M.; Bescós, J.; García-Martín, Á. Semantic-aware scene recognition. Pattern Recognit. 2020, 102, 107256. [Google Scholar] [CrossRef]
  113. Huang, G.B.; Mattar, M.; Berg, T.; Learned-Miller, E. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Proceedings of the Workshop on Faces in ‘Real-Life’ Images: Detection, Alignment, and Recognition, Marseille, France, 17 October 2008; Available online: https://inria.hal.science/inria-00321923/ (accessed on 10 May 2025).
  114. Choi, Y.; Uh, Y.; Yoo, J.; Ha, J.W. Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8188–8197. [Google Scholar] [CrossRef]
  115. Martin, D.; Fowlkes, C.; Tal, D.; Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV 2001), Vancouver, BC, Canada, 7–14 July 2001; Volume 2, pp. 416–423. [Google Scholar] [CrossRef]
  116. Zhang, L.; Wu, X.; Buades, A.; Li, X. Color demosaicking by local directional interpolation and nonlocal adaptive thresholding. J. Electron. Imaging 2011, 20, 023016. [Google Scholar] [CrossRef]
  117. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  118. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar] [CrossRef]
  119. Ding, K.; Ma, K.; Wang, S.; Simoncelli, E.P. Image quality assessment: Unifying structure and texture similarity. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2567–2581. [Google Scholar] [CrossRef]
  120. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Adv. Neural Inf. Process. Syst. 2017, 30, 6629–6640. Available online: https://proceedings.neurips.cc/paper/2017/hash/8a1d694707eb0fefe65871369074926d-Abstract.html (accessed on 10 May 2025).
  121. Bińkowski, M.; Sutherland, D.J.; Arbel, M.; Gretton, A. Demystifying mmd gans. arXiv 2018, arXiv:1801.01401. [Google Scholar] [CrossRef]
  122. Mittal, A.; Soundararajan, R.; Bovik, A.C. Making a “completely blind” image quality analyzer. IEEE Signal Process. Lett. 2012, 20, 209–212. [Google Scholar] [CrossRef]
  123. Blau, Y.; Mechrez, R.; Timofte, R.; Michaeli, T.; Zelnik-Manor, L. The 2018 PIRM challenge on perceptual image super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018. [Google Scholar] [CrossRef]
  124. Salimans, T.; Ho, J. Progressive distillation for fast sampling of diffusion models. arXiv 2022, arXiv:2202.00512. [Google Scholar] [CrossRef]
  125. Xiang, Q.; Zhang, M.; Shang, Y.; Wu, J.; Yan, Y.; Nie, L. Dkdm: Data-free knowledge distillation for diffusion models with any architecture. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 2955–2965. Available online: https://cvpr.thecvf.com/virtual/2025/poster/33250 (accessed on 12 June 2025).
  126. Bao, F.; Nie, S.; Xue, K.; Cao, Y.; Li, C.; Su, H.; Zhu, J. All are worth words: A vit backbone for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22669–22679. [Google Scholar] [CrossRef]
Figure 1. The development timeline of diffusion models and their representative works.
Figure 2. Overview of denoising diffusion probabilistic models.
Figure 3. The backbone of the SR3 network.
Figure 4. Overview of the pre-trained diffusion model for image restoration.
Figure 5. Illustration of the key steps of the IIGDM method.
Figure 6. Image restoration based on synthetic data.
Figure 7. Overview of RePaint.
Figure 8. Overview of RainDiffusion.
Table 1. Overview of diffusion model-based image super-resolution reconstruction methods.

| Model | Year | Structure | Method | Contribution | Limitation |
|---|---|---|---|---|---|
| SR3 [29] | 2022 | U-Net | Channel concatenation is performed in DDPM, followed by iterative upsampling. | The first use of DDPM for SR. | Cannot effectively handle severely degraded LR images. |
| SRdiff [57] | 2022 | CNN + ResNet | DDPM uses the LR image encoding as input for both forward diffusion and reverse SR image reconstruction. | Uses a pre-trained encoder to extract information from the LR image. | Sensitive to conditional guidance. |
| CDPMSR [36] | 2023 | U-Net | The LR image is passed through a pre-trained SR model, generating a conditional vector. | First use of conditional diffusion models for single-image SR reconstruction. | Sensitive to noise in the input image. |
| Res-diff [59] | 2024 | CNN + U-Net | Adopts a residual learning strategy to fuse features from a pre-trained CNN and the diffusion model. | Fuses features extracted by two models to enhance task performance. | Limited effectiveness in real-world complex scenes. |
| SDE-SR [60] | 2022 | U-Net | Encodes facial features into vectors and reconstructs them using an SDE solver. | The first application of SDEs to facial super-resolution. | Training on HR images is time-consuming. |
| LDM [18] | 2022 | U-Net + GAN | Diffusion occurs in the latent space. | Image attributes are controlled by the latent vector's dimensions. | Image quality is limited by the training data. |
| CDM [56] | 2022 | U-Net | Multiple diffusion models are cascaded, each generating a different level of image detail. | Leverages multi-level image details to generate high-quality images. | Involves multiple diffusion models. |
| SR3+ [61] | 2023 | U-Net + CNN | Encodes the LR image into a conditional vector, used at each time step of reverse generation. | Addresses real-world SR image reconstruction tasks. | Sensitive to the real-world training set. |
| IDM [40] | 2023 | U-Net + CNN | Implicit diffusion processes use continuous ODEs. | Generates super-resolution images at continuous scales. | Generated images show artifacts. |
| Stable-SR [58] | 2024 | U-Net | Utilizes feature maps at different scales to fine-tune the pre-trained Stable Diffusion. | Enhances the model's adaptability to real-world images. | Additional training is required for specific scenarios. |
| ILVR [55] | 2021 | U-Net + CNN | Projection contrast: applies the error vector in the latent space and iteratively adjusts the latent variables. | Selects the conditional injection method based on the task and conditional information. | The diversity of the generated images is insufficient. |
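The conditioning mechanism shared by SR3 and several other methods in Table 1 is simple to state in code: the LR image is upsampled to the target resolution and concatenated with the current noisy estimate along the channel axis before each denoising step. The following is a minimal, illustrative sketch of one reverse DDPM step under that convention; the `denoiser` stand-in, variable names, and noise schedule are assumptions for illustration, not the authors' implementations.

```python
# Minimal sketch of an SR3-style conditional reverse step (illustrative only; the
# denoiser below is a placeholder for a trained U-Net, and the schedule is arbitrary).
import torch
import torch.nn.functional as F

def ddpm_sr_step(x_t, lr_img, t, denoiser, betas, alphas_cumprod):
    """One reverse step p(x_{t-1} | x_t, LR). The LR image is bicubically upsampled
    and concatenated with x_t along the channel axis before noise prediction."""
    lr_up = F.interpolate(lr_img, size=x_t.shape[-2:], mode="bicubic", align_corners=False)
    eps = denoiser(torch.cat([x_t, lr_up], dim=1), t)            # predicted noise
    alpha_t, alpha_bar_t = 1.0 - betas[t], alphas_cumprod[t]
    mean = (x_t - betas[t] / torch.sqrt(1.0 - alpha_bar_t) * eps) / torch.sqrt(alpha_t)
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + torch.sqrt(betas[t]) * noise                   # sigma_t^2 = beta_t choice

# Toy usage with random tensors and a dummy denoiser, just to exercise the shapes.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
dummy_denoiser = lambda x, t: torch.randn(x.shape[0], 3, *x.shape[-2:])
x_prev = ddpm_sr_step(torch.randn(1, 3, 128, 128), torch.rand(1, 3, 32, 32),
                      T - 1, dummy_denoiser, betas, alphas_cumprod)
```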
Table 2. Overview of diffusion model-based image deblurring methods.

| Model | Year | Structure | Method | Contribution | Limitation |
|---|---|---|---|---|---|
| Deblur-DPM [62] | 2022 | U-Net + CNN | A DDPM generates and refines a blurry image to produce the final output. | Effectively handles various types of blur. | The model exhibits weak generalization performance. |
| DG-DPM [63] | 2022 | U-Net + CNN | Conditional guidance directs the diffusion model's deblurring process. | Processes real-world blurred images. | Image details are lost or artifacts appear. |
| BlindDPS [44] | 2023 | U-Net + CNN | Two parallel diffusion models separately model the blur kernel and the image. | Handles blind deblurring without prior knowledge. | Overly reliant on blur-kernel data. |
| DiffIR [26] | 2023 | U-Net + Transformer | Image priors from pre-training guide DIRformer's deblurring. | The Transformer structure improves restoration results. | Overly reliant on scenario-specific data. |
| InDI [64] | 2023 | CNN | An inversion network learns deblurring as an alternative to diffusion models. | The model is simple and quick to train. | Restorations are prone to artifacts. |
| SNIPS [65] | 2021 | U-Net | Spectral SVD reduces dimensions; a network then restores the image. | Requires no prior knowledge. | Model training converges poorly. |
| DDRM [48] | 2022 | U-Net | Obtains accurate restored images with fewer sampling steps. | Improves the inference efficiency of the model. | Restored images still contain blur. |
| Gibbs-DDRM [45] | 2023 | U-Net + CNN | Sampling divides variables into fixed and sampled subsets. | Effectively reduces the computational cost. | The model is complex and prone to overfitting. |
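Several entries in Table 2 (e.g., Deblur-DPM [62]) follow a predict-and-refine pattern: a deterministic network produces an initial deblurred estimate, and a stochastic diffusion-based component supplies residual detail. The sketch below only illustrates that composition; both sub-modules are placeholders, not the published architectures.

```python
# Hedged sketch of the predict-and-refine composition (stand-in modules throughout).
import torch
import torch.nn as nn

class PredictAndRefine(nn.Module):
    """Deterministic initial predictor + stochastic residual sampler."""
    def __init__(self, predictor: nn.Module, residual_sampler):
        super().__init__()
        self.predictor = predictor
        self.residual_sampler = residual_sampler  # e.g., a conditional diffusion sampler

    def forward(self, blurry):
        x_init = self.predictor(blurry)                    # coarse deblurred estimate
        residual = self.residual_sampler(blurry, x_init)   # sampled high-frequency detail
        return x_init + residual

# Toy usage with stand-ins: a single conv as "predictor", random noise as "residual".
model = PredictAndRefine(nn.Conv2d(3, 3, kernel_size=3, padding=1),
                         lambda y, x: 0.1 * torch.randn_like(x))
restored = model(torch.rand(1, 3, 64, 64))
```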
Table 3. Overview of diffusion model-based image inpainting methods.

| Model | Year | Structure | Method | Contribution | Limitation |
|---|---|---|---|---|---|
| RePaint [54] | 2022 | U-Net + CNN | Degraded image regions are noised separately; their fusion is the denoising input. | Diffusion models are first applied to image inpainting. | Overly dependent on the complexity of the occluded region. |
| COPAINT [66] | 2023 | U-Net + CNN | Conditional guidance uses the damaged image's mask features. | DDIM is first applied to image inpainting. | Sensitive to occlusion masks. |
| Palette [30] | 2022 | U-Net + CNN | Image inpainting is cast as image translation conditioned on a target style. | Restored image details and texture are improved. | Reliant on the target style image. |
| CCDF [67] | 2022 | U-Net | Stochastic contraction theory for SDEs explains and optimizes conditional diffusion. | Decreases the number of iterations of the diffusion process. | The model fails to speed up diffusion. |
| Dirac-Diffusion [68] | 2024 | U-Net + CNN | Incremental reconstruction ensures step-wise data consistency. | Improves the robustness of the model. | Sensitive to constraint settings. |
| RED-Diff [69] | 2023 | U-Net + CNN | Variational inference is used to optimize the diffusion model. | Effectively handles linear inverse problems and improves interpretability. | The model tends to overfit easily. |
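The RePaint-style conditioning summarized in Table 3 can be expressed as a per-step fusion: the known region is obtained by forward-noising the undamaged image to the current step, the unknown region comes from the reverse process, and the two are blended with the inpainting mask. The snippet below is a schematic of that fusion step only, with assumed names and without RePaint's resampling schedule.

```python
# Schematic of the per-step known/unknown fusion used in RePaint-like inpainting.
import torch

def fuse_known_unknown(x_unknown, x0, mask, t, alphas_cumprod):
    """mask == 1 marks known (undamaged) pixels. x_unknown is the reverse-process
    sample for step t; x0 is the original image, forward-noised to the same step."""
    a_bar = alphas_cumprod[t]
    x_known = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * torch.randn_like(x0)
    return mask * x_known + (1.0 - mask) * x_unknown

# Toy usage with a random image and a random binary mask.
alphas_cumprod = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, 1000), dim=0)
x0 = torch.rand(1, 3, 64, 64)
mask = (torch.rand(1, 1, 64, 64) > 0.5).float()
x_t = fuse_known_unknown(torch.randn_like(x0), x0, mask, 500, alphas_cumprod)
```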
Table 4. Overview of diffusion model-based image deraining, desnowing, and dehazing methods.

| Model | Year | Structure | Method | Contribution | Limitation |
|---|---|---|---|---|---|
| WeatherDiff [70] | 2023 | U-Net + CNN | The diffusion model takes small image patches as input. | First use of diffusion models for adverse-weather image restoration. | Inappropriate patch division yields poor restorations. |
| RainDiffusion [52] | 2023 | U-Net + CNN | Cycle-consistent pairs train a degradation-conditional diffusion model. | The first use of diffusion models for image deraining. | The model tends to overfit easily. |
| Refusion [47] | 2023 | U-Net + CNN | Latent vectors are compressed, reduced, and then diffused. | Achieves large-scale real-world image dehazing. | The model struggles with varied restoration tasks. |
| IR-SDE [71] | 2023 | CNN | Image generation is described by a mean-reverting SDE. | First use of a mean-reverting SDE for deraining and dehazing. | Severely degraded images remain challenging. |
| SUD2 [72] | 2023 | U-Net | Training is conducted with fewer paired training samples. | Restoration quality is better for complex scenes. | Sensitive to DDPM training quality. |
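The mean-reverting SDE behind IR-SDE [71] in Table 4 drifts the clean image toward its degraded counterpart while injecting noise, so the terminal state is approximately the degraded image plus Gaussian noise. A simple Euler–Maruyama discretization of that forward process is sketched below; the constant θ and σ values are arbitrary illustrative choices, not the schedule used in the paper.

```python
# Euler-Maruyama simulation of a mean-reverting forward SDE: dx = theta*(mu - x)dt + sigma*dW.
import torch

def mean_reverting_forward(x0, mu, steps=100, theta=4.0, sigma=0.5):
    x, dt = x0.clone(), 1.0 / steps
    for _ in range(steps):
        x = x + theta * (mu - x) * dt + sigma * (dt ** 0.5) * torch.randn_like(x)
    return x  # close to the degraded image mu, plus stationary noise

# Toy usage: "clean" and "degraded" stand-ins with matching shapes.
clean, degraded = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
x_T = mean_reverting_forward(clean, degraded)
```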
Table 5. Overview of SOTA models for each restoration task.

| Application | Model | PSNR (dB) | SSIM | LPIPS | FID |
|---|---|---|---|---|---|
| Image Super-Resolution | SR3 [29] | 26.4 | 0.762 | - | 5.2 |
| | Res-diff [59] | 27.94 | 0.72 | - | 106.71 |
| | LDM [18] | 25.8 | 0.74 | - | 4.4 |
| | Stable-SR [58] | 24.17 | 0.62 | 0.30 | 24.10 |
| Image Deblurring | Deblur-DPM [62] | 33.23 | 0.963 | 0.078 | 17.46 |
| | DiffIR [26] | 33.20 | 0.963 | - | - |
| | InDI [64] | 31.49 | 0.946 | 0.058 | 3.55 |
| | DDRM [48] | 35.64 | 0.95 | 0.71 | 20 |
| Image Inpainting | RePaint [54] | - | - | 0.12 | - |
| | COPAINT [66] | - | - | 0.18 | - |
| | Palette [30] | - | - | - | 5.2 |
| | DiracDiffusion [68] | 28.92 | 0.8958 | 0.1676 | 38.25 |
| Image Deraining | WeatherDiff [70] | 30.71 | 0.93 | - | - |
| | RainDiffusion [52] | 36.85 | 0.97 | - | - |
| | IR-SDE [71] | 38.30 | 0.98 | 0.014 | 7.9 |
| Image Desnowing | WeatherDiff [70] | 35.83 | 0.96 | - | - |
| | IR-SDE [71] | 31.65 | 0.90 | 0.047 | 18.64 |
| Image Dehazing | DA-CLIP [80] | 31.39 | 0.98 | - | - |
| | DiffUIR [81] | 31.14 | 0.90 | - | - |
| | IR-SDE [71] | 30.70 | 0.90 | 0.064 | 6.32 |
| Image Denoising | DA-CLIP [80] | 24.36 | 0.58 | 0.272 | 64.71 |
| | IR-SDE [71] | 28.09 | 0.79 | 0.101 | 36.49 |
| | DDRM [48] | 25.21 | 0.66 | 12.43 | 20 |
| Low-Light Enhancement | WaveDiff [76] | 28.86 | 0.876 | 0.207 | 45.359 |
| | DiD [78] | 23.97 | 0.84 | 0.12 | - |
| | LLDiffusion [79] | 31.77 | 0.902 | 0.040 | - |
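For readers reproducing comparisons like Table 5, the distortion metrics can be computed directly, while LPIPS and FID require pretrained feature extractors. Below is a small sketch using scikit-image for PSNR and SSIM on aligned uint8 RGB images; the function and variable names are illustrative assumptions.

```python
# PSNR/SSIM for a restored/reference image pair via scikit-image; LPIPS and FID need
# learned models (e.g., the `lpips` and `pytorch-fid` packages) and are omitted here.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def fidelity_metrics(restored: np.ndarray, reference: np.ndarray):
    psnr = peak_signal_noise_ratio(reference, restored, data_range=255)
    ssim = structural_similarity(reference, restored, channel_axis=-1, data_range=255)
    return psnr, ssim

# Toy usage with a synthetic reference/restored pair.
ref = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
out = np.clip(ref.astype(int) + np.random.randint(-10, 10, ref.shape), 0, 255).astype(np.uint8)
print(fidelity_metrics(out, ref))
```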
Table 6. Overview of widely used datasets in different diffusion model-based image restoration tasks.

| Application | Dataset | Year | Training Set | Test Set | Characteristic |
|---|---|---|---|---|---|
| Image Super-Resolution | DIV2K [82] | 2017 | 800 | 100 | 2K resolution |
| | Flickr2K [83] | 2017 | 2650 | - | 2K resolution |
| | Set5 [87] | 2012 | - | 5 | 5 image classes |
| | Set14 [88] | 2012 | - | 14 | 14 image classes |
| | Manga109 [89] | 2015 | - | 109 | 109 comic images |
| | Urban100 [90] | 2015 | - | 100 | 100 urban images |
| | OST300 [91] | 2018 | - | 300 | Outdoor scene images |
| | DIV8K [111] | 2019 | 1304 | 100 | 8K resolution |
| | RealSR [92] | 2019 | 565 | 30 | Real-world images |
| | DRealSR [93] | 2020 | 884 | 83 | Large-scale dataset |
| Image Deblurring | GoPro [94] | 2017 | 2103 | 1111 | 1280 × 720 blurred images |
| | HIDE [97] | 2019 | 6397 | 2025 | Clear and blurred image pairs |
| | RealBlur [95] | 2020 | 3758 | 980 | 182 different scenes |
| Image Inpainting | ImageNet [85] | 2010 | 1,281,167 | 100,000 | 1000 classes of images |
| | Places365 [112] | 2019 | 1,800,000 | 36,000 | 434 classes of scenes |
| | LFW [113] | 2008 | 13,233 | - | 1080 website facial images |
| | FFHQ [84] | 2019 | 70,000 | - | 1024 × 1024 face images |
| | CelebA-HQ [86] | 2018 | 30,000 | - | 1024 × 1024 face images |
| | AFHQ [114] | 2020 | 15,000 | - | 512 × 512 animal face images |
| | CelebA [86] | 2015 | 202,599 | - | 178 × 218 images |
| Image Deraining, Desnowing, and Dehazing | RainDrop [108] | 2018 | 1119 | - | Various rainy scenes |
| | Outdoor-Rain [109] | 2019 | 9000 | 1500 | Outdoor rainy images |
| | DDN-data [107] | 2017 | 9100 | 4900 | Real-world clear/rainy image pairs |
| | SPA-data [110] | 2019 | 295,000 | 1000 | Various natural rainy scenes |
| | CSD [103] | 2021 | 8000 | 2000 | Large-scale snowy dataset |
| | Snow100K [104] | 2017 | 50,000 | 50,000 | 1369 real snowy scene images |
| | SRRS [105] | 2020 | 25,000 | - | Online real-world scenes |
| | Haze-4K [100] | 2021 | 4000 | - | Indoor–outdoor hazy scenes |
| | Dense-Haze [101] | 2019 | 33 | - | Outdoor hazy scenes |
| | RESIDE [102] | 2019 | 443,950 | 5342 | Real-world hazy images |
| Image Denoising | CBSD68 [115] | 2001 | - | 68 | Various noise levels |
| | McMaster [116] | 2011 | - | 18 | Crop size: 500 × 500 |