1. Introduction
Image super-resolution (SR) refers to the task of generating high-resolution (HR) images from given low-resolution (LR) images. This task has gained significant attention due to its critical applications in scenarios that require high image clarity, such as autonomous vehicles [1,2], facial recognition [3,4], and environmental monitoring [5,6,7]. In these domains, accurately recovering fine details from low-resolution input is essential for effective analysis and decision making. For example, in facial recognition scenarios such as access control systems and law enforcement investigations, high-quality facial super-resolution technology can help extract key features from blurred or low-resolution face images. This, in turn, improves the accuracy of facial recognition algorithms, enabling reliable identity verification and personnel tracking.
Since the groundbreaking work of Dong et al. [8], convolutional neural network (CNN)-based methods [9,10,11,12] have achieved remarkable progress in accuracy and efficiency for SR. For example, Shi et al. [9] introduced a real-time SR network utilizing sub-pixel convolutions. Lim et al. [13] proposed Enhanced Deep Residual Networks for Single Image Super-Resolution (EDSR), which is based on a residual learning architecture. It consists of multiple residual blocks, each containing two convolutional layers with ReLU activations, but without batch normalization to prevent information loss. The network performs image upsampling at the end using sub-pixel convolution to generate high-resolution outputs. Ref. [14] further improved detail reconstruction by incorporating attention mechanisms. Although effective in many applications, these methods still suffer from limitations when reconstructing fine details in high-magnification SR tasks. CNN-based models tend to suppress high-frequency components, which are crucial for recovering intricate textures. This makes them less suitable for applications that require highly detailed images, such as medical imaging or satellite imagery.
In addition to CNN-based methods, several other SR techniques have been explored. Generative adversarial networks (GANs) have shown great promise in generating realistic textures and fine details by simulating a min–max game between a generator and a discriminator. However, GANs can be unstable during training, leading to mode collapse or artifacts in the generated images. For example, SRGAN [15] exhibits minor reconstruction artifacts in some cases; meanwhile, DeSRA [16] found that LDL [17] still shows obvious artifacts when inferring real-world super-resolution. Moreover, GANs struggle to consistently produce high-frequency details, especially when applied to high-resolution reconstructions.
Another approach to SR involves the use of recurrent neural networks (RNNs) [18,19,20], which refine images over multiple iterations to improve detail. However, RNNs suffer from slow processing times and are computationally intensive, which limits their applicability to time-sensitive tasks. Furthermore, RNNs often face challenges in capturing long-range dependencies across images, leading to suboptimal reconstruction of intricate textures and fine details over large areas.
Transformer-based super-resolution methods leverage attention mechanisms to model long-range dependencies and complex visual patterns. CAMixer [21] adjusts attention via a learnable predictor to enhance convolutional representations, but increases the computational cost for high-resolution images. CFAT [22] combines local attention with global channel-wise attention, and while the Overlapping Cross-Fusion Attention Block (OCFAB) improves efficiency, its MAC complexity grows quadratically with the number of pixels, limiting high-resolution applicability. ATDSR [23] uses a token dictionary for adaptive optimization, but its grouping strategy adds extra parameters, increasing computational overhead.
In recent years, diffusion models have demonstrated remarkable performance in image super-resolution tasks [24,25,26]. These models generate high-quality, high-resolution images through iterative denoising processes and exhibit strong capabilities in detail restoration. For example, SR3 [27] enhances image fidelity by conditioning the generation process on low-resolution inputs, while IDM [28] enables continuous-resolution outputs through scale-aware modeling, offering greater flexibility. Such models are widely applied in domains requiring high image fidelity, such as medical imaging and remote sensing.
However, the low inference efficiency of diffusion probabilistic models (DPMs) remains a significant barrier to their practical deployment (see Figure 1). Due to their iterative generation nature, even with optimization techniques, these models struggle to meet the computational demands of real-time inference, which is particularly problematic in time-sensitive applications such as autonomous driving and large-scale image analysis.
To address this issue, numerous studies have proposed acceleration strategies. For example, LDM [29] accelerates inference by denoising in a latent space and adopting a scheduler such as Denoising Diffusion Implicit Models (DDIM) [30] to reduce the number of sampling steps. However, the inherent loss of high-frequency information during latent space compression fundamentally limits the pixel-level detail restoration capability of latent diffusion models (LDMs), as confirmed in prior research. Rombach et al. [29] explicitly stated that autoencoder-based compression “sacrifices high-frequency details”: when the downsampling factor of the autoencoder is large, the perceptual compression stage removes a significant amount of high-frequency information, leading to detail loss in high-resolution synthesis. The work in [31] proposed a sampling space mixture of experts to extend diffusion models and utilized a frequency compensation decoder to enhance details in super-resolution images; however, it did not fully address the distortion caused by latent space compression. LSRNA [32] effectively mitigates the manifold deviations caused by direct upsampling methods (e.g., bicubic) by learning a mapping from low-resolution latent representations to the high-resolution manifold, and it maintains strong performance even with only 30 denoising steps. However, the optimal strength of RNA for detail enhancement is not fixed and requires adaptive adjustment based on the specific method and noise schedule. Other works aim to optimize the time step schedules of numerical ODE solvers [33] within the diffusion process to improve both computational efficiency and generative quality. BlockDance [34] identifies and caches structurally stable spatiotemporal features to enable fast inference with minimal quality degradation, and its enhanced variant, BlockDance-Ada [34], further leverages reinforcement learning to dynamically allocate computation based on instance-specific policies, achieving a better quality–speed trade-off. DeepCache [35] introduces a training-free general acceleration paradigm by exploiting temporal redundancy in high-level features across adjacent denoising steps. Despite the efficiency gains achieved by these methods, they still face several limitations, such as increased strategy complexity, quality degradation caused by latent space compression, or optimization restricted solely to the inference stage.
Compared to traditional sparse learning methods [36], which typically rely on iterative pruning/retraining or dynamic architecture adjustments, we directly adopt the sparse momentum method [37]. This approach maintains structural sparsity throughout the entire training process, leveraging momentum information to guide weight pruning and regeneration. As a result, it significantly reduces computational costs while maintaining model performance.
In this study, we further analyze feature dynamics in DM-based SR tasks and observe that feature variations in the encoder are relatively stable, while those in the decoder are significantly more volatile. Inspired by fast diffusion modeling methods [38], we use a state reuse mechanism, where time steps with substantial feature changes are designated as key time steps and the others as non-key time steps. By skipping selected non-key time steps, we achieve faster inference. Compared to the LDM-based [29] SR method, our approach leverages past states as perturbations to improve texture information and enhance detail recovery, without sacrificing speed. Furthermore, by processing multiple time steps in a single iteration, it reduces feature redundancy and accelerates the process. Additionally, we adopt a sparse momentum strategy that maintains structural sparsity during training to reduce computational complexity. Unlike traditional sparsification methods that compress pretrained dense models, our approach preserves sparsity throughout training, making it natively compatible with sparse inference frameworks and offering a superior balance between performance and efficiency. We summarize the contributions of our work as follows.
We develop a model named SMFDM that rapidly generates high-fidelity facial images through an encoder state reuse mechanism. During the sampling phase, SMFDM divides all the time steps in advance into key time steps and non-key time steps. The U-Net encoder is executed only at the key time steps, and its features are reused at the non-key time steps, which eliminates a large number of computations, enhances the computational efficiency of the U-Net architecture, and accelerates convergence while preserving image details.
We also introduce a sparse momentum method for dynamic weight sparsity management during training. Starting with random weight pruning, we assess the importance of remaining weights layer by layer using momentum means, reassigning them accordingly. We prioritize the growth or recovery of weights that most contribute to error reduction, guided by the momentum of zero-value weights within the same layer. Through iterative cycles of magnitude pruning and momentum-driven regrowth, we optimize the weight distribution, preserving critical weights while maintaining overall network sparsity. As a result, our model reduces computational demands without sacrificing feature representation, effectively cutting costs while handling complex image features. This dual strategy enhances both efficiency and model performance.
Experimental results on benchmark datasets show that SMFDM outperforms existing super-resolution methods in both speed and image quality. Specifically, SMFDM accelerates processing time by 71.04% compared to existing methods, while maintaining high-fidelity output with exceptional detail preservation.
The rest of this paper is structured as follows. Section 2 describes the proposed Sparse Momentum-based Faster Diffusion Model. Section 3 describes the experiments and results on super-resolution. Section 4 and Section 5 provide a brief discussion and conclusion, respectively.
2. Proposed Methods
2.1. Problem Formulation
Given a low-resolution facial image $x$, the goal is to reconstruct a high-resolution image $x_0$ at $s$ times the input resolution (where $s$ is the scaling factor). The diffusion model achieves this through a bidirectional process. The forward diffusion process gradually adds Gaussian noise to the original image $x_0$ to form a Markov chain:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\big),$$

where $\beta_t$ controls the noise intensity, and the reverse denoising process learns the mapping

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 \mathbf{I}\big).$$

This reverse process, via a parameterized network $\epsilon_\theta$ (we employ a U-Net architecture here), predicts the original signal based on the noisy image $x_t$ and the time step $t$. Starting from $x_T \sim \mathcal{N}(0, \mathbf{I})$, the model generates $x_0$ through $T$-step iterative refinement. However, standard diffusion models suffer from high computational costs due to repeated encoder evaluations at each step. To address this, our SMFDM method presents two key improvements: First, a sparse momentum optimization strategy is employed to dynamically maintain the sparse topology of the U-Net during the training phase (retaining 15% weight density), where a pruning–regeneration algorithm (Algorithm 1) ensures the retention of critical connections. Second, an encoder state reuse mechanism is introduced to cache intermediate features, thereby reducing redundant computations (Algorithm 2).
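For concreteness, the stepwise forward noising above can be written in a few lines of PyTorch. This is a minimal illustrative sketch: the linear beta schedule and the tensor shapes are assumptions for the example, not values used by SMFDM.

```python
import torch

def forward_diffusion_step(x_prev, beta_t):
    """One Markov-chain step of the forward process:
    x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise."""
    noise = torch.randn_like(x_prev)
    return torch.sqrt(1.0 - beta_t) * x_prev + torch.sqrt(beta_t) * noise

# Illustrative linear noise schedule (values are assumptions, not the paper's).
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)

x = torch.randn(1, 3, 64, 64)  # stand-in for a normalized HR face image x_0
for t in range(T):
    x = forward_diffusion_step(x, betas[t])
```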
Algorithm 1 Sparse Momentum Algorithm with Weight Redistribution

1: Input: Layers L, Momentum M, Weights W, Mask B, Prune rate p, Density d
2: Output: Sparse weights W, Mask B, Momentum M
3: for each layer l in L do
4:     Initialize W_l and B_l randomly according to the density d
5:     ApplyMask(W_l, B_l)
6: end for
7: for each epoch do
8:     for each layer l in L do    ▹ Redistribute excess regenerated weights
9:         R_l ← set of weights planned for regeneration in layer l
10:        A_l ← set of available (zero-valued) weights in layer l
11:        r_l ← |R_l|, a_l ← |A_l|    ▹ Compute counts
12:        if r_l > a_l then
13:            surplus ← r_l − a_l
14:            Redistribute the surplus equally among the remaining layers
15:        end if
16:    end for
17:    for each batch b do
18:        gradients ← ComputeGradients(W, FetchBatch(b))
19:        UpdateMomentum(M, gradients), UpdateWeights(W, M)
20:        for each layer l in L do
21:            ApplyMask(W_l, B_l)
22:        end for
23:    end for
24:    totalMomentum ← CalcTotalMomentum(M), totalPruned ← CalcPruningQuota(W, p)
25:    for each layer l in L do
26:        momentumRatio ← CalcLayerMomentumRatio(M_l, B_l, totalMomentum)
27:        PruneLowWeights(W_l, B_l, p), RegrowByMomentum(M_l, B_l, momentumRatio · totalPruned)
28:        AnnealRate(p), ApplyMask(W_l, B_l)
29:    end for
30: end for
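The following PyTorch-style sketch illustrates the core of Algorithm 1 under simplifying assumptions: pruning is done by weight magnitude, regrowth by accumulated momentum, and the inter-layer redistribution of surplus quotas (lines 8-16) is omitted. Function and variable names (`init_masks`, `prune_and_regrow`, `density`) are ours, not the paper's.

```python
import torch

def init_masks(params, density=0.15):
    """Randomly keep `density` of the weights in each layer (Algorithm 1, lines 3-6)."""
    masks = {}
    for name, w in params.items():
        mask = (torch.rand_like(w) < density).float()
        w.data.mul_(mask)          # zero out the pruned connections
        masks[name] = mask
    return masks

def prune_and_regrow(params, masks, momenta, prune_rate=0.5):
    """Epoch-end update: prune low-magnitude active weights, regrow zero-valued
    weights with the largest accumulated momentum (Algorithm 1, lines 24-29)."""
    for name, w in params.items():
        mask, mom = masks[name], momenta[name]
        active = mask.bool()
        n_prune = int(prune_rate * active.sum().item())
        if n_prune == 0:
            continue
        # (b) prune: drop the smallest-magnitude active weights
        act_vals = torch.where(active, w.abs(), torch.full_like(w, float("inf")))
        drop = torch.topk(act_vals.flatten(), n_prune, largest=False).indices
        mask.view(-1)[drop] = 0.0
        # (c) regrow: re-activate inactive weights with the largest momentum
        inact_mom = torch.where(mask.bool(), torch.full_like(mom, -float("inf")), mom.abs())
        grow = torch.topk(inact_mom.flatten(), n_prune, largest=True).indices
        mask.view(-1)[grow] = 1.0
        w.data.mul_(mask)
```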
Algorithm 2 SMFDM Inference with Encoder State Reuse

Require: Low-resolution image x, diffusion step T, reuse interval k
Ensure: Super-resolved image x_0
1: x_T ← sample from N(0, I)    ▹ start from pure Gaussian noise
2: x_lr ← EDSR(x)    ▹ initial feature map of the low-resolution image
3: for each key time step t (from T downward, one per period of k steps) do
4:     if t < T then
5:         Update the data state by incorporating the past state (Section 2.4)
6:     end if
7:     F_0, G_0 ← input features built from x_t and x_lr
8:     for i = 1 to N do
9:         F_i ← Down(F_{i−1})
10:        G_i ← Conv(G_{i−1})
11:        Replicate F_i and G_i k times
12:    end for
13:    Upsample the encoder features and fuse them with the corresponding G_i using MLP-predicted coefficients
14:    Feed the k fused feature maps to the sparse decoder
15:    Obtain k noise predictions, one for each time step in the current period
16:    for j = 0 to k − 1 do
17:        x_{t−j−1} ← DenoiseStep(x_{t−j}, predicted noise for step t − j)
18:    end for
19: end for
20: return x_0
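A schematic Python sketch of the reuse-interval loop in Algorithm 2 is given below. It simplifies the procedure to one denoising step per iteration, treats every k-th step as a key step (assuming T is divisible by k and that the noise state and the LR feature share spatial dimensions), and uses placeholder callables `encoder`, `decoder`, and `denoise_step` in place of the actual modules.

```python
import torch

@torch.no_grad()
def smfdm_inference(x_lr_feat, encoder, decoder, denoise_step, T=1000, k=5):
    """Run the reverse process, re-encoding only once per period of k steps."""
    x = torch.randn_like(x_lr_feat)              # x_T: start from pure noise
    cached_feats = None
    for t in range(T, 0, -1):
        if t % k == 0 or cached_feats is None:   # key time step: full encoder pass
            cached_feats = encoder(torch.cat([x, x_lr_feat], dim=1))
        # the decoder consumes the (possibly cached) encoder state at every step
        eps = decoder(x, cached_feats, t)
        x = denoise_step(x, eps, t)
    return x
```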
2.2. Overview
As shown in Figure 2 and Figure 3, SMFDM accelerates DM-based SR through two synergistic components.
Sparse Momentum: This component dynamically prunes 85% of the model weights during training using momentum-guided sparsity, maintaining performance with only 15% of the weights (Section 2.3).
Encoder State Reuse: This component groups the T diffusion steps into periods of size k. Only the first step (the key step) in each period runs the full encoder; the subsequent k − 1 steps (non-key steps) reuse the encoder features from the key step (Section 2.4).
Specifically, Figure 2a illustrates the sparse encoder encoding features at time step T for reuse over the subsequent k time steps. The resulting feature maps are ordered from left to right according to their corresponding time steps. Figure 2b shows the decoder leveraging the cached encoder states at time step T to perform denoising, generating the intermediate outputs. Finally, Figure 2c presents an overview of the complete T-step denoising process, which is divided into segments. Each segment begins with a full encoding step followed by k decoding steps. Dashed lines in the figure indicate redundant operations that are omitted, effectively reducing the total number of encoding operations.
To simplify the description, we hereinafter omit all subscripts, including the time step subscript t, for brevity and clarity. The U-Net backbone employs N downsampling/upsampling stages, starting with a base number of channels in the first stage and doubling the channel count at each downsampling step, following the design in IDM [28]. The encoder utilizes residual blocks with self-attention for downsampling (Equation (3)), while the decoder processes the fused features through sparse convolutional layers (Equation (7)) to produce the denoised outputs.
To extract hierarchical features from the low-resolution input image $x$, we adopt an improved EDSR network as our backbone feature extractor. This model optimizes the residual network architecture by removing the BN layers and replaces the original sub-pixel convolution upsampling module with bilinear interpolation, thus obtaining the initial feature map $x_{\mathrm{lr}}$. Concurrently, a noise image $x_T$ sampled from a Gaussian distribution is generated to align with the target spatial dimensions. Next, the feature obtained by concatenating $x_T$ and $x_{\mathrm{lr}}$, along with the original $x_{\mathrm{lr}}$, is fed into the sparse encoder to generate multi-scale feature maps $F_i$ and $G_i$, where $i = 1, \ldots, N$. At each stage, the intermediate feature $F_i$ is obtained by downsampling the previous stage's feature $F_{i-1}$:

$$F_i = \mathrm{Down}(F_{i-1}),$$

where Down consists of two residual blocks with self-attention and a downsampling module. Each residual block includes a 3 × 3 convolutional layer, a normalization layer, an activation function, and optionally a self-attention mechanism. The downsampling module consists of a 3 × 3 convolutional layer with a stride of 2 and padding of 1, which halves the spatial resolution of the feature map. Similarly, $G_i$ is produced by applying a downsampling operation to $G_{i-1}$:

$$G_i = \mathrm{Conv}(G_{i-1}),$$

where Conv includes a 3 × 3 convolutional layer (with stride 1 and padding 1), a bilinear downsampling operation, and a leaky ReLU activation function.
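The two downsampling paths can be sketched in PyTorch as follows. The residual block layout, the GroupNorm/SiLU choices, and the omission of the optional self-attention are assumptions made to keep the example short; only the kernel sizes, strides, and the bilinear downsampling with leaky ReLU follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """3x3 conv + normalization + activation, as in the encoder's residual blocks."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)
        self.norm = nn.GroupNorm(8, ch)   # normalization choice is an assumption
        self.act = nn.SiLU()

    def forward(self, x):
        return x + self.act(self.norm(self.conv(x)))

class Down(nn.Module):
    """Two residual blocks followed by a stride-2 3x3 conv that halves resolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.blocks = nn.Sequential(ResBlock(in_ch), ResBlock(in_ch))
        self.down = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)

    def forward(self, x):
        return self.down(self.blocks(x))

class ConvDown(nn.Module):
    """3x3 conv (stride 1), bilinear downsampling, and leaky ReLU for the second stream."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1)

    def forward(self, x):
        x = self.conv(x)
        x = F.interpolate(x, scale_factor=0.5, mode="bilinear", align_corners=False)
        return F.leaky_relu(x, 0.2)
```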
The feature maps $F_i$ and $G_i$ are reused k times to generate k sets of features. In each reuse iteration, $F_i$ is upsampled to obtain the feature map $\mathrm{Up}(F_i)$, which is then fused with the corresponding $G_i$. The scaling factor $s$ is passed through a five-layer multilayer perceptron (MLP) with an input dimension of 66 and an output dimension of 3. This network consists of 4 hidden layers with 256 units each, all activated by ReLU functions, followed by a linear layer that produces the final output. The outputs generated by the MLP are then used to determine the fusion ratio. After normalization, the fusion coefficients $\alpha$ and $\beta$ are computed, and the fused feature map $F_{\mathrm{fused}}$ is calculated as follows:

$$F_{\mathrm{fused}} = \alpha \cdot \mathrm{Up}(F_i) + \beta \cdot G_i.$$
Each fused feature map is then processed by a sparse decoder network to produce an upsampled feature map corresponding to the current time step. This procedure is repeated k times, resulting in k upsampled feature maps. Finally, an iterative denoising process is applied to progressively refine the results until the final high-resolution image is reconstructed.
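A possible implementation of the scale-conditioned MLP and the fusion step is sketched below. The input dimension of 66, the four hidden layers of 256 ReLU units, and the output dimension of 3 follow the text; how the scale factor is embedded into 66 dimensions and how two of the three outputs are normalized into fusion coefficients (softmax here) are assumptions.

```python
import torch
import torch.nn as nn

class ScaleMLP(nn.Module):
    """Five-layer MLP: input dim 66, four hidden layers of 256 units (ReLU), output dim 3."""
    def __init__(self, in_dim=66, hidden=256, out_dim=3):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(4):
            layers += [nn.Linear(d, hidden), nn.ReLU()]
            d = hidden
        layers.append(nn.Linear(d, out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, scale_embedding):
        return self.net(scale_embedding)

def fuse(feat_up, feat_skip, mlp_out):
    """Fuse the upsampled feature with the corresponding skip feature using
    softmax-normalized coefficients from the MLP output (normalization scheme assumed)."""
    coeffs = torch.softmax(mlp_out[:, :2], dim=1)   # two normalized fusion weights
    a = coeffs[:, 0].view(-1, 1, 1, 1)
    b = coeffs[:, 1].view(-1, 1, 1, 1)
    return a * feat_up + b * feat_skip
```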
2.3. Sparse Momentum
As illustrated in Figure 3, the sparse momentum mechanism optimizes the U-Net parameters via dynamic pruning during training. The term "sparse encoder" refers to an encoder that undergoes iterative weight pruning and regeneration during training (indicated by dashed lines). Dashed arrows represent the momentum-based weight update cycles during training as well as diffusion steps omitted for clarity. The term "sparse decoder" refers to the decoder module, which is also trained using the sparse momentum mechanism. Specifically, when the U-Net performs noise prediction on the noisy input image $x_t$ (blue arrows denote forward information flow), the network outputs the predicted noise $\epsilon_\theta(x_t, t)$. We introduce a sparse momentum-driven masking mechanism to balance network connectivity and performance under a predefined sparsity constraint. This mechanism maintains a fixed sparsity level throughout the training process, enabling the model to achieve performance comparable to that of dense models. Specifically, we apply the sparsity constraint independently to each layer to construct an initial sparse topology, retaining only 15% of the weights while randomly pruning the remaining 85%. The mask is initialized as follows:

$$B_l^{(j)} = \begin{cases} 1, & j \in S_l, \\ 0, & \text{otherwise}, \end{cases}$$

where $B_l$ has the same shape as the corresponding weight $W_l$, a value of 1 indicates that the connection is active, $S_l$ is a randomly selected index set covering a fraction $d$ of the weights in layer $l$, and the density $d = 0.15$ indicates that 15% of the weights are retained, corresponding to 85% sparsity. To guide the adaptive evolution of the sparse structure during training, we maintain an exponentially smoothed momentum for each layer, and its update formula is

$$M_l^{(i)} = \mu\, M_l^{(i-1)} + (1 - \mu)\, \nabla_{W_l} E^{(i)},$$

where $\mu$ is a smoothing factor that controls the influence of historical gradient information, $M_l^{(i)}$ represents the momentum corresponding to the weights $W_l$ of the $l$-th layer at the $i$-th iteration, and $E^{(i)}$ represents the value of the error function at the $i$-th iteration, measuring the prediction error of the model after the $i$-th parameter update. Specifically, $M_l$ stores the historical gradient information of the weights in this layer so that the weights are updated more stably during training. The momentum captures long-term gradient contributions and stabilizes weight selection in the presence of noisy updates.
At the end of each training epoch, we apply a sparse momentum update process consisting of three steps: (a) inter-layer weight redistribution, (b) intra-layer pruning, and (c) momentum-guided weight regeneration. In step (a), we calculate the average absolute momentum of the active weights in each layer and normalize it across layers to obtain the momentum contribution ratio $r_l$, which determines how the weight regeneration quota is distributed in the next step. Specifically, we first calculate the average absolute momentum of the active weights in each layer:

$$\bar{M}_l = \frac{\sum_{j} B_l^{(j)}\, \big|M_l^{(j)}\big|}{\sum_{j} B_l^{(j)}},$$

where $B_l^{(j)}$ represents the mask value corresponding to the $j$-th weight in the $l$-th layer, and $M_l^{(j)}$ is the momentum of the $j$-th weight in the $l$-th layer. Then, the contribution ratio $r_l$ of each layer is calculated as $r_l = \bar{M}_l / \sum_{l'} \bar{M}_{l'}$, where the denominator is the sum of $\bar{M}_{l'}$ over all layers.
In step (b), we sort the currently active weights in each layer according to their momentum magnitudes and set the mask values of the 50% of the weights with the lowest momentum to zero for pruning. For layers approaching the sparsity limit, the unutilized pruning quota is redistributed to other layers to maintain global sparsity. In step (c), a fixed number of pruned weights are re-activated in each layer according to the contribution ratio $r_l$; that is, we select the zero-valued weights with the highest accumulated momentum. If the number of available weights for selection in a certain layer is insufficient, the surplus number of weight regenerations is redistributed to other layers to ensure the efficient utilization of structural resources.
Specifically, if the planned number of regenerated weights for a layer exceeds the available capacity of that layer, we redistribute the surplus regeneration quota to other layers. This is determined by comparing the planned regeneration number with the number of available weights in each layer, which varies depending on the type of the layer. If the planned number exceeds the available count, the surplus regeneration weights are redistributed equally to the other layers, ensuring the overall efficiency of the sparse structure while maintaining the sparsity constraint. The details of the training procedure of sparse momentum are shown in Algorithm 1.
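The redistribution of surplus regeneration quotas can be expressed as a small helper; the equal-split policy follows the description above, while the dictionary-based interface and the example numbers are illustrative only.

```python
def redistribute_regrowth(planned, capacity):
    """Cap each layer's regrowth quota at its number of available (zero-valued) weights
    and spread any surplus equally over the layers that still have room."""
    quotas = dict(planned)
    surplus = 0
    for layer, n in planned.items():
        if n > capacity[layer]:
            surplus += n - capacity[layer]
            quotas[layer] = capacity[layer]
    while surplus > 0:
        open_layers = [l for l in quotas if quotas[l] < capacity[l]]
        if not open_layers:
            break                      # no remaining capacity anywhere
        share = max(1, surplus // len(open_layers))
        for l in open_layers:
            add = min(share, capacity[l] - quotas[l], surplus)
            quotas[l] += add
            surplus -= add
            if surplus == 0:
                break
    return quotas

# Example: layer "dec1" wants 120 regrown weights but only 80 slots are free,
# so the surplus of 40 is shifted to "enc1".
print(redistribute_regrowth({"enc1": 50, "dec1": 120}, {"enc1": 200, "dec1": 80}))
```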
As shown in Figure 2, through repeated pruning (in red) and regeneration (in green) iterations, the model gradually evolves from a fully connected network to a sparse yet expressive topology. To stabilize training in its later stages, we adopt a cosine annealing schedule to gradually reduce the pruning rate. This mechanism enables our model to achieve reconstruction performance comparable to that of dense models under high sparsity, improving both the training efficiency and the scalability of the model. By applying this sparse learning mechanism, we obtain an efficient and effective network structure whose performance is comparable to that of its dense counterpart.
2.4. Encoder State Reuse
To accelerate inference without compromising reconstruction quality, we use an encoder state reuse framework for the super-resolution task. Specifically, the reverse diffusion time step sequence is divided into multiple periods, each containing k time steps. The first time step in each period (i.e., the largest time index) is designated as the key time step, at which the encoder performs a full forward pass to extract features. The remaining steps within the same period are defined as non-key time steps, at which the encoder state computed at the key time step is reused to avoid redundant computation.
For example, the first period includes the key time step T and the non-key time steps T − 1, …, T − k + 1. The time step set is partitioned into key and non-key parts whose union covers all T steps. During inference, the entire diffusion process of T steps is split into T/k effective steps, each corresponding to k consecutive original time steps processed together in one iteration. This state reuse mechanism leverages the strong temporal correlation of encoder features between adjacent steps: features computed at key time steps are reused in subsequent non-key steps, significantly reducing computation while preserving reconstruction quality.
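A minimal sketch of this partitioning, assuming T is divisible by k and that the largest index in each period serves as the key step:

```python
def partition_timesteps(T, k):
    """Split the reverse-time sequence T, T-1, ..., 1 into periods of length k;
    the first (largest) step of each period is a key step, the rest are non-key."""
    key_steps, non_key_steps = [], []
    for t in range(T, 0, -k):
        period = list(range(t, max(t - k, 0), -1))
        key_steps.append(period[0])
        non_key_steps.extend(period[1:])
    return key_steps, non_key_steps

# With T = 12 and k = 4: key steps [12, 8, 4], non-key steps [11, 10, 9, 7, 6, 5, 3, 2, 1].
keys, non_keys = partition_timesteps(12, 4)
```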
When applying the encoder propagation strategy, for non-key time steps the decoder does not receive newly computed encoder features but instead reuses the encoder state from the most recent key time step. This design enables efficient reuse and parallelization, thereby reducing the computational cost of generating high-resolution images. The encoder reuse strategy is formulated as follows:

$$F_t = \mathrm{Reuse}\big(F_{t_{\mathrm{key}}}\big), \qquad t \in \mathcal{T}_{\mathrm{non\text{-}key}},$$

where Reuse(·) duplicates the feature maps of the key time step $t_{\mathrm{key}}$ and assigns them to each non-key time step $t$ as their feature maps. To enable multi-step denoising within a single iteration, we perform k replication operations on the features extracted by the U-Net encoder and stack these replicated features along the channel dimension. This results in k identical feature maps being provided to the decoder as input. Leveraging these enhanced features, the decoder is capable of simultaneously predicting noise estimates corresponding to k distinct time steps. This design maximally reuses encoder-side information, thereby enriching the temporal context available for denoising.
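The replication and channel-wise stacking of the cached encoder features can be sketched as follows; the list-of-feature-maps interface and the example shapes are assumptions for illustration.

```python
import torch

def reuse_encoder_features(encoder_feats, k):
    """Replicate the encoder feature maps of the key time step k times and stack the
    copies along the channel dimension, so one decoder pass covers k time steps."""
    return [torch.cat([f] * k, dim=1) for f in encoder_feats]

# Example: two feature maps with 64 and 128 channels, reused for k = 5 steps.
feats = [torch.randn(1, 64, 32, 32), torch.randn(1, 128, 16, 16)]
stacked = reuse_encoder_features(feats, k=5)   # channels become 320 and 640
```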
After the sampling phase, we enter the denoising phase. In this phase, the model starts from a pure-noise image and iteratively refines it using the information obtained from the sampling phase. At each iteration, the decoder outputs k noise predictions, which are used to sequentially refine the noisy image across k time steps:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z, \qquad t \in \mathcal{T}_k,$$

where $\mathcal{T}_k$ includes a period of key and non-key time steps in a fixed-length window of size k; $\{\epsilon_t\}_{t \in \mathcal{T}_k}$ represents the set of predicted noise components corresponding to each time step in $\mathcal{T}_k$; $\epsilon_\theta$ is a noise prediction function parameterized by $\theta$, which outputs $\epsilon_t$ given the input $x_t$ and the time embedding of $t$; $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$; $\sigma_t$ denotes the predefined noise scale (i.e., standard deviation) at time step t; and $z \sim \mathcal{N}(0, \mathbf{I})$ is the standard Gaussian noise used in the stochastic reverse process.
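The sequential application of the k predicted noises can be sketched with the standard ancestral update; the precomputed schedule tensors (`alphas`, `alphas_bar`, `sigmas`) and the list-based interface are assumptions of this example.

```python
import torch

def refine_k_steps(x_t, eps_preds, timesteps, alphas, alphas_bar, sigmas):
    """Apply k predicted noises sequentially with the standard ancestral update:
    x_{t-1} = (x_t - (1 - a_t)/sqrt(1 - abar_t) * eps_t) / sqrt(a_t) + sigma_t * z."""
    x = x_t
    for eps, t in zip(eps_preds, timesteps):        # timesteps in descending order
        a_t, abar_t, sigma_t = alphas[t], alphas_bar[t], sigmas[t]
        mean = (x - (1.0 - a_t) / torch.sqrt(1.0 - abar_t) * eps) / torch.sqrt(a_t)
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = mean + sigma_t * z
    return x
```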
To mitigate the potential detail loss caused by encoder state reuse, we incorporate current and past data states to preserve information during the early stages. Specifically, we update the data state at time t as follows:

$$\hat{x}_t = \mathrm{Concat}\big(x_t,\ \epsilon \cdot x_{\mathrm{lr}}\big),$$

where Concat(·) refers to the concatenation of feature maps along the channel dimension, $x_t$ is the latent data state, $x_{\mathrm{lr}}$ is a feature map extracted from a shallow EDSR network, and $\epsilon$ is a small positive scalar (set to 0.003) that controls the influence of $x_{\mathrm{lr}}$. This strategy preserves fine details in earlier steps and prevents information from being prematurely lost during denoising. The details of the inference procedure of our state reuse are shown in Algorithm 2.
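Since the exact combination rule is only partially specified here, the sketch below simply concatenates the current latent state with the EDSR feature scaled by ε = 0.003; treat it as an illustration rather than the exact formulation.

```python
import torch

def augment_state(x_t, edsr_feat, eps=0.003):
    """Concatenate the current latent state with a lightly weighted shallow EDSR
    feature along the channel dimension (the weighting rule is an assumption)."""
    return torch.cat([x_t, eps * edsr_feat], dim=1)

# Example shapes: latent state (1, 3, 128, 128), EDSR feature (1, 64, 128, 128).
x_aug = augment_state(torch.randn(1, 3, 128, 128), torch.randn(1, 64, 128, 128))
```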
4. Discussion
This study proposes SMFDM, which integrates a sparse momentum mechanism with an encoder state reuse strategy to effectively improve both inference speed and reconstruction quality in diffusion-based image super-resolution tasks. While numerous studies have demonstrated the powerful capabilities of diffusion models in detail restoration and image enhancement [24,25,26], traditional diffusion models generally suffer from substantial computational resource demands and slow inference due to their iterative nature. Prior works, such as [27,28], have mainly focused on designing increasingly complex network architectures to enhance performance, often overlooking computational efficiency. In contrast, SMFDM achieves a more balanced trade-off between quality and speed by introducing sparse momentum and selective encoder state reuse, thus significantly enhancing the practical applicability of diffusion models.
Specifically, the encoder state reuse mechanism in SMFDM significantly reduces redundant computation within the U-Net architecture, promoting faster model convergence while ensuring the accurate restoration of image details and improved inference efficiency. This strategy effectively avoids the heavy trade-offs between computational load and network complexity seen in prior studies. Furthermore, the sparse momentum mechanism greatly reduces computational overhead while preserving strong representational capacity, achieving a well-balanced compromise between performance and efficiency.
Notably, our comparative experiments against DDIM clearly demonstrate the significant acceleration benefits achieved by SMFDM, further validating the practical contribution of the proposed method. Moreover, as discussed, the encoder state reuse strategy is not limited to high-dimensional pixel-space processing but can also be applied to latent-space diffusion models such as LDM [29]. Applying state reuse in low-dimensional latent spaces may further reduce computational complexity and expand the applicability of SMFDM to a broader range of diffusion-based vision tasks.
Despite SMFDM’s strong performance across multiple benchmark datasets, further validation is needed to assess its robustness under more extreme conditions, such as ultra-low-resolution or high-noise scenarios. As model scales continue to increase, maintaining the computational efficiency of the sparse momentum mechanism also remains a critical challenge for future research. Additionally, exploring the applicability of both sparse momentum and encoder state reuse in other diffusion-based visual tasks may further accelerate the generation process and unlock broader application potential.