1. Introduction
Image super-resolution (SR) refers to the task of generating high-resolution (HR) images from given low-resolution (LR) images. This task has gained significant attention due to its critical applications in scenarios that require high image clarity, such as autonomous vehicles [1,2], facial recognition [3,4], and environmental monitoring [5,6,7]. In these domains, accurately recovering fine details from low-resolution input is essential for effective analysis and decision making. For example, in facial recognition scenarios such as access control systems and law enforcement investigations, high-quality facial super-resolution technology can help extract key features from blurred or low-resolution face images. This, in turn, improves the accuracy of facial recognition algorithms, enabling reliable identity verification and personnel tracking.
Since the groundbreaking work of Dong et al. [8], convolutional neural network (CNN)-based methods [9,10,11,12] have achieved remarkable progress in accuracy and efficiency for SR. For example, Shi et al. [9] introduced a real-time SR network utilizing sub-pixel convolutions. Lim et al. [13] proposed Enhanced Deep Residual Networks for Single Image Super-Resolution (EDSR), which is based on a residual learning architecture. It consists of multiple residual blocks, each containing two convolutional layers with ReLU activations, but without batch normalization to prevent information loss. The network performs image upsampling at the end using sub-pixel convolution to generate high-resolution outputs. Ref. [14] further improved detail reconstruction by incorporating attention mechanisms. Although effective in many applications, these methods still suffer from limitations when reconstructing fine details in high-magnification SR tasks. CNN-based models tend to suppress high-frequency components, which are crucial for recovering intricate textures. This makes them less suitable for applications that require highly detailed images, such as medical imaging or satellite imagery.
In addition to CNN-based methods, several other SR techniques have been explored. Generative adversarial networks (GANs) have shown great promise in generating realistic textures and fine details by simulating a min–max game between a generator and a discriminator. However, GANs can be unstable during training, leading to mode collapse or artifacts in the generated images. For example, SRGAN [15] exhibits minor reconstruction artifacts in some cases; meanwhile, DeSRA [16] found that LDL [17] still shows obvious artifacts when inferring real-world super-resolution. Moreover, GANs struggle to consistently produce high-frequency details, especially when applied to high-resolution reconstructions.
Another approach to SR involves the use of recurrent neural networks (RNNs) [18,19,20], which refine images over multiple iterations to improve detail. However, RNNs suffer from slow processing times and are computationally intensive, which limits their applicability to time-sensitive tasks. Furthermore, RNNs often face challenges in capturing long-range dependencies across images, leading to suboptimal reconstruction of intricate textures and fine details over large areas.
Transformer-based super-resolution methods leverage attention mechanisms to model long-range dependencies and complex visual patterns. CAMixer [21] adjusts attention via a learnable predictor to enhance convolutional representations, but increases the computational cost for high-resolution images. CFAT [22] combines local attention with global channel-wise attention, and while the Overlapping Cross-Fusion Attention Block (OCFAB) improves efficiency, its MAC complexity grows quadratically with the number of pixels, limiting high-resolution applicability. ATDSR [23] uses a token dictionary for adaptive optimization, but its grouping strategy adds extra parameters, increasing computational overhead.
In recent years, diffusion models have demonstrated remarkable performance in image super-resolution tasks [24,25,26]. These models generate high-quality, high-resolution images through iterative denoising processes and exhibit strong capabilities in detail restoration. For example, SR3 [27] enhances image fidelity by conditioning the generation process on low-resolution inputs, while IDM [28] enables continuous-resolution outputs through scale-aware modeling, offering greater flexibility. Such models are widely applied in domains requiring high image fidelity, such as medical imaging and remote sensing.
However, the low inference efficiency of diffusion probabilistic models (DPMs) remains a significant barrier to their practical deployment (see Figure 1). Due to their iterative generation nature, even with optimization techniques, these models struggle to meet the computational demands of real-time inference, which is particularly problematic in time-sensitive applications such as autonomous driving and large-scale image analysis.
To address this issue, numerous studies have proposed acceleration strategies. For example, LDM [29] accelerates inference by denoising in a latent space and adopting a scheduler such as Denoising Diffusion Implicit Models (DDIM) [30] to reduce the number of sampling steps. However, the inherent loss of high-frequency information during latent space compression fundamentally limits the pixel-level detail restoration capability of latent diffusion models (LDMs), as confirmed in prior research. Rombach et al. [29] explicitly stated that autoencoder-based compression “sacrifices high-frequency details”: when the downsampling factor of the autoencoder is large, the perceptual compression stage removes a significant amount of high-frequency information, leading to detail loss in high-resolution synthesis. The work in [31] proposed a sampling space mixture of experts to extend diffusion models and utilized a frequency compensation decoder to enhance details in super-resolution images; however, it did not fully address the distortion caused by latent space compression. LSRNA [32] effectively mitigates the manifold deviations caused by direct upsampling methods (e.g., bicubic) by learning a mapping from low-resolution latent representations to the high-resolution manifold, and it maintains strong performance even with only 30 denoising steps. However, the optimal strength of RNA for detail enhancement is not fixed and requires adaptive adjustment based on the specific method and noise schedule. Other works aim to optimize the time step schedules of numerical ODE solvers [33] within the diffusion process to improve both computational efficiency and generative quality. BlockDance [34] identifies and caches structurally stable spatiotemporal features to enable fast inference with minimal quality degradation, and its enhanced variant, BlockDance-Ada [34], further leverages reinforcement learning to dynamically allocate computation based on instance-specific policies, achieving a better quality–speed trade-off. DeepCache [35] introduces a training-free general acceleration paradigm by exploiting temporal redundancy in high-level features across adjacent denoising steps. Despite the efficiency gains achieved by these methods, they still face several limitations, such as increased strategy complexity, quality degradation caused by latent space compression, or optimization restricted solely to the inference stage.
Compared to traditional sparse learning methods [36], which typically rely on iterative pruning/retraining or dynamic architecture adjustments, we directly adopt the sparse momentum method [37]. This approach maintains structural sparsity throughout the entire training process, leveraging momentum information to guide weight pruning and regeneration. As a result, it significantly reduces computational costs while maintaining model performance.
In this study, we further analyze feature dynamics in DM-based SR tasks and observe that feature variations in the encoder are relatively stable, while those in the decoder are significantly more volatile. Inspired by fast diffusion modeling methods [38], we use a state reuse mechanism, where time steps with substantial feature changes are designated as key time steps and the others as non-key time steps. By skipping selected non-key time steps, we achieve faster inference. Compared to the LDM-based [29] SR method, our approach leverages past states as perturbations to improve texture information and enhance detail recovery, without sacrificing speed. Furthermore, by processing multiple time steps in a single iteration, it reduces feature redundancy and accelerates the process. Additionally, we adopt a sparse momentum strategy that maintains structural sparsity during training to reduce computational complexity. Unlike traditional sparsification methods that compress pretrained dense models, our approach preserves sparsity throughout training, making it natively compatible with sparse inference frameworks and offering a superior balance between performance and efficiency. We summarize the contributions of our work as follows.
We develop a model named SMFDM that rapidly generates high-fidelity facial images through an encoder state reuse mechanism. During the sampling phase, SMFDM divides all the time steps in advance into key time steps and non-key time steps. The U-Net encoder is executed only at the key time steps, and its features are reused at the non-key time steps, which eliminates a large number of computations, enhances the computational efficiency of the U-Net architecture, and accelerates convergence while preserving image details.
We also introduce a sparse momentum method for dynamic weight sparsity management during training. Starting with random weight pruning, we assess the importance of remaining weights layer by layer using momentum means, reassigning them accordingly. We prioritize the growth or recovery of weights that most contribute to error reduction, guided by the momentum of zero-value weights within the same layer. Through iterative cycles of magnitude pruning and momentum-driven regrowth, we optimize the weight distribution, preserving critical weights while maintaining overall network sparsity. As a result, our model reduces computational demands without sacrificing feature representation, effectively cutting costs while handling complex image features. This dual strategy enhances both efficiency and model performance.
Experimental results on benchmark datasets show that SMFDM outperforms existing super-resolution methods in both speed and image quality. Specifically, SMFDM accelerates processing time by 71.04% compared to existing methods, while maintaining high-fidelity output with exceptional detail preservation.
The rest of this paper is structured as follows. Section 2 describes the proposed Sparse Momentum-based Faster Diffusion Model. Section 3 describes the experiments and results on super-resolution. Section 4 and Section 5 provide a brief discussion and conclusion, respectively.
2. Proposed Methods
2.1. Problem Formulation
Given a low-resolution facial image $x$, the goal is to reconstruct a high-resolution image $x_0$ at $s$ times the input resolution (where $s$ is the scaling factor). The diffusion model achieves this through a bidirectional process. The forward diffusion process gradually adds Gaussian noise to the original image $x_0$ to form a Markov chain:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\big),$$

where $\beta_t$ controls the noise intensity, and the reverse denoising process learns the mapping

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 \mathbf{I}\big).$$

This reverse process, via a parameterized network $\epsilon_\theta$ (we employ a U-Net architecture here), predicts the original signal based on the noisy image $x_t$ and the time step $t$. Starting from $x_T \sim \mathcal{N}(0, \mathbf{I})$, the model generates $x_0$ through $T$-step iterative refinement. However, standard diffusion models suffer from high computational costs due to repeated encoder evaluations at each step. To address this, our SMFDM method presents two key improvements: First, a sparse momentum optimization strategy is employed to dynamically maintain the sparse topology of the U-Net during the training phase (retaining 15% weight density), where a pruning–regeneration algorithm (Algorithm 1) ensures the retention of critical connections. Second, an encoder state reuse mechanism is introduced to cache intermediate features, thereby reducing redundant computations (Algorithm 2).
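For concreteness, the stepwise forward noising above can be written in a few lines of PyTorch. This is a minimal illustrative sketch: the linear beta schedule and the tensor shapes are assumptions for the example, not values used by SMFDM.

```python
import torch

def forward_diffusion_step(x_prev, beta_t):
    """One Markov-chain step of the forward process:
    x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise."""
    noise = torch.randn_like(x_prev)
    return torch.sqrt(1.0 - beta_t) * x_prev + torch.sqrt(beta_t) * noise

# Illustrative linear noise schedule (values are assumptions, not the paper's).
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)

x = torch.randn(1, 3, 64, 64)  # stand-in for a normalized HR face image x_0
for t in range(T):
    x = forward_diffusion_step(x, betas[t])
```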
Algorithm 1 Sparse Momentum Algorithm with Weight Redistribution

1: Input: Layers L, Momentum M, Weights W, Mask B, Prune rate p, Density d
2: Output: Sparse weights W, Mask B, Momentum M
3: for each layer l in L do
4:     Initialize W_l and B_l randomly according to the density d
5:     ApplyMask(W_l, B_l)
6: end for
7: for each epoch do
8:     for each layer l in L do    ▹ Redistribute excess regenerated weights
9:         R_l ← set of weights planned for regeneration in layer l
10:        A_l ← set of available (zero-valued) weights in layer l
11:        r_l ← |R_l|, a_l ← |A_l|    ▹ Compute counts
12:        if r_l > a_l then
13:            surplus ← r_l − a_l
14:            Redistribute the surplus equally among the remaining layers
15:        end if
16:    end for
17:    for each batch b do
18:        gradients ← ComputeGradients(W, FetchBatch(b))
19:        UpdateMomentum(M, gradients), UpdateWeights(W, M)
20:        for each layer l in L do
21:            ApplyMask(W_l, B_l)
22:        end for
23:    end for
24:    totalMomentum ← CalcTotalMomentum(M), totalPruned ← CalcPruningQuota(W, p)
25:    for each layer l in L do
26:        momentumRatio ← CalcLayerMomentumRatio(M_l, B_l, totalMomentum)
27:        PruneLowWeights(W_l, B_l, p), RegrowByMomentum(M_l, B_l, momentumRatio · totalPruned)
28:        AnnealRate(p), ApplyMask(W_l, B_l)
29:    end for
30: end for
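The following PyTorch-style sketch illustrates the core of Algorithm 1 under simplifying assumptions: pruning is done by weight magnitude, regrowth by accumulated momentum, and the inter-layer redistribution of surplus quotas (lines 8-16) is omitted. Function and variable names (`init_masks`, `prune_and_regrow`, `density`) are ours, not the paper's.

```python
import torch

def init_masks(params, density=0.15):
    """Randomly keep `density` of the weights in each layer (Algorithm 1, lines 3-6)."""
    masks = {}
    for name, w in params.items():
        mask = (torch.rand_like(w) < density).float()
        w.data.mul_(mask)          # zero out the pruned connections
        masks[name] = mask
    return masks

def prune_and_regrow(params, masks, momenta, prune_rate=0.5):
    """Epoch-end update: prune low-magnitude active weights, regrow zero-valued
    weights with the largest accumulated momentum (Algorithm 1, lines 24-29)."""
    for name, w in params.items():
        mask, mom = masks[name], momenta[name]
        active = mask.bool()
        n_prune = int(prune_rate * active.sum().item())
        if n_prune == 0:
            continue
        # (b) prune: drop the smallest-magnitude active weights
        act_vals = torch.where(active, w.abs(), torch.full_like(w, float("inf")))
        drop = torch.topk(act_vals.flatten(), n_prune, largest=False).indices
        mask.view(-1)[drop] = 0.0
        # (c) regrow: re-activate inactive weights with the largest momentum
        inact_mom = torch.where(mask.bool(), torch.full_like(mom, -float("inf")), mom.abs())
        grow = torch.topk(inact_mom.flatten(), n_prune, largest=True).indices
        mask.view(-1)[grow] = 1.0
        w.data.mul_(mask)
```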
Algorithm 2 SMFDM Inference with Encoder State Reuse

Require: Low-resolution image x, diffusion step T, reuse interval k
Ensure: Super-resolved image x_0
1: x_T ← sample from N(0, I)    ▹ start from pure Gaussian noise
2: x_lr ← EDSR(x)    ▹ initial feature map of the low-resolution image
3: for each key time step t (from T downward, one per period of k steps) do
4:     if t < T then
5:         Update the data state by incorporating the past state (Section 2.4)
6:     end if
7:     F_0, G_0 ← input features built from x_t and x_lr
8:     for i = 1 to N do
9:         F_i ← Down(F_{i−1})
10:        G_i ← Conv(G_{i−1})
11:        Replicate F_i and G_i k times
12:    end for
13:    Upsample the encoder features and fuse them with the corresponding G_i using MLP-predicted coefficients
14:    Feed the k fused feature maps to the sparse decoder
15:    Obtain k noise predictions, one for each time step in the current period
16:    for j = 0 to k − 1 do
17:        x_{t−j−1} ← DenoiseStep(x_{t−j}, predicted noise for step t − j)
18:    end for
19: end for
20: return x_0
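A schematic Python sketch of the reuse-interval loop in Algorithm 2 is given below. It simplifies the procedure to one denoising step per iteration, treats every k-th step as a key step (assuming T is divisible by k and that the noise state and the LR feature share spatial dimensions), and uses placeholder callables `encoder`, `decoder`, and `denoise_step` in place of the actual modules.

```python
import torch

@torch.no_grad()
def smfdm_inference(x_lr_feat, encoder, decoder, denoise_step, T=1000, k=5):
    """Run the reverse process, re-encoding only once per period of k steps."""
    x = torch.randn_like(x_lr_feat)              # x_T: start from pure noise
    cached_feats = None
    for t in range(T, 0, -1):
        if t % k == 0 or cached_feats is None:   # key time step: full encoder pass
            cached_feats = encoder(torch.cat([x, x_lr_feat], dim=1))
        # the decoder consumes the (possibly cached) encoder state at every step
        eps = decoder(x, cached_feats, t)
        x = denoise_step(x, eps, t)
    return x
```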
2.2. Overview
As shown in Figure 2 and Figure 3, SMFDM accelerates DM-based SR through two synergistic components.
Sparse Momentum: This component dynamically prunes 85% of the model weights during training using momentum-guided sparsity, maintaining performance with only 15% of the weights (Section 2.3).
Encoder State Reuse: This component groups the T diffusion steps into periods of size k. Only the first step (the key step) in each period runs the full encoder; the subsequent k − 1 steps (non-key steps) reuse the encoder features from the key step (Section 2.4).
Specifically, Figure 2a illustrates the sparse encoder encoding features at time step T for reuse over the subsequent k time steps. The resulting feature maps are ordered from left to right according to their corresponding time steps. Figure 2b shows the decoder leveraging the cached encoder states at time step T to perform denoising, generating the intermediate outputs. Finally, Figure 2c presents an overview of the complete T-step denoising process, which is divided into segments. Each segment begins with a full encoding step followed by k decoding steps. Dashed lines in the figure indicate redundant operations that are omitted, effectively reducing the total number of encoding operations.
To simplify the description, we hereinafter omit all subscripts, including the time step subscript t, for brevity and clarity. The U-Net backbone employs N downsampling/upsampling stages, starting with a base number of channels in the first stage and doubling the channel count at each downsampling step, following the design in IDM [28]. The encoder utilizes residual blocks with self-attention for downsampling (Equation (3)), while the decoder processes the fused features through sparse convolutional layers (Equation (7)) to produce the denoised outputs.
To extract hierarchical features from the low-resolution input image $x$, we adopt an improved EDSR network as our backbone feature extractor. This model optimizes the residual network architecture by removing the BN layers and replaces the original sub-pixel convolution upsampling module with bilinear interpolation, thus obtaining the initial feature map $x_{\mathrm{lr}}$. Concurrently, a noise image $x_T$ sampled from a Gaussian distribution is generated to align with the target spatial dimensions. Next, the feature obtained by concatenating $x_T$ and $x_{\mathrm{lr}}$, along with the original $x_{\mathrm{lr}}$, is fed into the sparse encoder to generate multi-scale feature maps $F_i$ and $G_i$, where $i = 1, \ldots, N$. At each stage, the intermediate feature $F_i$ is obtained by downsampling the previous stage's feature $F_{i-1}$:

$$F_i = \mathrm{Down}(F_{i-1}),$$

where Down consists of two residual blocks with self-attention and a downsampling module. Each residual block includes a 3 × 3 convolutional layer, a normalization layer, an activation function, and optionally a self-attention mechanism. The downsampling module consists of a 3 × 3 convolutional layer with a stride of 2 and padding of 1, which halves the spatial resolution of the feature map. Similarly, $G_i$ is produced by applying a downsampling operation to $G_{i-1}$:

$$G_i = \mathrm{Conv}(G_{i-1}),$$

where Conv includes a 3 × 3 convolutional layer (with stride 1 and padding 1), a bilinear downsampling operation, and a leaky ReLU activation function.
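The two downsampling paths can be sketched in PyTorch as follows. The residual block layout, the GroupNorm/SiLU choices, and the omission of the optional self-attention are assumptions made to keep the example short; only the kernel sizes, strides, and the bilinear downsampling with leaky ReLU follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """3x3 conv + normalization + activation, as in the encoder's residual blocks."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)
        self.norm = nn.GroupNorm(8, ch)   # normalization choice is an assumption
        self.act = nn.SiLU()

    def forward(self, x):
        return x + self.act(self.norm(self.conv(x)))

class Down(nn.Module):
    """Two residual blocks followed by a stride-2 3x3 conv that halves resolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.blocks = nn.Sequential(ResBlock(in_ch), ResBlock(in_ch))
        self.down = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)

    def forward(self, x):
        return self.down(self.blocks(x))

class ConvDown(nn.Module):
    """3x3 conv (stride 1), bilinear downsampling, and leaky ReLU for the second stream."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1)

    def forward(self, x):
        x = self.conv(x)
        x = F.interpolate(x, scale_factor=0.5, mode="bilinear", align_corners=False)
        return F.leaky_relu(x, 0.2)
```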
The feature maps $F_i$ and $G_i$ are reused k times to generate k sets of features. In each reuse iteration, $F_i$ is upsampled to obtain the feature map $\mathrm{Up}(F_i)$, which is then fused with the corresponding $G_i$. The scaling factor $s$ is passed through a five-layer multilayer perceptron (MLP) with an input dimension of 66 and an output dimension of 3. This network consists of 4 hidden layers with 256 units each, all activated by ReLU functions, followed by a linear layer that produces the final output. The outputs generated by the MLP are then used to determine the fusion ratio. After normalization, the fusion coefficients $\alpha$ and $\beta$ are computed, and the fused feature map $F_{\mathrm{fused}}$ is calculated as follows:

$$F_{\mathrm{fused}} = \alpha \cdot \mathrm{Up}(F_i) + \beta \cdot G_i.$$
Each fused feature map is then processed by a sparse decoder network to produce an upsampled feature map corresponding to the current time step. This procedure is repeated k times, resulting in k upsampled feature maps. Finally, an iterative denoising process is applied to progressively refine the results until the final high-resolution image is reconstructed.
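A possible implementation of the scale-conditioned MLP and the fusion step is sketched below. The input dimension of 66, the four hidden layers of 256 ReLU units, and the output dimension of 3 follow the text; how the scale factor is embedded into 66 dimensions and how two of the three outputs are normalized into fusion coefficients (softmax here) are assumptions.

```python
import torch
import torch.nn as nn

class ScaleMLP(nn.Module):
    """Five-layer MLP: input dim 66, four hidden layers of 256 units (ReLU), output dim 3."""
    def __init__(self, in_dim=66, hidden=256, out_dim=3):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(4):
            layers += [nn.Linear(d, hidden), nn.ReLU()]
            d = hidden
        layers.append(nn.Linear(d, out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, scale_embedding):
        return self.net(scale_embedding)

def fuse(feat_up, feat_skip, mlp_out):
    """Fuse the upsampled feature with the corresponding skip feature using
    softmax-normalized coefficients from the MLP output (normalization scheme assumed)."""
    coeffs = torch.softmax(mlp_out[:, :2], dim=1)   # two normalized fusion weights
    a = coeffs[:, 0].view(-1, 1, 1, 1)
    b = coeffs[:, 1].view(-1, 1, 1, 1)
    return a * feat_up + b * feat_skip
```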
2.3. Sparse Momentum
As illustrated in Figure 3, the sparse momentum mechanism optimizes the U-Net parameters via dynamic pruning during training. The term "sparse encoder" refers to an encoder that undergoes iterative weight pruning and regeneration during training (indicated by dashed lines). Dashed arrows represent the momentum-based weight update cycles during training as well as diffusion steps omitted for clarity. The term "sparse decoder" refers to the decoder module, which is also trained using the sparse momentum mechanism. Specifically, when the U-Net performs noise prediction on the noisy input image $x_t$ (blue arrows denote forward information flow), the network outputs the predicted noise $\epsilon_\theta(x_t, t)$. We introduce a sparse momentum-driven masking mechanism to balance network connectivity and performance under a predefined sparsity constraint. This mechanism maintains a fixed sparsity level throughout the training process, enabling the model to achieve performance comparable to that of dense models. Specifically, we apply the sparsity constraint independently to each layer to construct an initial sparse topology, retaining only 15% of the weights while randomly pruning the remaining 85%. The mask is initialized as follows:

$$B_l^{(j)} = \begin{cases} 1, & j \in S_l, \\ 0, & \text{otherwise}, \end{cases}$$

where $B_l$ has the same shape as the corresponding weight $W_l$, a value of 1 indicates that the connection is active, $S_l$ is a randomly selected index set covering a fraction $d$ of the weights in layer $l$, and the density $d = 0.15$ indicates that 15% of the weights are retained, corresponding to 85% sparsity. To guide the adaptive evolution of the sparse structure during training, we maintain an exponentially smoothed momentum for each layer, and its update formula is

$$M_l^{(i)} = \mu\, M_l^{(i-1)} + (1 - \mu)\, \nabla_{W_l} E^{(i)},$$

where $\mu$ is a smoothing factor that controls the influence of historical gradient information, $M_l^{(i)}$ represents the momentum corresponding to the weights $W_l$ of the $l$-th layer at the $i$-th iteration, and $E^{(i)}$ represents the value of the error function at the $i$-th iteration, measuring the prediction error of the model after the $i$-th parameter update. Specifically, $M_l$ stores the historical gradient information of the weights in this layer so that the weights are updated more stably during training. The momentum captures long-term gradient contributions and stabilizes weight selection in the presence of noisy updates.
At the end of each training epoch, we apply a sparse momentum update process consisting of three steps: (a) inter-layer weight redistribution, (b) intra-layer pruning, and (c) momentum-guided weight regeneration. In step (a), we calculate the average absolute momentum of the active weights in each layer and normalize it across layers to obtain the momentum contribution ratio $r_l$, which determines how the weight regeneration quota is distributed in the next step. Specifically, we first calculate the average absolute momentum of the active weights in each layer:

$$\bar{M}_l = \frac{\sum_{j} B_l^{(j)}\, \big|M_l^{(j)}\big|}{\sum_{j} B_l^{(j)}},$$

where $B_l^{(j)}$ represents the mask value corresponding to the $j$-th weight in the $l$-th layer, and $M_l^{(j)}$ is the momentum of the $j$-th weight in the $l$-th layer. Then, the contribution ratio $r_l$ of each layer is calculated as $r_l = \bar{M}_l / \sum_{l'} \bar{M}_{l'}$, where the denominator is the sum of $\bar{M}_{l'}$ over all layers.
In step (b), we sort the currently active weights in each layer according to their momentum magnitudes and set the mask values of the 50% of the weights with the lowest momentum to zero for pruning. For layers approaching the sparsity limit, the unutilized pruning quota is redistributed to other layers to maintain global sparsity. In step (c), a fixed number of pruned weights are re-activated in each layer according to the contribution ratio $r_l$; that is, we select the zero-valued weights with the highest accumulated momentum. If the number of available weights for selection in a certain layer is insufficient, the surplus number of weight regenerations is redistributed to other layers to ensure the efficient utilization of structural resources.
Specifically, if the planned number of regenerated weights for a layer exceeds the available capacity of that layer, we redistribute the surplus regeneration quota to other layers. This is determined by comparing the planned regeneration number with the number of available weights in each layer, which varies depending on the type of the layer. If the planned number exceeds the available count, the surplus regeneration weights are redistributed equally to the other layers, ensuring the overall efficiency of the sparse structure while maintaining the sparsity constraint. The details of the training procedure of sparse momentum are shown in Algorithm 1.
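The redistribution of surplus regeneration quotas can be expressed as a small helper; the equal-split policy follows the description above, while the dictionary-based interface and the example numbers are illustrative only.

```python
def redistribute_regrowth(planned, capacity):
    """Cap each layer's regrowth quota at its number of available (zero-valued) weights
    and spread any surplus equally over the layers that still have room."""
    quotas = dict(planned)
    surplus = 0
    for layer, n in planned.items():
        if n > capacity[layer]:
            surplus += n - capacity[layer]
            quotas[layer] = capacity[layer]
    while surplus > 0:
        open_layers = [l for l in quotas if quotas[l] < capacity[l]]
        if not open_layers:
            break                      # no remaining capacity anywhere
        share = max(1, surplus // len(open_layers))
        for l in open_layers:
            add = min(share, capacity[l] - quotas[l], surplus)
            quotas[l] += add
            surplus -= add
            if surplus == 0:
                break
    return quotas

# Example: layer "dec1" wants 120 regrown weights but only 80 slots are free,
# so the surplus of 40 is shifted to "enc1".
print(redistribute_regrowth({"enc1": 50, "dec1": 120}, {"enc1": 200, "dec1": 80}))
```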
As shown in Figure 2, through repeated pruning (in red) and regeneration (in green) iterations, the model gradually evolves from a fully connected network to a sparse yet expressive topology. To stabilize training in its later stages, we adopt a cosine annealing schedule to gradually reduce the pruning rate. This mechanism enables our model to achieve reconstruction performance comparable to that of dense models under high sparsity, improving both the training efficiency and the scalability of the model. By applying this sparse learning mechanism, we obtain an efficient and effective network structure whose performance is comparable to that of its dense counterpart.
2.4. Encoder State Reuse
To accelerate inference without compromising reconstruction quality, we use an encoder state reuse framework for the super-resolution task. Specifically, the reverse diffusion time step sequence is divided into multiple periods, each containing k time steps. The first time step in each period (i.e., the largest time index) is designated as the key time step, at which the encoder performs a full forward pass to extract features. The remaining steps within the same period are defined as non-key time steps, at which the encoder state computed at the key time step is reused to avoid redundant computation.
For example, the first period includes the key time step T and the non-key time steps T − 1, …, T − k + 1. The time step set is partitioned into key and non-key parts whose union covers all T steps. During inference, the entire diffusion process of T steps is split into T/k effective steps, each corresponding to k consecutive original time steps processed together in one iteration. This state reuse mechanism leverages the strong temporal correlation of encoder features between adjacent steps: features computed at key time steps are reused in subsequent non-key steps, significantly reducing computation while preserving reconstruction quality.
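A minimal sketch of this partitioning, assuming T is divisible by k and that the largest index in each period serves as the key step:

```python
def partition_timesteps(T, k):
    """Split the reverse-time sequence T, T-1, ..., 1 into periods of length k;
    the first (largest) step of each period is a key step, the rest are non-key."""
    key_steps, non_key_steps = [], []
    for t in range(T, 0, -k):
        period = list(range(t, max(t - k, 0), -1))
        key_steps.append(period[0])
        non_key_steps.extend(period[1:])
    return key_steps, non_key_steps

# With T = 12 and k = 4: key steps [12, 8, 4], non-key steps [11, 10, 9, 7, 6, 5, 3, 2, 1].
keys, non_keys = partition_timesteps(12, 4)
```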
When applying the encoder propagation strategy, for non-key time steps the decoder does not receive newly computed encoder features but instead reuses the encoder state from the most recent key time step. This design enables efficient reuse and parallelization, thereby reducing the computational cost of generating high-resolution images. The encoder reuse strategy is formulated as follows:

$$F_t = \mathrm{Reuse}\big(F_{t_{\mathrm{key}}}\big), \qquad t \in \mathcal{T}_{\mathrm{non\text{-}key}},$$

where Reuse(·) duplicates the feature maps of the key time step $t_{\mathrm{key}}$ and assigns them to each non-key time step $t$ as their feature maps. To enable multi-step denoising within a single iteration, we perform k replication operations on the features extracted by the U-Net encoder and stack these replicated features along the channel dimension. This results in k identical feature maps being provided to the decoder as input. Leveraging these enhanced features, the decoder is capable of simultaneously predicting noise estimates corresponding to k distinct time steps. This design maximally reuses encoder-side information, thereby enriching the temporal context available for denoising.
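The replication and channel-wise stacking of the cached encoder features can be sketched as follows; the list-of-feature-maps interface and the example shapes are assumptions for illustration.

```python
import torch

def reuse_encoder_features(encoder_feats, k):
    """Replicate the encoder feature maps of the key time step k times and stack the
    copies along the channel dimension, so one decoder pass covers k time steps."""
    return [torch.cat([f] * k, dim=1) for f in encoder_feats]

# Example: two feature maps with 64 and 128 channels, reused for k = 5 steps.
feats = [torch.randn(1, 64, 32, 32), torch.randn(1, 128, 16, 16)]
stacked = reuse_encoder_features(feats, k=5)   # channels become 320 and 640
```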
After the sampling phase, we enter the denoising phase. In this phase, the model starts from a pure-noise image and iteratively refines it using the information obtained from the sampling phase. At each iteration, the decoder outputs k noise predictions, which are used to sequentially refine the noisy image across k time steps:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z, \qquad t \in \mathcal{T}_k,$$

where $\mathcal{T}_k$ includes a period of key and non-key time steps in a fixed-length window of size k; $\{\epsilon_t\}_{t \in \mathcal{T}_k}$ represents the set of predicted noise components corresponding to each time step in $\mathcal{T}_k$; $\epsilon_\theta$ is a noise prediction function parameterized by $\theta$, which outputs $\epsilon_t$ given the input $x_t$ and the time embedding of $t$; $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$; $\sigma_t$ denotes the predefined noise scale (i.e., standard deviation) at time step t; and $z \sim \mathcal{N}(0, \mathbf{I})$ is the standard Gaussian noise used in the stochastic reverse process.
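The sequential application of the k predicted noises can be sketched with the standard ancestral update; the precomputed schedule tensors (`alphas`, `alphas_bar`, `sigmas`) and the list-based interface are assumptions of this example.

```python
import torch

def refine_k_steps(x_t, eps_preds, timesteps, alphas, alphas_bar, sigmas):
    """Apply k predicted noises sequentially with the standard ancestral update:
    x_{t-1} = (x_t - (1 - a_t)/sqrt(1 - abar_t) * eps_t) / sqrt(a_t) + sigma_t * z."""
    x = x_t
    for eps, t in zip(eps_preds, timesteps):        # timesteps in descending order
        a_t, abar_t, sigma_t = alphas[t], alphas_bar[t], sigmas[t]
        mean = (x - (1.0 - a_t) / torch.sqrt(1.0 - abar_t) * eps) / torch.sqrt(a_t)
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = mean + sigma_t * z
    return x
```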
To mitigate the potential detail loss caused by encoder state reuse, we incorporate current and past data states to preserve information during the early stages. Specifically, we update the data state at time t as follows:

$$\hat{x}_t = \mathrm{Concat}\big(x_t,\ \epsilon \cdot x_{\mathrm{lr}}\big),$$

where Concat(·) refers to the concatenation of feature maps along the channel dimension, $x_t$ is the latent data state, $x_{\mathrm{lr}}$ is a feature map extracted from a shallow EDSR network, and $\epsilon$ is a small positive scalar (set to 0.003) that controls the influence of $x_{\mathrm{lr}}$. This strategy preserves fine details in earlier steps and prevents information from being prematurely lost during denoising. The details of the inference procedure of our state reuse are shown in Algorithm 2.
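Since the exact combination rule is only partially specified here, the sketch below simply concatenates the current latent state with the EDSR feature scaled by ε = 0.003; treat it as an illustration rather than the exact formulation.

```python
import torch

def augment_state(x_t, edsr_feat, eps=0.003):
    """Concatenate the current latent state with a lightly weighted shallow EDSR
    feature along the channel dimension (the weighting rule is an assumption)."""
    return torch.cat([x_t, eps * edsr_feat], dim=1)

# Example shapes: latent state (1, 3, 128, 128), EDSR feature (1, 64, 128, 128).
x_aug = augment_state(torch.randn(1, 3, 128, 128), torch.randn(1, 64, 128, 128))
```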
4. Discussion
This study proposes SMFDM, which integrates a sparse momentum mechanism with an encoder state reuse strategy to effectively improve both inference speed and reconstruction quality in diffusion-based image super-resolution tasks. While numerous studies have demonstrated the powerful capabilities of diffusion models in detail restoration and image enhancement [24,25,26], traditional diffusion models generally suffer from substantial computational resource demands and slow inference due to their iterative nature. Prior works, such as [27,28], have mainly focused on designing increasingly complex network architectures to enhance performance, often overlooking computational efficiency. In contrast, SMFDM achieves a more balanced trade-off between quality and speed by introducing sparse momentum and selective encoder state reuse, thus significantly enhancing the practical applicability of diffusion models.
Specifically, the encoder state reuse mechanism in SMFDM significantly reduces redundant computation within the U-Net architecture, promoting faster model convergence while ensuring the accurate restoration of image details and improved inference efficiency. This strategy effectively avoids the heavy trade-offs between computational load and network complexity seen in prior studies. Furthermore, the sparse momentum mechanism greatly reduces computational overhead while preserving strong representational capacity, achieving a well-balanced compromise between performance and efficiency.
Notably, our comparative experiments against DDIM clearly demonstrate the significant acceleration benefits achieved by SMFDM, further validating the practical contribution of the proposed method. Moreover, as discussed, the encoder state reuse strategy is not limited to high-dimensional pixel-space processing but can also be applied to latent-space diffusion models such as LDM [29]. Applying state reuse in low-dimensional latent spaces may further reduce computational complexity and expand the applicability of SMFDM to a broader range of diffusion-based vision tasks.
Despite SMFDM’s strong performance across multiple benchmark datasets, further validation is needed to assess its robustness under more extreme conditions, such as ultra-low-resolution or high-noise scenarios. As model scales continue to increase, maintaining the computational efficiency of the sparse momentum mechanism also remains a critical challenge for future research. Additionally, exploring the applicability of both sparse momentum and encoder state reuse in other diffusion-based visual tasks may further accelerate the generation process and unlock broader application potential.