1. Introduction
Single-image super-resolution (SISR) aims to reconstruct a high-resolution (HR) image from a low-resolution (LR) observation and has been widely applied in video streaming and broadcasting, satellite imaging, and medical imaging. In scenarios where the quality of the original data is inherently limited, super-resolution (SR) plays a critical role in recovering important visual details or generating visually plausible HR outputs. The SISR problem is inherently ill-posed, as multiple plausible HR solutions may correspond to the same LR input. Consequently, the characteristics of the reconstructed results largely depend on the restoration objective adopted by the model.
Deep learning-based super-resolution has rapidly evolved, primarily driven by convolutional neural network (CNN) architectures. SRCNN [1] demonstrated the feasibility of learning-based SR using a relatively simple three-layer network. Subsequently, deeper and more expressive models such as EDSR [2] and RCAN [3] significantly improved distortion-oriented performance, particularly in terms of PSNR.
More recently, transformer architectures have introduced another turning point in SR research by effectively modeling long-range dependencies and global context. For instance, SwinIR [4] combines window-based self-attention with hierarchical representation learning, enabling simultaneous modeling of local details and global structures. Compared to CNN-based approaches, transformer-based models have achieved superior PSNR performance and improved generalization across multiple benchmarks.
These CNN- and transformer-based SR models are typically trained by minimizing pixel-wise reconstruction errors, such as L1 or MSE losses, with respect to ground-truth (GT) images. As a result, they are strongly aligned with distortion minimization objectives. However, high-frequency components such as textures and edges inherently involve ambiguity, as they cannot be uniquely determined from the LR observation alone. In such cases, distortion-driven optimization tends to favor statistically averaged solutions. This phenomenon reflects the well-known perception–distortion trade-off in SR: optimizing for distortion-oriented metrics does not necessarily lead to improved perceptual realism or texture fidelity. Consequently, while quantitative metrics may improve, the reconstructed images often appear overly smooth and lack rich textures. This tendency becomes more pronounced in real-world SR scenarios, where degradations are complex and less well defined.
To address these limitations, SR research has evolved beyond distortion-oriented reconstruction toward directly improving visual realism and perceptual quality. GAN-based approaches, such as SRGAN [5], demonstrated that adversarial training can generate sharper and more realistic textures. Although such models often yield lower PSNR values, they tend to produce visually more natural results according to human perception.
However, GAN-based training suffers from instability and limited controllability over fine-grained textures, which restricts its practical applicability. These challenges become more pronounced in real-world SR scenarios, where degradations are complex and not explicitly defined. In such cases, achieving stable and high-quality reconstruction requires more robust generative modeling paradigms.
In this context, diffusion models have emerged as a powerful alternative for enhancing perceptual quality in SR [6]. By progressively denoising a noisy signal, diffusion models can generate or restore images in a stable manner while preserving rich texture details. Compared to GAN-based methods, diffusion-based SR models provide more stable training dynamics and superior high-frequency reconstruction, including fine textures and natural edge structures. Recent studies report strong perceptual performance not only under synthetic degradation benchmarks but also in real-world settings.
Despite these advantages, diffusion-based SR models incur substantial computational and memory costs due to their multi-step denoising process. The iterative sampling procedure significantly increases inference time, making real-time deployment or lightweight applications challenging.
To mitigate the slow inference of diffusion models, recent studies have actively explored knowledge distillation (KD) strategies that reduce the number of denoising steps. In particular, one-step or few-step distillation methods have been proposed to approximate the iterative denoising process of a diffusion teacher using a compact student model [7, 8]. These approaches significantly reduce the number of sampling steps while largely preserving perceptual quality.
However, in most cases, the student model remains within the diffusion paradigm. As a result, even with fewer denoising steps, the inherent architectural complexity and computational overhead of diffusion models persist. Consequently, their inference efficiency still lags behind that of transformer-based SR models.
In contrast, transformer-based SR models have benefited from homogeneous KD techniques [9, 10, 11], which focus on compressing models within the same architectural family. Such approaches improve structural efficiency and enable faster inference, making them suitable for real-time applications. Nevertheless, since both teacher and student share similar deterministic restoration biases, these methods do not explicitly incorporate the generative texture prior learned by diffusion models. Therefore, the perceptual advantages offered by diffusion-based SR cannot be fully transferred in homogeneous transformer distillation settings.
In summary, alleviating the perception–distortion trade-off under practical computational constraints requires a new design that directly combines the inference efficiency of transformers with the perceptual strengths of diffusion models. This motivates a heterogeneous distillation framework that bridges fundamentally different generative and deterministic paradigms.
From this perspective, homogeneous KD alone is insufficient to fully address the limitations of diffusion-based SR. Diffusion-to-diffusion KD can effectively transfer the perceptual prior learned by the teacher within the same generative paradigm. However, since the student must still retain the iterative denoising process and timestep-dependent generation mechanism, the structural overhead caused by repeated sampling cannot be fundamentally eliminated.
In contrast, transformer-to-transformer KD preserves the single-path restoration architecture, offering advantages in computational efficiency. Nevertheless, because both teacher and student share similar deterministic restoration biases, the generative texture prior and perceptual realism inherent in diffusion models cannot be sufficiently transferred.
These limitations become more pronounced in real-world SR scenarios, where degradations are complex and often unknown. In such cases, distortion minimization alone is insufficient to produce natural and perceptually realistic textures. Therefore, a heterogeneous KD framework that combines the perceptual strengths of diffusion models with the efficient single-pass inference of transformers is particularly necessary.
In this paper, we propose a heterogeneous KD framework that transfers knowledge from a diffusion teacher to a transformer student for super-resolution. The proposed framework leverages the rich high-frequency texture modeling and perceptual reconstruction capability learned by the diffusion model as teacher knowledge, and distills it into a computationally efficient transformer-based student. The goal is to enhance perceptual quality while maintaining inference costs comparable to standard transformer-based SR models. In other words, we treat the high-quality texture generation capability of diffusion models as a source of teacher knowledge and adopt an efficient single-pass transformer backbone as the student, seeking a practical balance between perceptual realism and computational efficiency.
Furthermore, we observe that conventional output-level alignment based solely on pixel-wise losses such as L1 or MSE may be insufficient to effectively transfer the high-frequency texture knowledge learned by diffusion teachers. To address this limitation, we introduce a frequency-aware distillation loss based on the Discrete Wavelet Transform (DWT). Specifically, we decompose the output difference between teacher and student into frequency subbands using a two-level DWT, resulting in seven subbands (LL2, LH2, HL2, HH2, LH, HL, HH). We then design a weighted decomposed DWT loss by grouping subbands according to their structural roles—global structure (LL2), intermediate details (LH2, HL2, HH2), and fine textures (LH, HL, HH)—and assigning different importance weights to each group.
This wavelet-based decomposition enables the separation of structural components, edges, and fine textures that are entangled in the pixel domain, facilitating more targeted knowledge transfer of the high-frequency texture components where diffusion models exhibit strong advantages. As a result, the proposed loss allows more refined control over the balance between distortion and perceptual quality.
In addition, we highlight that KD in SR exhibits different characteristics from that in classification tasks. In classification, teacher logits typically provide reliable supervisory signals that can be strongly followed by the student. In contrast, in SR, the most reliable high-frequency reference is the GT, rather than the teacher output. While the diffusion teacher may produce sharper results than distortion-oriented models, its outputs may also contain noise, artifacts, or hallucinated textures. This issue can become more pronounced in real-world SR scenarios with complex degradations.
Therefore, naively enforcing strong and simultaneous alignment with both the GT and teacher outputs may lead to suboptimal learning dynamics. The student may either be prematurely biased toward imperfect high-frequency signals from the teacher or fail to sufficiently absorb the perceptual prior due to dominance of the task loss. To mitigate this issue, we introduce a linear progressive weighting strategy for the distillation loss. While keeping the task loss (L1 with respect to GT) fixed, the distillation loss weight is gradually increased during training. This curriculum-like strategy allows the student backbone to first establish stable restoration capability before progressively incorporating high-frequency perceptual knowledge, thereby improving both training stability and knowledge transfer effectiveness.
The main contributions of this paper can be summarized as follows:
We present a heterogeneous diffusion-to-transformer knowledge distillation framework for super-resolution that transfers the perceptual prior learned by diffusion models into an efficient transformer-based student, aiming to alleviate the perception–distortion trade-off under practical computational constraints.
We introduce a frequency-group-aware decomposed DWT distillation loss, which decomposes teacher–student discrepancies into frequency subbands and assigns group-specific importance weights for more effective high-frequency knowledge transfer. In addition, we employ a curriculum-inspired progressive KD scheduling strategy to gradually enhance perceptual quality while maintaining training stability.
Through extensive quantitative and qualitative evaluations, we demonstrate that the proposed framework achieves a favorable balance among distortion, perceptual quality, and computational efficiency, improving the practical trade-off compared to existing approaches.
3. Methods
3.1. Preliminary
Given a low-resolution (LR) image $I_{LR} \in \mathbb{R}^{H \times W \times 3}$, a single-image super-resolution (SISR) model $F(\cdot; \theta)$ generates a super-resolved output as follows:
$$I_{SR} = F(I_{LR}; \theta) \in \mathbb{R}^{sH \times sW \times 3},$$
where $H$ and $W$ denote the height and width of the LR image, $s$ represents the upscaling factor, and $\theta$ denotes the learnable model parameters.
Let the teacher and student models be denoted as $F_T$ and $F_S$, respectively. For the same LR input $I_{LR}$, the super-resolved outputs of the two models are defined as follows:
$$I_{SR}^{T} = F_T(I_{LR}; \theta_T), \qquad I_{SR}^{S} = F_S(I_{LR}; \theta_S).$$
In super-resolution, the most commonly adopted distillation setup is response distillation, where the student model is trained not only to accurately reconstruct the ground truth $I_{HR}$, but also to mimic the teacher output $I_{SR}^{T}$. First, the reconstruction loss (task loss) between the ground truth and the student output is typically defined using the $L_1$ distance:
$$\mathcal{L}_{task} = \| I_{HR} - I_{SR}^{S} \|_1.$$
Next, the basic distillation loss (KD loss), which aims to reduce the discrepancy between the teacher and student outputs in the pixel domain, can likewise be formulated using the $L_1$ distance:
$$\mathcal{L}_{KD} = \| I_{SR}^{T} - I_{SR}^{S} \|_1.$$
Therefore, the overall distillation objective in conventional SR settings can be formulated as a weighted combination of the task loss and the KD loss:
$$\mathcal{L}_{total} = \mathcal{L}_{task} + \lambda \, \mathcal{L}_{KD}.$$
Here, $\lambda$ is a hyperparameter that controls the influence of the teacher signal. Building upon this basic formulation, many SR distillation methods extend the objective by incorporating additional components, such as feature-level distillation [22], multi-teacher learning [10], and contrastive objectives [11].
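The conventional response-distillation objective above can be illustrated with a few lines of plain Python. This is only a minimal sketch: images are nested lists of floats, the L1 distance is averaged over pixels (a common normalization, assumed here), and the names `l1_distance` and `response_kd_objective` are hypothetical helpers rather than part of any SR library.

```python
# Sketch of L_total = L_task + lam * L_KD with mean-reduced L1 terms.
# All function names and the toy 2x2 "images" are illustrative assumptions.

def l1_distance(a, b):
    """Mean absolute difference between two equally sized 2-D images."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            total += abs(x - y)
            count += 1
    return total / count


def response_kd_objective(student_out, teacher_out, ground_truth, lam):
    """Return (total, task, kd): task loss vs. GT plus lam-weighted KD loss vs. teacher."""
    task = l1_distance(ground_truth, student_out)
    kd = l1_distance(teacher_out, student_out)
    return task + lam * kd, task, kd


# Toy example: the student is further from the GT than from the teacher.
gt = [[0.0, 1.0], [1.0, 0.0]]
teacher = [[0.1, 0.9], [0.9, 0.1]]
student = [[0.2, 0.8], [0.8, 0.2]]

total, task, kd = response_kd_objective(student, teacher, gt, lam=0.5)
print(f"task={task:.3f} kd={kd:.3f} total={total:.3f}")
```

In a real training loop the same combination would be computed on tensors by an automatic-differentiation framework; the structure of the objective is unchanged.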
3.2. Problem Formulation
Recent diffusion-based SR models achieve high perceptual quality through iterative denoising processes. However, the multi-step denoising procedure requires substantial computational and memory resources, and the repeated sampling steps impose significant practical limitations in terms of inference speed.
To reduce this cost, prior works have reformulated the diffusion denoising process as a KD problem, accelerating inference via one-step [8] or few-step [7] approximations while largely preserving perceptual quality. Nevertheless, such approaches still incur higher computational overhead and slower inference compared to transformer-based SR models. Moreover, even when the diffusion denoiser (e.g., U-Net) is reduced in size, it tends to remain less efficient and less favorable in terms of speed–performance trade-offs compared to transformer-based architectures.
These observations indicate that both homogeneous KD within transformer-based SR and step-reduction KD within diffusion-based SR exhibit limitations when targeting lightweight or real-time deployment scenarios.
Therefore, as illustrated in Figure 1, we propose a heterogeneous KD framework that retains a perceptually strong diffusion teacher while adopting a transformer as the student model. The objective is to effectively transfer the perceptual prior of the diffusion teacher to the student while constraining inference cost to that of a single-pass transformer. During training, both the diffusion teacher output and the GT image are utilized to guide the student, aiming to alleviate the distortion–perception trade-off. During inference, however, only the transformer student is employed, producing high-quality results through a single forward pass and thereby significantly reducing inference latency.
3.3. Why Transformer as Student?
The objective of this work is to transfer the high perceptual quality and rich texture reconstruction capability of a diffusion-based teacher into an efficient student model that operates with a single forward pass. To this end, we adopt a transformer-based architecture as the student backbone.
This choice is motivated not only by the strong distortion-oriented performance of transformer-based SR models but also by several characteristics of the SR problem. First, super-resolution requires modeling long-range dependencies and maintaining structural consistency across the entire image, where global contextual reasoning plays a critical role. Second, the texture knowledge generated by diffusion teachers often exhibits directional and repetitive patterns that benefit from flexible relational modeling. Third, practical deployment scenarios impose strict latency and resource constraints, favoring architectures that provide high computational efficiency during inference.
In this section, we summarize the advantages of employing a transformer student from three perspectives: global context modeling capability, the representational flexibility of multi-head attention, and inference efficiency.
3.3.1. Global Receptive Field
Super-resolution is not merely a local interpolation problem from LR to HR, but requires generating plausible missing details while preserving global structural consistency across the image. Due to the ill-posed nature of SR, multiple HR solutions may correspond to the same LR observation. Without sufficiently modeling global structures—such as long-range edge continuity, periodic patterns, or symmetry—locally plausible textures may lead to globally inconsistent results, including structural distortions, phase misalignment in repetitive patterns, broken edges, or warped lines. Such issues become more pronounced in real-world SR scenarios, where degradations are complex and observable information is limited.
To effectively mimic the diffusion teacher’s outputs—which combine globally coherent structures with locally convincing textures—the student model must actively exploit global contextual information. Transformer-based students are well suited for this purpose, as self-attention directly models long-range dependencies among input tokens. This enables the preservation of global structural consistency while reconstructing fine local details.
3.3.2. Multi-Head Attention
The multi-head attention mechanism in transformers allows multiple attention heads to learn different interaction patterns in parallel, enriching the model’s representational capacity. In SR, important visual cues are diverse in nature, including directional edges and contours, repetitive textures and periodic patterns, as well as subtle noise and material-specific details.
Through multiple attention heads, the model can form distinct reference regions and interaction patterns within the input, offering greater flexibility than a single attention mechanism that treats all cues uniformly. This parallel modeling of heterogeneous visual relationships provides a potential advantage in capturing complex structures.
Meanwhile, diffusion-based teachers generate perceptually natural textures through progressive denoising. These textures are not merely amplified high-frequency signals, but often include directional, repetitive, and multi-scale correlation structures. From this perspective, the flexibility of multi-head attention may provide a suitable representational space for the student to absorb and approximate the structural characteristics of textures produced by the diffusion teacher. In this work, we hypothesize that multi-head attention plays a supportive role in facilitating the transfer of such perceptual texture knowledge.
3.3.3. Inference Time
As discussed earlier, transformers are practical candidates for student backbones when targeting improved inference efficiency compared to diffusion models. Diffusion teachers typically require iterative denoising steps, with U-Net-based computation performed at each step. Consequently, inference cost accumulates proportionally to the number of sampling steps. Even when the underlying network architecture remains unchanged, repeated execution dominates overall latency. Increasing the number of denoising steps to improve perceptual quality further amplifies this computational burden.
In contrast, transformer-based SR models usually perform restoration in a single forward pass. Although the per-pass computation may be substantial, the absence of iterative sampling provides a structural advantage in terms of latency. Single-pass inference also yields more predictable execution time, which simplifies system design and resource allocation in real-time or large-scale deployment scenarios.
Moreover, transformer operations are largely composed of matrix multiplications, making them amenable to hardware acceleration and parallelization. Practical optimization techniques, such as mixed-precision computation and kernel-level acceleration, can be readily applied. Therefore, if the goal is to preserve the perceptual strength of diffusion models while enabling practically deployable inference speed, compressing the teacher knowledge into a single-pass student architecture is essential. From this perspective, adopting a transformer student is justified not only in terms of restoration quality but also under time and resource constraints.
3.4. Loss Function Design
This section describes the design of our loss function, which aims to effectively transfer texture and high-frequency details from the diffusion teacher to the transformer student. In standard super-resolution training, the model is optimized by minimizing the pixel-wise difference between the generated image and the GT image, typically using an $L_1$ loss. Following this formulation, most knowledge distillation frameworks in SR also begin with a pixel-domain alignment, where an $L_1$ loss is applied between the teacher output and the student output.
However, directly applying an $L_1$-based KD loss is not well suited to our heterogeneous framework. As illustrated in Figure 2, using an $L_1$ loss for distillation tends to alter the overall color tone of the image and fails to adequately recover edge and fine structural information. This limitation arises from the fundamental differences in image generation mechanisms: diffusion models generate outputs through a stochastic denoising process, whereas transformer-based SR models perform deterministic pixel-level restoration. The resulting representational discrepancy between teacher and student makes simple pixel-wise alignment less effective for transferring perceptual texture knowledge.
To address this issue, we introduce a frequency-based distillation loss that preserves global structures while better capturing textures and fine details generated by the diffusion teacher. In super-resolution, it is important not only to distinguish which frequency components are lost, but also to consider where these components occur spatially.
Fourier-based representations effectively analyze global frequency distributions, but lack spatial localization, making them less suitable for directly modeling localized structures such as edges, corners, and repetitive textures. In contrast, wavelet transforms provide both spatial localization and frequency decomposition, making them better suited for modeling region-specific high-frequency characteristics in SR.
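The localization argument can be made concrete with a one-dimensional toy signal. The sketch below assumes the Haar wavelet (the text does not name a specific wavelet) and uses a hypothetical helper `haar_1d`. It applies one level of the orthonormal Haar transform and a plain discrete Fourier transform to the same step edge: the edge shows up in a single detail coefficient whose index marks its position, whereas it contributes to every Fourier coefficient.

```python
import cmath
import math


def haar_1d(signal):
    """One level of the orthonormal Haar transform on an even-length list."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


# A step edge between samples 2 and 3.
step = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]

# Wavelet view: only the detail coefficient covering the edge is non-zero.
approx, detail = haar_1d(step)
edge_positions = [i for i, d in enumerate(detail) if abs(d) > 1e-12]

# Fourier view: the same edge spreads over all frequency bins.
n = len(step)
dft = [
    sum(step[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
    for k in range(n)
]
nonzero_dft = sum(1 for c in dft if abs(c) > 1e-9)

print("non-zero Haar detail indices:", edge_positions)
print("non-zero DFT coefficients:", nonzero_dft, "of", n)
```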
This property aligns well with the characteristics of diffusion-based teacher outputs. The texture information generated by diffusion models can be interpreted not merely as global frequency amplification, but as locally structured and directionally correlated high-frequency residuals formed through a probabilistic restoration process. Discrete Wavelet Transform (DWT) preserves global structural information through the LL component while separating directional details through the LH, HL, and HH components. Therefore, DWT provides a structured representation that facilitates more effective transfer of perceptual textures from the diffusion teacher to the student.
Furthermore, the decomposition depth is closely related to the trade-off between representational granularity and training stability. A single-level decomposition may group high-frequency components too coarsely, failing to sufficiently separate intermediate details from fine textures. On the other hand, deeper decompositions (e.g., three levels or more) provide finer frequency separation but reduce the spatial resolution of each sub-band and increase the complexity of loss design, which may adversely affect restoration performance.
Considering the balance between frequency granularity, spatial interpretability, and training stability, we adopt a two-level wavelet decomposition in this work. Specifically, using a two-level wavelet decomposition, we decompose the image into seven sub-bands: LL2, LH2, HL2, HH2, LH, HL, HH. These sub-bands are further organized into three frequency groups according to their structural characteristics, corresponding to coarse structures, intermediate details, and fine textures. Group-wise weighted distillation is then performed based on this categorization.
The LL2 component represents the overall low-frequency structure of the image. The intermediate detail group consists of the second-level high-frequency components LH2, HL2, HH2, obtained by further decomposing the first-level low-frequency band. Finally, the first-level high-frequency bands LH, HL, HH capture fine-grained textures and directional details.
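The seven-sub-band layout described above can be sketched in plain Python. Several choices here are assumptions rather than details taken from the text: the Haar wavelet, orthonormal scaling, the sub-band naming convention (LH as low-pass horizontally and high-pass vertically), and the helper names `haar_dwt2` and `two_level_dwt`. Images are nested lists whose height and width are divisible by four.

```python
def haar_dwt2(img):
    """One level of the 2-D orthonormal Haar DWT; returns (LL, LH, HL, HH)."""
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, len(img), 2):
        r_ll, r_lh, r_hl, r_hh = [], [], [], []
        for j in range(0, len(img[0]), 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            r_ll.append((a + b + c + d) / 2.0)  # local average (coarse structure)
            r_lh.append((a + b - c - d) / 2.0)  # top-minus-bottom difference
            r_hl.append((a - b + c - d) / 2.0)  # left-minus-right difference
            r_hh.append((a - b - c + d) / 2.0)  # diagonal difference
        ll.append(r_ll)
        lh.append(r_lh)
        hl.append(r_hl)
        hh.append(r_hh)
    return ll, lh, hl, hh


def two_level_dwt(img):
    """Decompose img into the seven sub-bands LL2, LH2, HL2, HH2, LH, HL, HH."""
    ll1, lh1, hl1, hh1 = haar_dwt2(img)
    ll2, lh2, hl2, hh2 = haar_dwt2(ll1)  # second level acts on the first-level LL band
    return {"LL2": ll2, "LH2": lh2, "HL2": hl2, "HH2": hh2,
            "LH": lh1, "HL": hl1, "HH": hh1}


# Grouping by structural role, as described in the text.
GROUPS = {
    "coarse": ["LL2"],
    "mid": ["LH2", "HL2", "HH2"],
    "fine": ["LH", "HL", "HH"],
}

# Toy 8x8 ramp image: level-1 bands are 4x4, level-2 bands are 2x2.
image = [[float(8 * r + c) for c in range(8)] for r in range(8)]
bands = two_level_dwt(image)
for name, band in bands.items():
    print(name, f"{len(band)}x{len(band[0])}")
```

Because each level halves the spatial resolution, the level-2 bands of an H x W image have size H/4 x W/4, which illustrates why deeper decompositions leave progressively less spatial detail per sub-band.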
After grouping the sub-bands into three frequency groups, we analyze the discrepancy between the teacher and student outputs within each group, as illustrated in Figure 3 and Figure 4. Specifically, we measure the magnitude of differences in each frequency group to identify where the representational gap is most significant.
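Our analysis reveals that the mid-frequency group, denoted as $G_{mid}$ (corresponding to LH2, HL2, HH2), exhibits larger discrepancies than the coarse-structure and fine-texture groups. This suggests that intermediate structural details are not sufficiently captured by the student when trained with uniform weighting.
Based on this observation, we assign a higher loss weight to the mid-frequency group in our distillation objective. By emphasizing the frequency components with larger discrepancies, the proposed loss encourages stronger alignment in these regions, facilitating more balanced knowledge transfer across the frequency spectrum.
The resulting distillation loss is formulated as follows:
$$\mathcal{L}_{DWT} = \sum_{g \in \{low,\, mid,\, high\}} w_g \sum_{b \in G_g} \left\| \mathrm{DWT}_b\!\left( I_{SR}^{T} - I_{SR}^{S} \right) \right\|_1,$$
where $I_{SR}^{T}$ and $I_{SR}^{S}$ are the teacher and student outputs, $\mathrm{DWT}_b(\cdot)$ extracts sub-band $b$ of the two-level decomposition, $G_{low} = \{LL2\}$, $G_{mid} = \{LH2, HL2, HH2\}$, $G_{high} = \{LH, HL, HH\}$, and $w_g$ is the weight assigned to group $g$.
The complete training objective is formulated as
$$\mathcal{L}_{total} = \mathcal{L}_{task} + \lambda \, \mathcal{L}_{DWT},$$
where $\mathcal{L}_{task}$ is the $L_1$ reconstruction loss with respect to the GT and $\lambda$ is the distillation weight.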
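A group-weighted sub-band loss of the kind described in this section can be sketched as follows. The sketch takes already-decomposed sub-bands as input (dictionaries mapping sub-band names to nested lists), measures each discrepancy with a mean absolute difference, and uses illustrative weights that merely emphasize the mid-frequency group. The per-band distance, the weight values, and the names `band_l1` and `grouped_dwt_loss` are all assumptions, not values reported in this paper.

```python
# Sub-band grouping by structural role, as described in the text.
GROUPS = {
    "coarse": ["LL2"],
    "mid": ["LH2", "HL2", "HH2"],
    "fine": ["LH", "HL", "HH"],
}


def band_l1(a, b):
    """Mean absolute difference between two equally sized sub-bands."""
    diffs = [abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def grouped_dwt_loss(teacher_bands, student_bands, weights):
    """Return (total, per_group): weighted sum of per-group sub-band discrepancies."""
    per_group = {}
    for group, names in GROUPS.items():
        per_group[group] = sum(
            band_l1(teacher_bands[name], student_bands[name]) for name in names
        )
    total = sum(weights[group] * value for group, value in per_group.items())
    return total, per_group


# Toy 1x1 sub-bands: the student deviates most in the mid-frequency bands.
teacher_bands = {name: [[0.0]] for names in GROUPS.values() for name in names}
student_bands = {
    "LL2": [[0.5]],
    "LH2": [[1.0]], "HL2": [[1.0]], "HH2": [[1.0]],
    "LH": [[0.25]], "HL": [[0.25]], "HH": [[0.25]],
}
weights = {"coarse": 1.0, "mid": 2.0, "fine": 1.0}  # hypothetical values

total, per_group = grouped_dwt_loss(teacher_bands, student_bands, weights)
print(per_group, total)
```

Keeping the per-group terms separate, as the second return value does, also makes it straightforward to log which frequency group dominates the discrepancy during training.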
Our KD framework adopts output-level distillation only and does not employ feature-level distillation. This design choice is motivated by the fundamental differences between the intermediate representations of diffusion-based teacher models and transformer-based student models, which may render feature-level supervision unstable.
In diffusion models, intermediate features are typically defined in a latent space and are conditioned on both the diffusion timestep and the input condition. Even for the same input, the semantic meaning and statistical properties of intermediate features evolve throughout the reverse diffusion process. As a result, these representations can be interpreted as dynamic, stochastic features reflecting the underlying generative prior.
In contrast, intermediate features in transformer-based student models are deterministic feed-forward representations computed sequentially across layers for a given input. Unlike diffusion models, their feature distributions do not vary with timestep-dependent stochastic sampling processes.
Due to this structural discrepancy, establishing a direct one-to-one correspondence between teacher and student intermediate features is nontrivial. Naïve feature matching would simultaneously enforce alignment across mismatched representation spaces and heterogeneous feature distributions. This can introduce high-variance gradients and potentially destabilize optimization throughout the distillation process, particularly during early training stages.
By comparison, output-level distillation operates in the shared observation space of reconstructed images, where both teacher and student produce outputs in the same domain. This provides a well-defined and stable alignment target. Furthermore, when combined with frequency decomposition, output-level supervision allows controlled transfer of specific frequency components.
Therefore, to ensure stable and effective transfer of the diffusion teacher’s perceptual prior, we adopt an output-level distillation framework without feature-level supervision.
3.5. Training Process
At the early stage of training, the student model has not yet established stable structural reconstruction capability. Introducing a strong distillation signal at this point may conflict with the gradient induced by the GT-based reconstruction loss, potentially leading to unstable optimization.
To mitigate this issue, we adopt a strategy in which the distillation weight is linearly increased during training.
The objective of this strategy is to maintain an appropriate balance between the GT supervision and the teacher guidance, thereby enabling stable extraction of high-frequency information within the proposed distillation framework. In super-resolution, the ground-truth image contains the most reliable and detailed high-frequency information. While the diffusion teacher output also includes rich high-frequency components, it may inevitably contain noise or hallucinated textures due to its probabilistic generation process. This characteristic differs fundamentally from knowledge distillation in classification, where the teacher’s output distribution typically serves as a reliable supervisory signal.
Therefore, maintaining equal weighting between the GT loss and the distillation loss from the beginning of training may cause the student to be overly influenced by imperfect teacher outputs, potentially leading to suboptimal learning dynamics.
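To address this issue, we gradually increase the weight of the KD loss during training. Specifically, training begins with the distillation weight set to zero, and the weight is linearly increased to a predefined final value over the course of optimization. This strategy can be expressed as follows:
$$\lambda(t) = \lambda_{final} \cdot \frac{t}{T}.$$
Here, $t$ denotes the current training iteration, $T$ represents the total number of training iterations, and $\lambda_{final}$ denotes the predefined final distillation weight. Accordingly, the overall training objective can be expressed as follows:
$$\mathcal{L}_{total}(t) = \mathcal{L}_{task} + \lambda(t) \, \mathcal{L}_{DWT},$$
where $\mathcal{L}_{task}$ is the $L_1$ task loss with respect to the GT and $\mathcal{L}_{DWT}$ is the wavelet-based distillation loss.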
As a result, during the early stage of training, the student model primarily focuses on learning the mapping to the ground truth, emphasizing restoration of the overall structure and low-frequency components. As training progresses and the distillation weight gradually increases, the student is increasingly guided by the high-frequency information generated by the diffusion teacher.
This progressive weighting strategy enables the student to reconstruct HR images from LR inputs without compromising global structural consistency, while gradually strengthening the emphasis on high-frequency and perceptual details. In other words, the linear increase of the distillation weight maintains stable distortion-oriented reconstruction through the task loss, while progressively enhancing perceptual influence via knowledge distillation.
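The linear progressive weighting described in this section amounts to a one-line schedule. The sketch below is illustrative only: the function names `kd_weight` and `scheduled_objective` are hypothetical, the final weight and iteration count in the example are placeholders, and the clamping of the progress ratio to [0, 1] is an added safeguard rather than something stated in the text.

```python
def kd_weight(t, total_iters, lam_final):
    """Distillation weight that grows linearly from 0 (t = 0) to lam_final (t = total_iters)."""
    progress = min(max(t / total_iters, 0.0), 1.0)  # clamp: assumption, not from the text
    return lam_final * progress


def scheduled_objective(task_loss, kd_loss, t, total_iters, lam_final):
    """Task loss with a fixed weight of 1 plus the progressively weighted KD loss."""
    return task_loss + kd_weight(t, total_iters, lam_final) * kd_loss


# Placeholder numbers: 100 iterations, final weight 1.0, constant toy losses.
for t in (0, 25, 50, 100):
    w = kd_weight(t, total_iters=100, lam_final=1.0)
    loss = scheduled_objective(task_loss=1.0, kd_loss=2.0, t=t,
                               total_iters=100, lam_final=1.0)
    print(f"t={t:3d}  weight={w:.2f}  objective={loss:.2f}")
```

At `t = 0` the objective reduces to the task loss alone, which matches the description that the student first learns the mapping to the ground truth before teacher guidance is phased in.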
4. Experiments
4.1. Experimental Settings
4.1.1. Datasets and Metrics
The transformer-based student model is trained on the standard synthetic dataset setting using the 800 training images from the DIV2K dataset [26], following common practice in SR research. However, since this work focuses on real-world SR performance, our evaluation protocol differs from conventional synthetic benchmark settings. Instead of the traditional benchmark datasets (Set5 [27], Set14 [28], BSD100 [29], and Urban100 [30]), we evaluate on RealSR [31] and the ImageNet-Test dataset [32], which better reflect realistic degradation scenarios.
In addition to PSNR, a distortion-oriented metric widely used in synthetic SR evaluation, we report perceptual metrics including LPIPS and MUSIQ. By jointly analyzing distortion and perceptual metrics under real-world settings, we assess whether the proposed framework improves the distortion–perception trade-off in terms of generalization performance.
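For reference, PSNR is defined as 10 log10(MAX^2 / MSE), where MAX is the peak pixel value and MSE is the mean squared error between the reference and the estimate. The following is a minimal sketch on nested lists; the function name `psnr` is a hypothetical helper, and protocol details such as the color space or border cropping are not specified in this section and are therefore not modeled. LPIPS and MUSIQ rely on learned networks and are not sketched here.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2-D images."""
    squared_error = 0.0
    count = 0
    for row_r, row_e in zip(reference, estimate):
        for r, e in zip(row_r, row_e):
            squared_error += (r - e) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example on a [0, 1] intensity scale: a uniform error of 0.1 gives about 20 dB.
reference = [[0.5, 0.5], [0.5, 0.5]]
estimate = [[0.6, 0.6], [0.6, 0.6]]
print(f"{psnr(reference, estimate, max_value=1.0):.2f} dB")
```

Higher PSNR indicates lower distortion, whereas for LPIPS lower values indicate closer perceptual similarity, so the two kinds of metric must be read in opposite directions when compared side by side.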
4.1.2. Teacher and Student Model
In the proposed KD framework, we adopt SwinIR-small [4] as the student model. This variant compresses the original SwinIR configuration (180 channels with six residual blocks) into a lightweight architecture with 60 channels and four residual blocks. This setup follows the commonly used configuration in homogeneous transformer-based KD studies for SR, ensuring a fair and practical lightweight student baseline.
As the teacher model, we employ ResShift [7], considering the characteristics of diffusion-based SR models. Diffusion approaches for SR have evolved along two main directions. One line of work retrains the diffusion process in a task-specific manner [33, 34, 35], where the low-resolution signal strongly intervenes throughout the diffusion trajectory, enabling consistent reconstruction at the cost of substantial training and inference overhead. The other line leverages pretrained diffusion priors and controls the reverse process through conditioning [36, 37, 38], which reduces computational cost but may struggle to maintain stable perceptual quality.
ResShift differs from conventional diffusion models by learning residual corrections rather than directly predicting denoising trajectories. In particular, noise injection during the reverse process is conditioned on the low-resolution image, resulting in a significantly narrower restoration search space compared to methods that start from pure Gaussian noise. This LR-conditioned noise design reduces unnecessary stochastic sampling and shortens the sampling process from hundreds of steps to approximately 15 steps while preserving strong high-frequency texture and structural restoration capability. Owing to these properties, the student model receives more stable and perceptually meaningful texture guidance during KD training.
4.2. Performance Evaluation
Table 1 presents the quantitative results of the proposed method. Since our framework employs output-level distillation only, we compare it with AugKD [9], which also performs output-level distillation, as well as with a baseline trained using only the standard pixel-wise reconstruction loss. All reported results are averaged over three independent runs.
On both real-world datasets, the proposed method achieves the best overall performance. Notably, although AugKD improves distortion-oriented metrics compared to the reconstruction-loss-only baseline, it yields worse perceptual scores (LPIPS and MUSIQ) than the standalone SwinIR-small model. This suggests that homogeneous pixel-level KD strategies, which are effective on synthetic benchmark datasets, do not necessarily translate to perceptual gains in real-world SR scenarios.
In contrast, the proposed method achieves consistent improvements across both distortion and perceptual metrics. On the RealSR dataset, it reduces LPIPS by 0.059 and improves MUSIQ by 16.969 compared to the baseline. Furthermore, it surpasses AugKD in PSNR while simultaneously improving perceptual scores, demonstrating a more favorable distortion–perception trade-off.
Table 2 and Table 3 compare the computational and quantitative performance of the diffusion teacher (ResShift), its one-step variant (ResShift-1 step), and the proposed transformer-based student (SwinIR-small). Due to its 15-step reverse diffusion process, ResShift incurs substantial computational cost and latency. Reducing the sampling process to a single step decreases FLOPs from 95.736 G to 48.562 G (approximately 2.0× reduction) and shortens inference time from 371.34 ms to 83.68 ms (approximately 4.4× speedup).
However, even with one-step sampling, ResShift-1 step remains computationally heavier than the transformer-based student, as it still relies on a large diffusion U-Net backbone. Compared to ResShift-1 step, SwinIR-small further reduces FLOPs from 48.562 G to 10.087 G (approximately 4.8× smaller) and decreases inference time from 83.68 ms to 19.09 ms (approximately 4.4× faster). In terms of quantitative performance, ResShift-1 step achieves a MUSIQ score that is 8.685 higher than SwinIR-small, while LPIPS remains comparable. However, its PSNR is 1.06 dB lower. These results indicate that reducing diffusion sampling steps alone does not necessarily provide an optimal balance between efficiency and distortion performance.
Overall, Table 2 and Table 3 demonstrate that the transformer-based student offers a more favorable trade-off between computational efficiency and reconstruction quality compared to step-reduced diffusion variants.
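The reduction factors quoted above follow directly from the reported FLOPs and latency figures. The short script below recomputes them. The dictionary layout and the `ratio` helper are illustrative only, and the teacher-versus-student ratio is derived here from the same figures rather than quoted from the tables.

```python
# (FLOPs in G, inference time in ms), as quoted in the text for Tables 2 and 3.
costs = {
    "ResShift": (95.736, 371.34),
    "ResShift-1 step": (48.562, 83.68),
    "SwinIR-small": (10.087, 19.09),
}


def ratio(heavier: str, lighter: str) -> tuple:
    """Return (FLOPs ratio, latency ratio) of `heavier` relative to `lighter`."""
    flops_a, time_a = costs[heavier]
    flops_b, time_b = costs[lighter]
    return flops_a / flops_b, time_a / time_b


pairs = [
    ("ResShift", "ResShift-1 step"),
    ("ResShift-1 step", "SwinIR-small"),
    ("ResShift", "SwinIR-small"),
]
for heavier, lighter in pairs:
    flops_ratio, time_ratio = ratio(heavier, lighter)
    print(f"{heavier} vs {lighter}: "
          f"{flops_ratio:.1f}x FLOPs, {time_ratio:.1f}x latency")
```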
Figure 5 and Figure 6 present qualitative examples corresponding to the quantitative results in Table 1. In both cases, when distillation is performed using a pixel-wise loss, noticeable color shifts are observed in the reconstructed images. In contrast, the proposed method better preserves overall color consistency while enhancing structural details that are not fully recovered by the standalone SwinIR-small model. Specifically, Figure 5 shows improved reconstruction of repetitive diagonal patterns that appear blurred in other methods. Similarly, Figure 6 illustrates clearer restoration of letter boundaries that are indistinct in the outputs of the other methods.
These qualitative observations are consistent with the quantitative results, suggesting that the proposed framework achieves a more balanced distortion–perception trade-off in real-world SR scenarios.
4.3. Ablation Studies and Analysis
Table 4 presents the experimental results for different decomposition levels when equal weights are applied to all DWT sub-bands. When adopting the proposed two-level decomposition, PSNR improves by 0.02 dB and MUSIQ increases by 4.386 compared to the one-level setting. In contrast, deeper decompositions (three levels or more) show a tendency toward performance degradation despite involving a larger number of sub-bands. This suggests that excessive decomposition may over-fragment frequency components, thereby limiting the effective transfer of essential structural and texture information required for restoration.
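To make the decomposition-level setting concrete, the following sketch computes an equally weighted sub-band loss between a student and a teacher output. It is an illustration under stated assumptions, not the authors' implementation: it uses an orthonormal Haar wavelet written directly in NumPy, a mean absolute error per sub-band, and single-channel images whose sides are divisible by 2 to the power of `levels`. The function names are hypothetical.

```python
import numpy as np


def haar_dwt2(x: np.ndarray):
    """One level of an orthonormal 2D Haar transform (even height and width)."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0           # low-pass approximation
    details = ((a + b - c - d) / 2.0,    # three detail sub-bands
               (a - b + c - d) / 2.0,
               (a - b - c + d) / 2.0)
    return ll, details


def haar_wavedec2(x: np.ndarray, levels: int):
    """Return [LL_coarsest, details_coarsest, ..., details_finest]."""
    ll = x.astype(np.float64)
    detail_levels = []
    for _ in range(levels):
        ll, details = haar_dwt2(ll)
        detail_levels.append(details)
    return [ll] + detail_levels[::-1]


def dwt_l1_loss(student: np.ndarray, teacher: np.ndarray,
                levels: int = 2) -> float:
    """Mean absolute error per sub-band, averaged with equal weights."""
    s = haar_wavedec2(student, levels)
    t = haar_wavedec2(teacher, levels)
    errors = [np.mean(np.abs(s[0] - t[0]))]
    for s_det, t_det in zip(s[1:], t[1:]):
        errors.extend(np.mean(np.abs(sb - tb)) for sb, tb in zip(s_det, t_det))
    return float(np.mean(errors))
```

With `levels=2` the loss averages over seven sub-bands (one approximation band plus three detail bands per level). Each additional level adds three more bands while halving their spatial size, which is one way to picture the fragmentation effect discussed above.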
Table 5 compares the proposed DWT-based loss with other frequency-based loss formulations. The DTCWT loss, which captures more fine-grained directional frequency information than standard DWT, shows some improvement in the distortion–perception trade-off compared to the baseline. However, it achieves lower overall performance than the proposed DWT-based loss. In contrast, the Fourier-based loss improves perceptual metrics to some extent but leads to degradation in distortion-oriented performance, indicating a less balanced trade-off between reconstruction fidelity and perceptual quality.
Table 6 and Table 7 present the experimental results under different weighting configurations of the proposed loss function. As shown in Figure 3 and Figure 4, the discrepancy between teacher and student outputs is largest in the mid-frequency (MF) band group among the three groups (LF, MF, HF).
Based on this observation, assigning a higher weight to the MF group than to the LF and HF groups leads to improved performance compared to uniform weighting. This tendency is consistently observed in the experimental results. The best performance is achieved when the MF group is assigned twice the weight of the other groups. This suggests that emphasizing frequency bands with larger teacher–student discrepancies is beneficial for more effective knowledge transfer.
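The best-performing configuration, in which the MF group receives twice the weight of the LF and HF groups, can be written as a weighted combination of per-group errors. The sketch below is hypothetical: the function name, the normalization by the total weight, and the way sub-bands are assigned to the LF, MF, and HF groups are assumptions not taken from the text above.

```python
def grouped_kd_loss(group_errors: dict, weights: dict = None) -> float:
    """Combine per-group errors with non-uniform weights.

    group_errors maps a group name ("LF", "MF", "HF") to its scalar error.
    The default weights give MF twice the weight of the other groups;
    dividing by the total weight is an illustrative choice.
    """
    if weights is None:
        weights = {"LF": 1.0, "MF": 2.0, "HF": 1.0}
    total_weight = sum(weights[g] for g in group_errors)
    weighted_sum = sum(weights[g] * err for g, err in group_errors.items())
    return weighted_sum / total_weight
```

Under this normalization, an error that appears only in the MF group contributes half of its magnitude to the loss, compared with one third under uniform weighting.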
In addition, adopting a linearly increasing schedule for the KD loss weight during training improves performance compared to using a fixed weight. Specifically, the linear scheduling strategy yields an improvement of 0.12 dB in PSNR and 0.727 in MUSIQ. These results indicate that prioritizing GT-based structural learning in early training and gradually strengthening the influence of the teacher’s high-frequency prior is more effective than maintaining a constant distillation weight throughout training.
However, excessively increasing the teacher influence eventually degrades performance. This implies that overemphasizing high-frequency components without adequately preserving global structural information can harm the distortion–perception balance.
In summary, the ablation results in Table 4, Table 5, Table 6 and Table 7 demonstrate that frequency-based distillation alone does not automatically guarantee performance improvement. Consistent gains are achieved only when a moderate decomposition level (two-level), discrepancy-aware non-uniform weighting (emphasizing MF), and progressive KD scheduling are jointly applied. These findings indicate that the effectiveness of frequency-based KD depends not merely on frequency-domain comparison itself, but on careful band-wise importance design and controlled distillation strength.
Finally, Table 8 reports the results on the synthetic benchmark Urban100 dataset. In synthetic settings, the degradation process is explicitly defined, and evaluation primarily emphasizes pixel-level alignment with the GT HR image. Under such conditions, the distortion-oriented student model achieves higher PSNR than the diffusion teacher.
In contrast, the diffusion teacher prioritizes perceptually plausible texture and high-frequency restoration based on its learned generative prior, rather than strictly enforcing exact pixel-wise alignment with the GT. Instead of converging to a single deterministic pixel solution, the diffusion model operates within a plausible restoration distribution conditioned on the LR input. Consequently, small pixel-wise discrepancies may accumulate in synthetic benchmarks, resulting in lower PSNR despite perceptually convincing outputs.
Accordingly, on Urban100, the student outperforms the teacher in distortion-oriented metrics, whereas the teacher achieves stronger perceptual performance. In this scenario, applying a pixel-wise KD loss enforces strict pixel-level alignment between teacher and student, which can degrade performance in both distortion and perception.
In contrast, the proposed decomposed DWT loss achieves consistent, albeit modest, improvements even in the synthetic setting, with gains of 0.04 dB in PSNR and 0.263 in MUSIQ. These results suggest that the proposed method transfers the texture and high-frequency restoration characteristics of the diffusion teacher to the student without excessively enforcing pixel-level matching, even under synthetic benchmark conditions.