DTKD: Diffusion-to-Transformer Heterogeneous Knowledge Distillation for Efficient and Perceptually Enhanced Super-Resolution

Park, Jeong Hyeok; Song, Byung Cheol

doi:10.3390/electronics15101986


by Jeong Hyeok Park and Byung Cheol Song ^*

Department of Electrical and Computer Engineering, Inha University, Incheon 22212, Republic of Korea

^* Author to whom correspondence should be addressed.

Electronics 2026, 15(10), 1986; https://doi.org/10.3390/electronics15101986

Submission received: 3 April 2026 / Revised: 26 April 2026 / Accepted: 6 May 2026 / Published: 7 May 2026

(This article belongs to the Topic Computer Vision and Image Processing, 3rd Edition)


Abstract

Single-image super-resolution (SISR) aims to reconstruct high-resolution (HR) images from low-resolution (LR) inputs and remains fundamentally ill-posed due to the inherent ambiguity of missing high-frequency details. While diffusion-based SR models achieve superior perceptual quality through iterative denoising, their multi-step sampling process results in substantial computational cost and latency. In contrast, transformer-based SR models offer efficient single-forward inference but are typically optimized for distortion-oriented objectives, limiting perceptual realism. In this paper, we propose DTKD, a diffusion-to-transformer heterogeneous knowledge distillation framework that transfers the perceptual prior of a diffusion teacher into an efficient transformer student. To effectively bridge the representational gap between generative diffusion outputs and deterministic transformer reconstructions, we introduce a frequency-group-aware distillation loss based on two-level discrete wavelet transform (DWT). The loss decomposes images into structured frequency sub-bands and assigns non-uniform weights to emphasize discrepancy-sensitive mid-frequency components. Furthermore, we adopt a progressive scheduling strategy that gradually increases the distillation weight during training to stabilize optimization and balance structural fidelity with perceptual enhancement. Extensive experiments on real-world SR benchmarks demonstrate that the proposed framework consistently improves perceptual quality over a standalone transformer student while maintaining transformer-level inference efficiency. Ablation studies further validate the importance of moderate frequency decomposition, discrepancy-aware weighting, and progressive distillation scheduling. These results suggest that heterogeneous distillation provides an effective and practical approach for transferring diffusion-based generative priors into efficient super-resolution models.

Keywords: diffusion; image super-resolution; knowledge distillation; transformer

1. Introduction

Single-image super-resolution (SISR) aims to reconstruct a high-resolution (HR) image from a low-resolution (LR) observation and has been widely applied in video streaming and broadcasting, satellite imaging, and medical imaging. In scenarios where the quality of the original data is inherently limited, super-resolution (SR) plays a critical role in recovering important visual details or generating visually plausible HR outputs. The SISR problem is inherently ill-posed, as multiple plausible HR solutions may correspond to the same LR input. Consequently, the characteristics of the reconstructed results largely depend on the restoration objective adopted by the model.

Deep learning-based super-resolution has rapidly evolved, primarily driven by convolutional neural network (CNN) architectures. SRCNN [1] demonstrated the feasibility of learning-based SR using a relatively simple three-layer network. Subsequently, deeper and more expressive models such as EDSR [2] and RCAN [3] significantly improved distortion-oriented performance, particularly in terms of PSNR.

More recently, transformer architectures have introduced another turning point in SR research by effectively modeling long-range dependencies and global context. For instance, SwinIR [4] combines window-based self-attention with hierarchical representation learning, enabling simultaneous modeling of local details and global structures. Compared to CNN-based approaches, transformer-based models have achieved superior PSNR performance and improved generalization across multiple benchmarks.

These CNN- and transformer-based SR models are typically trained by minimizing pixel-wise reconstruction errors, such as L1 or MSE losses, with respect to ground-truth (GT) images. As a result, they are strongly aligned with distortion minimization objectives. However, high-frequency components such as textures and edges inherently involve ambiguity, as they cannot be uniquely determined from the LR observation alone. In such cases, distortion-driven optimization tends to favor statistically averaged solutions. This phenomenon reflects the well-known perception–distortion trade-off in SR: optimizing for distortion-oriented metrics does not necessarily lead to improved perceptual realism or texture fidelity. Consequently, while quantitative metrics may improve, the reconstructed images often appear overly smooth and lack rich textures. This tendency becomes more pronounced in real-world SR scenarios, where degradations are complex and less well defined.
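The averaging effect can be made concrete with a toy calculation. The sketch below is illustrative only: the two candidate patches are hypothetical stand-ins for two equally plausible HR textures that map to the same LR input, and MSE is used as the distortion measure. Under these assumptions, a flat, texture-free prediction achieves a lower expected error than either sharp candidate.

```python
# Two hypothetical, equally plausible HR textures for the same LR patch.
candidate_a = [1.0, -1.0, 1.0, -1.0]   # high-contrast texture
candidate_b = [-1.0, 1.0, -1.0, 1.0]   # same texture, shifted by one pixel
candidates = [candidate_a, candidate_b]


def expected_mse(pred, targets):
    """Mean squared error of `pred`, averaged over equally likely targets."""
    total = 0.0
    for target in targets:
        total += sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return total / len(targets)


# Pixel-wise mean of the candidates: a flat patch with no texture.
mean_pred = [sum(pixels) / len(pixels) for pixels in zip(*candidates)]

mse_flat = expected_mse(mean_pred, candidates)     # error of the averaged patch
mse_sharp = expected_mse(candidate_a, candidates)  # error of committing to one texture

print(mean_pred, mse_flat, mse_sharp)
```

In this toy setting the flat patch has half the expected MSE of the sharp one, which is the over-smoothing tendency described above.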

To address these limitations, SR research has evolved beyond distortion-oriented reconstruction toward directly improving visual realism and perceptual quality. GAN-based approaches, such as SRGAN [5], demonstrated that adversarial training can generate sharper and more realistic textures. Although such models often yield lower PSNR values, they tend to produce visually more natural results according to human perception.

However, GAN-based training suffers from instability and limited controllability over fine-grained textures, which restricts its practical applicability. These challenges become more pronounced in real-world SR scenarios, where degradations are complex and not explicitly defined. In such cases, achieving stable and high-quality reconstruction requires more robust generative modeling paradigms.

In this context, diffusion models have emerged as a powerful alternative for enhancing perceptual quality in SR [6]. By progressively denoising a noisy signal, diffusion models can generate or restore images in a stable manner while preserving rich texture details. Compared to GAN-based methods, diffusion-based SR models provide more stable training dynamics and superior high-frequency reconstruction, including fine textures and natural edge structures. Recent studies report strong perceptual performance not only under synthetic degradation benchmarks but also in real-world settings.

Despite these advantages, diffusion-based SR models incur substantial computational and memory costs due to their multi-step denoising process. The iterative sampling procedure significantly increases inference time, making real-time deployment or lightweight applications challenging.

To mitigate the slow inference of diffusion models, recent studies have actively explored knowledge distillation (KD) strategies that reduce the number of denoising steps. In particular, one-step or few-step distillation methods have been proposed to approximate the iterative denoising process of a diffusion teacher using a compact student model [7,8]. These approaches significantly reduce the number of sampling steps while largely preserving perceptual quality.

However, in most cases, the student model remains within the diffusion paradigm. As a result, even with fewer denoising steps, the inherent architectural complexity and computational overhead of diffusion models persist. Consequently, their inference efficiency still lags behind that of transformer-based SR models.

In contrast, transformer-based SR models have benefited from homogeneous KD techniques [9,10,11], which focus on compressing models within the same architectural family. Such approaches improve structural efficiency and enable faster inference, making them suitable for real-time applications. Nevertheless, since both teacher and student share similar deterministic restoration biases, these methods do not explicitly incorporate the generative texture prior learned by diffusion models. Therefore, the perceptual advantages offered by diffusion-based SR cannot be fully transferred in homogeneous transformer distillation settings.

In summary, alleviating the perception–distortion trade-off under practical computational constraints requires a new design that directly combines the inference efficiency of transformers with the perceptual strengths of diffusion models. This motivates a heterogeneous distillation framework that bridges fundamentally different generative and deterministic paradigms.

From this perspective, homogeneous KD alone is insufficient to fully address the limitations of diffusion-based SR. Diffusion-to-diffusion KD can effectively transfer the perceptual prior learned by the teacher within the same generative paradigm. However, since the student must still retain the iterative denoising process and timestep-dependent generation mechanism, the structural overhead caused by repeated sampling cannot be fundamentally eliminated.

In contrast, transformer-to-transformer KD preserves the single-path restoration architecture, offering advantages in computational efficiency. Nevertheless, because both teacher and student share similar deterministic restoration biases, the generative texture prior and perceptual realism inherent in diffusion models cannot be sufficiently transferred.

These limitations become more pronounced in real-world SR scenarios, where degradations are complex and often unknown. In such cases, distortion minimization alone is insufficient to produce natural and perceptually realistic textures. Therefore, a heterogeneous KD framework that combines the perceptual strengths of diffusion models with the efficient single-pass inference of transformers is particularly necessary.

In this paper, we propose a heterogeneous KD framework that transfers knowledge from a diffusion teacher to a transformer student for super-resolution. The proposed framework leverages the rich high-frequency texture modeling and perceptual reconstruction capability learned by the diffusion model as teacher knowledge, and distills it into a computationally efficient transformer-based student. The goal is to enhance perceptual quality while maintaining inference costs comparable to standard transformer-based SR models. In other words, we treat the high-quality texture generation capability of diffusion models as a source of teacher knowledge and adopt an efficient single-pass transformer backbone as the student, seeking a practical balance between perceptual realism and computational efficiency.

Furthermore, we observe that conventional output-level alignment based solely on pixel-wise losses such as L1 or MSE may be insufficient to effectively transfer the high-frequency texture knowledge learned by diffusion teachers. To address this limitation, we introduce a frequency-aware distillation loss based on the Discrete Wavelet Transform (DWT). Specifically, we decompose the output difference between teacher and student into frequency subbands using a two-level DWT, resulting in seven subbands (LL2, LH2, HL2, HH2, LH, HL, HH). We then design a weighted decomposed DWT loss by grouping subbands according to their structural roles—global structure (LL2), intermediate details (LH2, HL2, HH2), and fine textures (LH, HL, HH)—and assigning different importance weights to each group.

This wavelet-based decomposition enables the separation of structural components, edges, and fine textures that are entangled in the pixel domain, facilitating more targeted knowledge transfer of the high-frequency texture components where diffusion models exhibit strong advantages. As a result, the proposed loss allows more refined control over the balance between distortion and perceptual quality.

In addition, we highlight that KD in SR exhibits different characteristics from that in classification tasks. In classification, teacher logits typically provide reliable supervisory signals that can be strongly followed by the student. In contrast, in SR, the most reliable high-frequency reference is the GT, rather than the teacher output. While the diffusion teacher may produce sharper results than distortion-oriented models, its outputs may also contain noise, artifacts, or hallucinated textures. This issue can become more pronounced in real-world SR scenarios with complex degradations.

Therefore, naively enforcing strong and simultaneous alignment with both the GT and teacher outputs may lead to suboptimal learning dynamics. The student may either be prematurely biased toward imperfect high-frequency signals from the teacher or fail to sufficiently absorb the perceptual prior due to dominance of the task loss. To mitigate this issue, we introduce a linear progressive weighting strategy for the distillation loss. While keeping the task loss (L1 with respect to GT) fixed, the distillation loss weight is gradually increased during training. This curriculum-like strategy allows the student backbone to first establish stable restoration capability before progressively incorporating high-frequency perceptual knowledge, thereby improving both training stability and knowledge transfer effectiveness.

The main contributions of this paper can be summarized as follows:

- We present a heterogeneous diffusion-to-transformer knowledge distillation framework for super-resolution that transfers the perceptual prior learned by diffusion models into an efficient transformer-based student, aiming to alleviate the perception–distortion trade-off under practical computational constraints.
- We introduce a frequency-group-aware decomposed DWT distillation loss, which decomposes teacher–student discrepancies into frequency sub-bands and assigns group-specific importance weights for more effective high-frequency knowledge transfer. In addition, we employ a curriculum-inspired progressive KD scheduling strategy to gradually enhance perceptual quality while maintaining training stability.
- Through extensive quantitative and qualitative evaluations, we demonstrate that the proposed framework achieves a favorable balance among distortion, perceptual quality, and computational efficiency, improving the practical trade-off compared to existing approaches.

2. Related Works

2.1. Distortion-Oriented Super-Resolution Models

Super-resolution research centered on distortion-oriented metrics, such as PSNR and SSIM, has primarily evolved toward maximizing reconstruction accuracy through pixel-wise error minimization (e.g., $L_{1}$ or $L_{2}$ losses) with respect to GT images.

Representative CNN-based models, including EDSR [2] and RCAN [3], as well as earlier works [12,13], have established strong baselines by employing deep residual structures and channel attention mechanisms to enhance representational capacity. These approaches have long served as standard benchmarks in distortion-oriented SR research.

With the introduction of transformer architectures into SR, restoration performance has been further improved through global context modeling and long-range dependency learning. SwinIR [4] and its variants leverage window-based self-attention to maintain computational efficiency while benefiting from hierarchical representations. More recently, HAT [14] enhances attention design to further improve performance, while HiT-SR [15] and PFT [16] explore hierarchical and hybrid structures to balance efficiency and accuracy. In addition, ATD [17] strengthens token–context interactions to alleviate the limitations of window-based attention, demonstrating continued performance improvements in distortion-oriented transformer-based SR models.

2.2. Perceptual-Oriented Super-Resolution Models

Distortion-oriented optimization often converges to statistically averaged solutions, which tend to over-smooth high-frequency components such as textures and edges that are inherently difficult to reconstruct. To alleviate this issue, GAN-based approaches, including SRGAN [5] and ESRGAN [18], as well as subsequent extensions [19], introduced adversarial training and perceptual losses to enhance visual realism and texture fidelity. However, GAN-based training is often constrained by instability and limited controllability over artifacts, particularly in real-world SR scenarios where degradations are uncertain and diverse.

Against this backdrop, diffusion-based SR has emerged as a promising alternative for improving perceptual quality. By progressively denoising noisy signals, diffusion models generate high-quality textures in a stable manner. Recent works such as SeeSR [20], which incorporates semantic and structural guidance for real-world SR, ResShift [7], which reformulates diffusion from a residual restoration perspective, and FaithDiff [21], which emphasizes fidelity preservation, demonstrate that the generative prior provided by diffusion models remains effective even under realistic and complex degradation settings.

2.3. Knowledge Distillation for Super-Resolution

Knowledge distillation (KD) in super-resolution can generally be categorized into two directions: model compression within the same architectural family and distillation aimed at accelerating diffusion-based models with slow sampling procedures. In homogeneous KD settings, where teacher and student belong to the same model family, prior studies have focused on response distillation while improving data augmentation strategies, patch-level transformations, or training signal design to enhance the restoration capability of lightweight students. For example, AugKD [9] improves distillation effectiveness through data augmentation to increase input diversity, while MiPKD [22] refines distillation signals via feature-level patch mixing. In addition, contrastive variants such as DCKD [11] aim to alleviate inefficiencies in output-space alignment and concentrate distillation on informative regions.

On the other hand, distillation for accelerating diffusion-based SR models has been actively studied by approximating multi-step denoising with single-step or few-step inference. OSEDiff [23] proposes a learning framework for one-step diffusion, while TAD-SR [24] introduces time-aware distillation. More recently, DOVE [25] explores a distillation-free one-step strategy. These approaches aim to reduce inference cost while preserving the strong perceptual quality of diffusion models. However, in most cases, the student architecture remains within the diffusion paradigm, thereby retaining substantial computational overhead compared to transformer-based SR models that operate with a single forward pass.

In contrast to these two research directions, our work focuses on heterogeneous distillation, in which the perceptual prior learned by a diffusion teacher is transferred to a transformer student. By combining diffusion-based perceptual strengths with the structural efficiency of single-pass transformer inference, the proposed framework differs from existing KD approaches in SR.

3. Methods

3.1. Preliminary

Given a low-resolution (LR) image $I^{LR} \in \mathbb{R}^{H \times W \times C}$, a single-image super-resolution (SISR) model $F(\cdot; \theta)$ generates a super-resolved output as follows:

$$I^{SR} = F(I^{LR}; \theta) \in \mathbb{R}^{sH \times sW \times C}, \tag{1}$$

where $H$ and $W$ denote the height and width of the LR image, $C$ is the number of channels, $s$ represents the upscaling factor, and $\theta$ denotes the learnable model parameters.

Let the teacher and student models be denoted as $F^{T}(\cdot; \theta^{T})$ and $F^{S}(\cdot; \theta^{S})$, respectively. For the same LR input $I^{LR}$, the super-resolved outputs of the two models are defined as follows:

$$I_{T}^{SR} = F^{T}(I^{LR}; \theta^{T}), \quad I_{S}^{SR} = F^{S}(I^{LR}; \theta^{S}). \tag{2}$$

In super-resolution, the most commonly adopted distillation setup is response distillation, where the student model is trained not only to accurately reconstruct the ground truth $I^{HR}$, but also to mimic the teacher output $I_{T}^{SR}$. First, the reconstruction loss (task loss) between the ground truth and the student output is typically defined using the $L_{1}$ distance:

$$L_{task} = \lVert I^{HR} - I_{S}^{SR} \rVert_{1}. \tag{3}$$

Next, the basic distillation loss (KD loss), which aims to reduce the discrepancy between the teacher and student outputs in the pixel domain, can likewise be formulated using the $L_{1}$ distance:

$$L_{KD} = \lVert I_{T}^{SR} - I_{S}^{SR} \rVert_{1}. \tag{4}$$

Therefore, the overall distillation objective in conventional SR settings can be formulated as a weighted combination of the task loss and the KD loss:

$$L = L_{task} + \lambda_{KD} L_{KD}, \tag{5}$$

where $\lambda_{KD}$ is a hyperparameter that controls the influence of the teacher signal. Building upon this basic formulation, many SR distillation methods extend the objective by incorporating additional components, such as feature-level distillation [22], multi-teacher learning [10], and contrastive objectives [11].
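The conventional response-distillation objective of Equations (3)–(5) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: images are plain nested lists, the $L_{1}$ norm is averaged over pixels (a common convention that the equations above do not specify), and the value `lambda_kd=0.5` and the tiny 2×2 images are hypothetical.

```python
def l1_distance(image_a, image_b):
    """Mean absolute difference between two equally sized images (nested lists)."""
    diffs = [
        abs(a - b)
        for row_a, row_b in zip(image_a, image_b)
        for a, b in zip(row_a, row_b)
    ]
    return sum(diffs) / len(diffs)


def response_distillation_loss(student_sr, teacher_sr, hr, lambda_kd):
    """Task loss plus weighted pixel-domain KD loss, following Eqs. (3)-(5)."""
    task = l1_distance(hr, student_sr)        # Eq. (3): student vs. ground truth
    kd = l1_distance(teacher_sr, student_sr)  # Eq. (4): student vs. teacher
    return task + lambda_kd * kd              # Eq. (5)


# Hypothetical 2x2 single-channel images, for illustration only.
hr = [[1.0, 2.0], [3.0, 4.0]]
teacher_sr = [[1.0, 2.0], [3.0, 5.0]]
student_sr = [[1.0, 2.0], [3.0, 3.0]]

loss = response_distillation_loss(student_sr, teacher_sr, hr, lambda_kd=0.5)
print(loss)
```

Here the student is pulled toward the ground truth by the task term and toward the teacher by the KD term, with `lambda_kd` setting the relative strength of the two.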

3.2. Problem Formulation

Recent diffusion-based SR models achieve high perceptual quality through iterative denoising processes. However, the multi-step denoising procedure requires substantial computational and memory resources, and the repeated sampling steps impose significant practical limitations in terms of inference speed.

To reduce this cost, prior works have reformulated the diffusion denoising process as a KD problem, accelerating inference via one-step [8] or few-step [7] approximations while largely preserving perceptual quality. Nevertheless, such approaches still incur higher computational overhead and slower inference compared to transformer-based SR models. Moreover, even when the diffusion denoiser (e.g., U-Net) is reduced in size, it tends to remain less efficient and less favorable in terms of speed–performance trade-offs compared to transformer-based architectures.

These observations indicate that both homogeneous KD within transformer-based SR and step-reduction KD within diffusion-based SR exhibit limitations when targeting lightweight or real-time deployment scenarios.

Therefore, as illustrated in Figure 1, we propose a heterogeneous KD framework that retains a perceptually strong diffusion teacher while adopting a transformer as the student model. The objective is to effectively transfer the perceptual prior of the diffusion teacher to the student while constraining inference cost to that of a single-pass transformer. During training, both the diffusion teacher output and the GT image are utilized to guide the student, aiming to alleviate the perception–distortion trade-off. During inference, however, only the transformer student is employed, producing high-quality results through a single forward pass and thereby significantly reducing inference latency.

3.3. Why Transformer as Student?

The objective of this work is to transfer the high perceptual quality and rich texture reconstruction capability of a diffusion-based teacher into an efficient student model that operates with a single forward pass. To this end, we adopt a transformer-based architecture as the student backbone.

This choice is motivated not only by the strong distortion-oriented performance of transformer-based SR models but also by several characteristics of the SR problem. First, super-resolution requires modeling long-range dependencies and maintaining structural consistency across the entire image, where global contextual reasoning plays a critical role. Second, the texture knowledge generated by diffusion teachers often exhibits directional and repetitive patterns that benefit from flexible relational modeling. Third, practical deployment scenarios impose strict latency and resource constraints, favoring architectures that provide high computational efficiency during inference.

In this section, we summarize the advantages of employing a transformer student from three perspectives: global context modeling capability, the representational flexibility of multi-head attention, and inference efficiency.

3.3.1. Global Receptive Field

Super-resolution is not merely a local interpolation problem from LR to HR, but requires generating plausible missing details while preserving global structural consistency across the image. Due to the ill-posed nature of SR, multiple HR solutions may correspond to the same LR observation. Without sufficiently modeling global structures—such as long-range edge continuity, periodic patterns, or symmetry—locally plausible textures may lead to globally inconsistent results, including structural distortions, phase misalignment in repetitive patterns, broken edges, or warped lines. Such issues become more pronounced in real-world SR scenarios, where degradations are complex and observable information is limited.

To effectively mimic the diffusion teacher’s outputs—which combine globally coherent structures with locally convincing textures—the student model must actively exploit global contextual information. Transformer-based students are well suited for this purpose, as self-attention directly models long-range dependencies among input tokens. This enables the preservation of global structural consistency while reconstructing fine local details.

3.3.2. Multi-Head Attention

The multi-head attention mechanism in transformers allows multiple attention heads to learn different interaction patterns in parallel, enriching the model’s representational capacity. In SR, important visual cues are diverse in nature, including directional edges and contours, repetitive textures and periodic patterns, as well as subtle noise and material-specific details.

Through multiple attention heads, the model can form distinct reference regions and interaction patterns within the input, offering greater flexibility than a single attention mechanism that treats all cues uniformly. This parallel modeling of heterogeneous visual relationships provides a potential advantage in capturing complex structures.
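The idea that separate heads can form distinct reference regions can be illustrated with a stripped-down multi-head self-attention. The sketch below makes strong simplifying assumptions that are not part of the paper: the query, key, and value projections are identities, there is no output projection, and attention is computed over all tokens rather than within windows as in SwinIR-style models. The three 4-dimensional tokens are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        row = softmax(scores)  # how strongly this token attends to every token
        weights.append(row)
        outputs.append(
            [sum(w * value[c] for w, value in zip(row, tokens)) for c in range(dim)]
        )
    return outputs, weights


def multi_head_attention(tokens, num_heads):
    """Split channels across heads, attend per head, concatenate the results."""
    dim = len(tokens[0])
    if dim % num_heads != 0:
        raise ValueError("token dimension must be divisible by num_heads")
    head_dim = dim // num_heads
    merged = [[] for _ in tokens]
    all_weights = []
    for h in range(num_heads):
        chunk = [t[h * head_dim:(h + 1) * head_dim] for t in tokens]
        outputs, weights = attention_head(chunk)
        all_weights.append(weights)
        for i, out in enumerate(outputs):
            merged[i].extend(out)
    return merged, all_weights


# Hypothetical tokens: channels 0-1 go to head 0, channels 2-3 to head 1.
tokens = [[1.0, 0.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
merged, weights = multi_head_attention(tokens, num_heads=2)
print(weights[0][0])  # head 0, token 0: favors token 1 over token 2
print(weights[1][0])  # head 1, token 0: favors token 2 over token 1
```

For the same query token, the two heads prefer different partners, which is the kind of parallel, cue-specific relational modeling the paragraph above refers to.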

Meanwhile, diffusion-based teachers generate perceptually natural textures through progressive denoising. These textures are not merely amplified high-frequency signals, but often include directional, repetitive, and multi-scale correlation structures. From this perspective, the flexibility of multi-head attention may provide a suitable representational space for the student to absorb and approximate the structural characteristics of textures produced by the diffusion teacher. In this work, we hypothesize that multi-head attention plays a supportive role in facilitating the transfer of such perceptual texture knowledge.

3.3.3. Inference Time

As discussed earlier, transformers are practical candidates for student backbones when targeting improved inference efficiency compared to diffusion models. Diffusion teachers typically require iterative denoising steps, with U-Net-based computation performed at each step. Consequently, inference cost accumulates proportionally to the number of sampling steps. Even when the underlying network architecture remains unchanged, repeated execution dominates overall latency. Increasing the number of denoising steps to improve perceptual quality further amplifies this computational burden.

In contrast, transformer-based SR models usually perform restoration in a single forward pass. Although the per-pass computation may be substantial, the absence of iterative sampling provides a structural advantage in terms of latency. Single-pass inference also yields more predictable execution time, which simplifies system design and resource allocation in real-time or large-scale deployment scenarios.

Moreover, transformer operations are largely composed of matrix multiplications, making them amenable to hardware acceleration and parallelization. Practical optimization techniques, such as mixed-precision computation and kernel-level acceleration, can be readily applied. Therefore, if the goal is to preserve the perceptual strength of diffusion models while enabling practically deployable inference speed, compressing the teacher knowledge into a single-pass student architecture is essential. From this perspective, adopting a transformer student is justified not only in terms of restoration quality but also under time and resource constraints.

3.4. Loss Function Design

This section describes the design of our loss function, which aims to effectively transfer texture and high-frequency details from the diffusion teacher to the transformer student. In standard super-resolution training, the model is optimized by minimizing the pixel-wise difference between the generated image and the GT image, typically using an $L_{1}$ loss. Following this formulation, most knowledge distillation frameworks in SR also begin with a pixel-domain alignment, where an $L_{1}$ loss is applied between the teacher output and the student output.

However, directly applying an $L_{1}$-based KD loss is not well suited to our heterogeneous framework. As illustrated in Figure 2, using an $L_{1}$ loss for distillation tends to alter the overall color tone of the image and fails to adequately recover edge and fine structural information. This limitation arises from the fundamental differences in image generation mechanisms: diffusion models generate outputs through a stochastic denoising process, whereas transformer-based SR models perform deterministic pixel-level restoration. The resulting representational discrepancy between teacher and student makes simple pixel-wise alignment less effective for transferring perceptual texture knowledge.

To address this issue, we introduce a frequency-based distillation loss that preserves global structures while better capturing textures and fine details generated by the diffusion teacher. In super-resolution, it is important not only to distinguish which frequency components are lost, but also to consider where these components occur spatially.

Fourier-based representations effectively analyze global frequency distributions, but lack spatial localization, making them less suitable for directly modeling localized structures such as edges, corners, and repetitive textures. In contrast, wavelet transforms provide both spatial localization and frequency decomposition, making them better suited for modeling region-specific high-frequency characteristics in SR.
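A one-dimensional toy example illustrates the localization argument. The sketch assumes a Haar wavelet (the wavelet family is not stated in this section) and uses a hypothetical length-8 signal containing a single impulse. The Fourier magnitude spectrum is identical wherever the impulse sits, since position is carried only by the phase of globally supported basis functions, whereas the Haar detail coefficients are non-zero only at the impulse location.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude of each discrete Fourier coefficient, rounded for comparison."""
    n = len(signal)
    magnitudes = []
    for k in range(n):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal)
        )
        magnitudes.append(round(abs(coeff), 6))
    return magnitudes


def haar_details(signal):
    """One-level orthonormal Haar detail coefficients of an even-length signal."""
    return [
        (signal[i] - signal[i + 1]) / math.sqrt(2) for i in range(0, len(signal), 2)
    ]


def impulse(position, length=8):
    """Signal that is 1 at `position` and 0 elsewhere."""
    return [1.0 if t == position else 0.0 for t in range(length)]


early, late = impulse(2), impulse(6)

# The magnitude spectrum does not reveal where the impulse is.
print(dft_magnitudes(early) == dft_magnitudes(late))

# The Haar details are non-zero only in the pair that contains the impulse.
print(haar_details(early))
print(haar_details(late))
```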

This property aligns well with the characteristics of diffusion-based teacher outputs. The texture information generated by diffusion models can be interpreted not merely as global frequency amplification, but as locally structured and directionally correlated high-frequency residuals formed through a probabilistic restoration process. The discrete wavelet transform (DWT) preserves global structural information through the LL component while separating directional details through the LH, HL, and HH components. Therefore, the DWT provides a structured representation that facilitates more effective transfer of perceptual textures from the diffusion teacher to the student.

Furthermore, the decomposition depth is closely related to the trade-off between representational granularity and training stability. A single-level decomposition may group high-frequency components too coarsely, failing to sufficiently separate intermediate details from fine textures. On the other hand, deeper decompositions (e.g., three levels or more) provide finer frequency separation but reduce the spatial resolution of each sub-band and increase the complexity of loss design, which may adversely affect restoration performance.

Considering the balance between frequency granularity, spatial interpretability, and training stability, we adopt a two-level wavelet decomposition in this work, which splits the image into seven sub-bands: LL2, LH2, HL2, HH2, LH, HL, and HH. These sub-bands are further organized into three frequency groups according to their structural characteristics, corresponding to coarse structures, intermediate details, and fine textures. Group-wise weighted distillation is then performed based on this categorization.

The LL2 component represents the overall low-frequency structure of the image. The intermediate detail group consists of the second-level high-frequency components LH2, HL2, HH2, obtained by further decomposing the first-level low-frequency band. Finally, the first-level high-frequency bands LH, HL, HH capture fine-grained textures and directional details.

\begin{aligned} W_{LF}(I) &:= W_{LL}^{(2)}(I) \; (= LL2), \\ W_{MF}(I) &:= [W_{LH}^{(2)}(I), W_{HL}^{(2)}(I), W_{HH}^{(2)}(I)], \\ W_{HF}(I) &:= [W_{LH}^{(1)}(I), W_{HL}^{(1)}(I), W_{HH}^{(1)}(I)]. \end{aligned}

(6)
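As a rough illustration of the grouping in Equation (6), the following sketch applies two levels of a 2D Haar transform with NumPy. The Haar wavelet, the function names `haar_dwt2` and `frequency_groups`, and the single-channel input are assumptions made for illustration; the text does not state which wavelet family or implementation is used, and the LH/HL naming convention differs between libraries.

```python
import numpy as np


def haar_dwt2(x):
    """One level of an orthonormal 2D Haar DWT (assumed wavelet; even sides required)."""
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # averaged along both axes
    lh = (a + b - c - d) / 2.0  # averaged along width, differenced along height
    hl = (a - b + c - d) / 2.0  # differenced along width, averaged along height
    hh = (a - b - c + d) / 2.0  # differenced along both axes
    return ll, lh, hl, hh


def frequency_groups(img):
    """Group the seven two-level sub-bands into LF, MF, and HF as in Eq. (6)."""
    ll1, lh1, hl1, hh1 = haar_dwt2(img)
    ll2, lh2, hl2, hh2 = haar_dwt2(ll1)  # second level decomposes the first-level LL
    return {
        "LF": [ll2],
        "MF": [lh2, hl2, hh2],
        "HF": [lh1, hl1, hh1],
    }


# Toy 8x8 single-channel input (hypothetical data).
groups = frequency_groups(np.arange(64, dtype=np.float64).reshape(8, 8))
for name, bands in groups.items():
    print(name, [band.shape for band in bands])
```

In practice a wavelet library would typically replace the hand-written transform; the explicit version is shown only to make the sub-band structure visible.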

After grouping the sub-bands into three frequency groups, we analyze the discrepancy between the teacher and student outputs within each group, as illustrated in Figure 3 and Figure 4. Specifically, we measure the magnitude of differences in each frequency group to identify where the representational gap is most significant.

Our analysis reveals that the mid-frequency group, denoted as W_{MF} (corresponding to LH2, HL2, HH2), exhibits larger discrepancies compared to the coarse-structure and fine-texture groups. This suggests that intermediate structural details are not sufficiently captured by the student when trained with uniform weighting.

Based on this observation, we assign a higher loss weight to the mid-frequency group in our distillation objective. By emphasizing the frequency components with larger discrepancies, the proposed loss encourages stronger alignment in these regions, facilitating more balanced knowledge transfer across the frequency spectrum.

The resulting distillation loss is formulated as follows:

L_{DWT} = \sum_{g \in \{LF, MF, HF\}} γ_{g} \frac{1}{N_{g}} \left\| W_{g}(I_{T}^{SR}) - W_{g}(I_{S}^{SR}) \right\|_{1} .

(7)

The complete training objective is formulated as

L_{total} = L_{task} + λ_{KD} L_{DWT}

(8)
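A minimal sketch of the distillation term in Equation (7) is given below. It assumes that N_{g} is the number of coefficients in group g (the text does not define it explicitly) and that the sub-bands have already been grouped into dictionaries keyed by `"LF"`, `"MF"`, and `"HF"`. The function name `grouped_l1_loss` and the 1:2:1 weights in the usage example are illustrative; the ratio mirrors the setting in which the MF group receives twice the weight of the others, while the absolute scale is an assumption.

```python
import numpy as np


def grouped_l1_loss(teacher_groups, student_groups, weights):
    """Weighted sum over groups of the size-normalized L1 distance, following Eq. (7).

    Assumption: N_g is taken to be the total number of coefficients in group g.
    """
    total = 0.0
    for name, gamma in weights.items():
        t = np.concatenate([band.ravel() for band in teacher_groups[name]])
        s = np.concatenate([band.ravel() for band in student_groups[name]])
        total += gamma * np.abs(t - s).sum() / t.size  # gamma_g * (1 / N_g) * ||.||_1
    return float(total)


# Toy sub-bands (hypothetical data): teacher all zeros, student all ones.
shapes = {"LF": [(2, 2)], "MF": [(2, 2)] * 3, "HF": [(4, 4)] * 3}
teacher = {k: [np.zeros(s) for s in v] for k, v in shapes.items()}
student = {k: [np.ones(s) for s in v] for k, v in shapes.items()}
weights = {"LF": 1.0, "MF": 2.0, "HF": 1.0}  # MF weighted twice as much (assumed scale)
print(grouped_l1_loss(teacher, student, weights))
```

Because each group is normalized by its own size, the weights γ_{g} control the relative emphasis directly rather than being dominated by the larger first-level bands.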

Our KD framework adopts output-level distillation only and does not employ feature-level distillation. This design choice is motivated by the fundamental differences between the intermediate representations of diffusion-based teacher models and transformer-based student models, which may render feature-level supervision unstable.

In diffusion models, intermediate features are typically defined in a latent space and are conditioned on both the diffusion timestep and the input condition. Even for the same input, the semantic meaning and statistical properties of intermediate features evolve throughout the reverse diffusion process. As a result, these representations can be interpreted as dynamic, stochastic features reflecting the underlying generative prior.

In contrast, intermediate features in transformer-based student models are deterministic feed-forward representations computed sequentially across layers for a given input. Unlike diffusion models, their feature distributions do not vary with timestep-dependent stochastic sampling processes.

Due to this structural discrepancy, establishing a direct one-to-one correspondence between teacher and student intermediate features is nontrivial. Naïve feature matching would simultaneously enforce alignment across mismatched representation spaces and heterogeneous feature distributions. This can introduce high-variance gradients and potentially destabilize optimization throughout the distillation process, particularly during early training stages.

By comparison, output-level distillation operates in the shared observation space of reconstructed images, where both teacher and student produce outputs in the same domain. This provides a well-defined and stable alignment target. Furthermore, when combined with frequency decomposition, output-level supervision allows controlled transfer of specific frequency components.

Therefore, to ensure stable and effective transfer of the diffusion teacher’s perceptual prior, we adopt an output-level distillation framework without feature-level supervision.

3.5. Training Process

At the early stage of training, the student model has not yet established stable structural reconstruction capability. Introducing a strong distillation signal at this point may conflict with the gradient induced by the GT-based reconstruction loss, potentially leading to unstable optimization.

To mitigate this issue, we adopt a strategy in which the distillation weight is linearly increased during training.

The objective of this strategy is to maintain an appropriate balance between the GT supervision and the teacher guidance, thereby enabling stable extraction of high-frequency information within the proposed distillation framework. In super-resolution, the ground-truth image contains the most reliable and detailed high-frequency information. While the diffusion teacher output also includes rich high-frequency components, it may inevitably contain noise or hallucinated textures due to its probabilistic generation process. This characteristic differs fundamentally from knowledge distillation in classification, where the teacher’s output distribution typically serves as a reliable supervisory signal.

Therefore, maintaining equal weighting between the GT loss and the distillation loss from the beginning of training may cause the student to be overly influenced by imperfect teacher outputs, potentially leading to suboptimal learning dynamics.

To address this issue, we gradually increase the weight of the KD loss during training. Specifically, training begins with the distillation weight set to zero, and the weight is linearly increased to a predefined final value over the course of optimization. This strategy can be expressed as follows:

λ_{KD}(t) = \begin{cases} 0, & t = 0, \\ \frac{t}{T} λ_{KD}, & 1 \leq t \leq T, \end{cases}

(9)

Here t denotes the current training iteration, T represents the total number of training iterations, and λ_{KD} denotes the predefined final distillation weight. Accordingly, the overall training objective can be expressed as follows:

L_{total} (t) = L_{task} + λ_{KD} (t) L_{DWT} .

(10)
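The schedule in Equation (9) and the objective in Equation (10) can be sketched in a few lines. The function names and the numeric values in the usage example (total iterations, final weight, loss values) are hypothetical placeholders, not settings reported in this paper.

```python
def kd_weight(step, total_steps, final_weight):
    """Linearly ramp the distillation weight from 0 at step 0 to final_weight at step T."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    return final_weight * step / total_steps


def total_objective(task_loss, dwt_loss, step, total_steps, final_weight):
    """Combine the task loss with the scheduled distillation loss, as in Eq. (10)."""
    return task_loss + kd_weight(step, total_steps, final_weight) * dwt_loss


# Hypothetical values: 100 iterations, final weight 0.5.
for step in (0, 50, 100):
    print(step, kd_weight(step, 100, 0.5))
```

At step 0 the objective reduces to the task loss alone, so early updates are driven purely by ground-truth supervision.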

As a result, during the early stage of training, the student model primarily focuses on learning the mapping to the ground truth, emphasizing restoration of the overall structure and low-frequency components. As training progresses and the distillation weight gradually increases, the student is increasingly guided by the high-frequency information generated by the diffusion teacher.

This progressive weighting strategy enables the student to reconstruct HR images from LR inputs without compromising global structural consistency, while gradually strengthening the emphasis on high-frequency and perceptual details. In other words, the linear increase of the distillation weight maintains stable distortion-oriented reconstruction through the task loss, while progressively enhancing perceptual influence via knowledge distillation.

4. Experiments

4.1. Experimental Settings

4.1.1. Datasets and Metrics

The transformer-based student model is trained on the standard synthetic dataset setting using the 800 training images from the DIV2K dataset [26], following common practice in SR research. However, since this work focuses on real-world SR performance, our evaluation protocol differs from conventional synthetic benchmark settings. Instead of the traditional benchmark datasets (Set5 [27], Set14 [28], BSD100 [29], and Urban100 [30]), we evaluate on RealSR [31] and the ImageNet-Test dataset [32], which better reflect realistic degradation scenarios.

In addition to PSNR, a distortion-oriented metric widely used in synthetic SR evaluation, we report perceptual metrics including LPIPS and MUSIQ. By jointly analyzing distortion and perceptual metrics under real-world settings, we assess whether the proposed framework improves the distortion–perception trade-off in terms of generalization performance.
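For reference, PSNR follows the standard definition 10 log10(MAX^2 / MSE). The sketch below assumes 8-bit intensity range and computes the error over all pixels and channels; SR evaluations often restrict the computation to the luminance channel or crop image borders, and the exact protocol used here is not stated, so those choices are assumptions.

```python
import numpy as np


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB (assumes both arrays share shape and range)."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


# Toy example (hypothetical data): every pixel is off by one gray level.
gt = np.zeros((4, 4))
print(psnr(gt, gt + 1.0))
```

LPIPS and MUSIQ, by contrast, are learned metrics that require pretrained networks and are therefore not sketched here.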

4.1.2. Teacher and Student Model

In the proposed KD framework, we adopt SwinIR-small [4] as the student model. This variant compresses the original SwinIR configuration (180 channels with six residual blocks) into a lightweight architecture with 60 channels and four residual blocks. This setup follows the commonly used configuration in homogeneous transformer-based KD studies for SR, ensuring a fair and practical lightweight student baseline.

As the teacher model, we employ ResShift [7], considering the characteristics of diffusion-based SR models. Diffusion approaches for SR have evolved along two main directions. One line of work retrains the diffusion process in a task-specific manner [33,34,35], where the low-resolution signal strongly intervenes throughout the diffusion trajectory, enabling consistent reconstruction at the cost of substantial training and inference overhead. The other line leverages pretrained diffusion priors and controls the reverse process through conditioning [36,37,38], which reduces computational cost but may struggle to maintain stable perceptual quality.

ResShift differs from conventional diffusion models by learning residual corrections rather than directly predicting denoising trajectories. In particular, noise injection during the reverse process is conditioned on the low-resolution image, resulting in a significantly narrower restoration search space compared to methods that start from pure Gaussian noise. This LR-conditioned noise design reduces unnecessary stochastic sampling and shortens the sampling process from hundreds of steps to approximately 15 steps while preserving strong high-frequency texture and structural restoration capability. Owing to these properties, the student model receives more stable and perceptually meaningful texture guidance during KD training.

4.2. Performance Evaluation

Table 1 presents the quantitative results of the proposed method. Since our framework employs output-level distillation only, we compare it with AugKD [9], which also performs output-level distillation, as well as with a baseline trained using only the standard L_{1} loss. All reported results are averaged over three independent runs.

On both real-world datasets, the proposed method achieves the best overall performance among the student models. Notably, although AugKD improves distortion-oriented metrics compared to the L_{1}-only baseline, it yields worse LPIPS than the standalone SwinIR-small model on both datasets and a lower MUSIQ on ImageNet-test (on RealSR its MUSIQ is only marginally higher, 29.817 vs. 29.262). This suggests that homogeneous pixel-level KD strategies, which are effective on synthetic benchmark datasets, do not necessarily translate to perceptual gains in real-world SR scenarios.

In contrast, the proposed method achieves consistent improvements across both distortion and perceptual metrics. On the RealSR dataset, it reduces LPIPS by 0.059 and improves MUSIQ by 16.969 compared to the baseline. Furthermore, it surpasses AugKD in PSNR while simultaneously improving perceptual scores, demonstrating a more favorable distortion–perception trade-off.

Table 2 and Table 3 compare the computational and quantitative performance of the diffusion teacher (ResShift), its one-step variant (ResShift-1 step), and the proposed transformer-based student (SwinIR-small). Due to its 15-step reverse diffusion process, ResShift incurs substantial computational cost and latency. Reducing the sampling process to a single step decreases FLOPs from 95.736 G to 48.562 G (approximately 2.0× reduction) and shortens inference time from 371.34 ms to 83.68 ms (approximately 4.4× speedup).

However, even with one-step sampling, ResShift-1 step remains computationally heavier than the transformer-based student, as it still relies on a large diffusion U-Net backbone. Compared to ResShift-1 step, SwinIR-small further reduces FLOPs from 48.562 G to 10.087 G (approximately 4.8× smaller) and decreases inference time from 83.68 ms to 19.09 ms (approximately 4.4× faster). In terms of quantitative performance (Table 3), ResShift-1 step achieves a MUSIQ score that is 8.685 higher than the distilled SwinIR-small student, while LPIPS remains comparable (0.380 vs. 0.381). However, its PSNR is 1.06 dB lower. These results indicate that reducing diffusion sampling steps alone does not necessarily provide an optimal balance between efficiency and distortion performance.

Overall, Table 2 and Table 3 demonstrate that the transformer-based student offers a more favorable trade-off between computational efficiency and reconstruction quality compared to step-reduced diffusion variants.
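The efficiency ratios quoted above follow directly from Table 2. A short check, using only the tabulated FLOPs and latency values (the dictionary layout and the helper name `speedup` are illustrative):

```python
# (FLOPs in G, inference time in ms), copied from Table 2.
models = {
    "ResShift (15 steps)": (95.736, 371.34),
    "ResShift-1 step": (48.562, 83.68),
    "SwinIR-small": (10.087, 19.09),
}


def speedup(slow, fast):
    """Return (FLOPs ratio, latency ratio) of the slower model over the faster one."""
    flops_slow, time_slow = models[slow]
    flops_fast, time_fast = models[fast]
    return flops_slow / flops_fast, time_slow / time_fast


pairs = [
    ("ResShift (15 steps)", "ResShift-1 step"),
    ("ResShift-1 step", "SwinIR-small"),
    ("ResShift (15 steps)", "SwinIR-small"),
]
for slow, fast in pairs:
    flops_ratio, time_ratio = speedup(slow, fast)
    print(f"{slow} vs {fast}: {flops_ratio:.1f}x FLOPs, {time_ratio:.1f}x latency")
```

The last pair compares the full 15-step teacher with the student directly, which is the gap that distillation is meant to close at inference time.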

Figure 5 and Figure 6 present qualitative examples corresponding to the quantitative results in Table 1. In both cases, when distillation is performed using a pixel-wise L_{1} loss, noticeable color shifts are observed in the reconstructed images. In contrast, the proposed method better preserves overall color consistency while enhancing structural details that are not fully recovered by the standalone SwinIR-small model. Specifically, Figure 5 shows improved reconstruction of repetitive diagonal patterns that appear blurred in other methods. Similarly, Figure 6 illustrates clearer restoration of alphabet boundaries, which were previously indistinct.

These qualitative observations are consistent with the quantitative results, suggesting that the proposed framework achieves a more balanced distortion–perception trade-off in real-world SR scenarios.

4.3. Ablation Studies and Analysis

Table 4 presents the experimental results for different decomposition levels when equal weights are applied to all DWT sub-bands. When adopting the proposed two-level decomposition, PSNR improves by 0.02 dB and MUSIQ increases by 4.386 compared to the one-level setting. In contrast, deeper decompositions degrade performance despite involving a larger number of sub-bands: with three and four levels, PSNR falls to 27.20 dB and 26.51 dB, below the 27.64 dB of the standalone SwinIR-s, and MUSIQ falls to 40.323 and 39.275. This suggests that excessive decomposition may over-fragment frequency components, thereby limiting the effective transfer of essential structural and texture information required for restoration.

Table 5 compares the proposed DWT-based loss with other frequency-based loss formulations. The dual-tree complex wavelet transform (DTCWT) loss, which captures more fine-grained directional frequency information than standard DWT, improves on the SwinIR-s baseline in both PSNR (27.71 vs. 27.64 dB) and MUSIQ (40.125 vs. 29.262), but remains below the proposed DWT-based loss (27.88 dB, 46.231). The Fourier-based loss also raises MUSIQ (38.690) but lowers PSNR to 27.60 dB, below the baseline, indicating a less balanced trade-off between reconstruction fidelity and perceptual quality.

Table 6 and Table 7 present the experimental results under different weighting configurations of the proposed loss function. As shown in Figure 3 and Figure 4, the discrepancy between teacher and student outputs is largest in the mid-frequency (MF) band group among the three groups (LF, MF, HF).

Based on this observation, assigning a higher weight to the MF group than to the LF and HF groups leads to improved performance compared to uniform weighting. This tendency is consistently observed in the experimental results. The best performance is achieved when the MF group is assigned twice the weight of the other groups. This suggests that emphasizing frequency bands with larger teacher–student discrepancies is beneficial for more effective knowledge transfer.

In addition, adopting a linearly increasing schedule for the KD loss weight during training improves performance compared to using a fixed weight. Specifically, the linear scheduling strategy yields an improvement of 0.12 dB in PSNR and 0.727 in MUSIQ. These results indicate that prioritizing GT-based structural learning in early training and gradually strengthening the influence of the teacher’s high-frequency prior is more effective than maintaining a constant distillation weight throughout training.

However, excessively increasing the teacher influence eventually degrades performance. This implies that overemphasizing high-frequency components without adequately preserving global structural information can harm the distortion–perception balance.

In summary, the ablation results in Table 4, Table 5, Table 6 and Table 7 demonstrate that frequency-based distillation alone does not automatically guarantee performance improvement. Consistent gains are achieved only when a moderate decomposition level (two-level), discrepancy-aware non-uniform weighting (emphasizing MF), and progressive KD scheduling are jointly applied. These findings indicate that the effectiveness of frequency-based KD depends not merely on frequency-domain comparison itself, but on careful band-wise importance design and controlled distillation strength.

Finally, Table 8 reports the results on the synthetic benchmark Urban100 dataset. In synthetic settings, the degradation process is explicitly defined, and evaluation primarily emphasizes pixel-level alignment with the GT HR image. Under such conditions, the distortion-oriented student model achieves higher PSNR than the diffusion teacher.

In contrast, the diffusion teacher prioritizes perceptually plausible texture and high-frequency restoration based on its learned generative prior, rather than strictly enforcing exact pixel-wise alignment with the GT. Instead of converging to a single deterministic pixel solution, the diffusion model operates within a plausible restoration distribution conditioned on the LR input. Consequently, small pixel-wise discrepancies may accumulate in synthetic benchmarks, resulting in lower PSNR despite perceptually convincing outputs.

Accordingly, on Urban100, the student outperforms the teacher in distortion-oriented metrics, whereas the teacher achieves stronger perceptual performance. In this scenario, applying an L_{1}-based KD loss enforces strict pixel-level alignment between teacher and student, which can degrade performance in both distortion and perception.

In contrast, the proposed decomposed DWT loss achieves consistent, albeit modest, improvements even in the synthetic setting, with gains of 0.04 dB in PSNR and 0.263 in MUSIQ. These results suggest that the proposed method transfers the texture and high-frequency restoration characteristics of the diffusion teacher to the student without excessively enforcing pixel-level matching, even under synthetic benchmark conditions.

5. Conclusions

In this paper, we proposed DTKD, a diffusion-to-transformer heterogeneous knowledge distillation framework designed to preserve the superior perceptual quality of diffusion-based SR models while alleviating the high inference cost induced by multi-step sampling. By combining the strong texture restoration capability of a diffusion teacher with the single-forward efficiency of a transformer student, DTKD improves the perception–distortion trade-off under practical computational constraints.

Experimental results demonstrate that DTKD consistently enhances perceptual quality compared to a standalone transformer student, while maintaining transformer-level inference speed. These findings indicate that heterogeneous distillation provides an effective approach for transferring diffusion-based generative priors into efficient super-resolution models.

The proposed method adopts a fixed two-level DWT decomposition with predefined frequency-group weights. However, real-world degradations vary significantly in type and severity (e.g., blur, noise, compression artifacts), and the most informative frequency bands may differ depending on degradation characteristics. Therefore, a fixed weighting scheme may not always be optimal. Future work could explore adaptive strategies that dynamically adjust sub-band importance and distillation strength according to degradation conditions.

In addition, DTKD depends on the quality of the diffusion teacher output. Under extreme real-world degradations, the teacher may produce residual noise or unrealistic textures, which could be transferred to the student and potentially introduce artifacts. Future research may investigate confidence-aware distillation mechanisms that selectively suppress unreliable teacher signals, as well as alternative frequency representations beyond DWT to further enhance robustness and flexibility.

Author Contributions

Conceptualization, J.H.P. and B.C.S.; methodology, J.H.P. and B.C.S.; software, J.H.P.; validation, J.H.P.; formal analysis, J.H.P. and B.C.S.; investigation, J.H.P.; resources, B.C.S.; data curation, J.H.P.; writing—original draft preparation, J.H.P.; writing—review and editing, J.H.P. and B.C.S.; visualization, J.H.P. and B.C.S.; supervision, B.C.S.; project administration, B.C.S.; funding acquisition, B.C.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Inha University Research Grant.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a deep convolutional network for image super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2014; pp. 184–199.
Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 136–144.
Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 286–301.
Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 1833–1844.
Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690.
Moser, B.B.; Shanbhag, A.S.; Raue, F.; Frolov, S.; Palacio, S.; Dengel, A. Diffusion models, image super-resolution, and everything: A survey. IEEE Trans. Neural Netw. Learn. Syst. (TNNLS) 2024, 36, 11793–11813.
Yue, Z.; Wang, J.; Loy, C.C. ResShift: Efficient diffusion model for image super-resolution by residual shifting. Adv. Neural Inf. Process. Syst. (NeurIPS) 2023, 36, 13294–13307.
Wang, Y.; Yang, W.; Chen, X.; Wang, Y.; Guo, L.; Chau, L.P.; Liu, Z.; Qiao, Y.; Kot, A.C.; Wen, B. SinSR: Diffusion-based image super-resolution in a single step. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 25796–25805.
Zhang, Y.; Li, W.; Li, S.; Chen, H.; Tu, Z.; Jing, B.; Lin, S.; Hu, J.; Wang, W. AugKD: Ingenious Augmentations Empower Knowledge Distillation for Image Super-Resolution. In Proceedings of the International Conference on Learning Representations (ICLR), Singapore, 24–28 April 2025.
Jiang, Y.; Feng, C.; Zhang, F.; Bull, D. MTKD: Multi-teacher knowledge distillation for image super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2024; pp. 364–382.
Zhou, Y.; Qiao, J.; Liao, J.; Li, W.; Li, S.; Xie, J.; Shen, Y.; Hu, J.; Lin, S. Dynamic Contrastive Knowledge Distillation for Efficient Image Restoration. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 10861–10869.
Dong, C.; Loy, C.C.; Tang, X. Accelerating the super-resolution convolutional neural network. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2016; pp. 391–407.
Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654.
Chen, X.; Wang, X.; Zhou, J.; Qiao, Y.; Dong, C. Activating more pixels in image super-resolution transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 22367–22377.
Zhang, X.; Zhang, Y.; Yu, F. HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024.
Long, W.; Zhou, X.; Zhang, L.; Gu, S. Progressive Focused Transformer for Single Image Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 2279–2288.
Zhang, L.; Li, Y.; Zhou, X.; Zhao, X.; Gu, S. Transcending the limit of local window: Advanced super-resolution transformer with adaptive token dictionary. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 2856–2865.
Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; Change Loy, C. ESRGAN: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision Workshops (ECCVW), Munich, Germany, 8–14 September 2018.
Wang, X.; Xie, L.; Dong, C.; Shan, Y. Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 1905–1914.
Wu, R.; Yang, T.; Sun, L.; Zhang, Z.; Li, S.; Zhang, L. SeeSR: Towards semantics-aware real-world image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 25456–25467.
Chen, J.; Pan, J.; Dong, J. FaithDiff: Unleashing diffusion priors for faithful image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 28188–28197.
Li, S.; Zhang, Y.; Li, W.; Chen, H.; Wang, W.; Jing, B.; Lin, S.; Hu, J. Knowledge distillation with multi-granularity mixture of priors for image super-resolution. In Proceedings of the International Conference on Learning Representations (ICLR), Singapore, 24–28 April 2025; pp. 27216–27232.
Wu, R.; Sun, L.; Ma, Z.; Zhang, L. One-step effective diffusion network for real-world image super-resolution. Adv. Neural Inf. Process. Syst. (NeurIPS) 2024, 37, 92529–92553.
He, X.; Tang, H.; Tu, Z.; Zhang, J.; Cheng, K.; Chen, H.; Guo, Y.; Zhu, M.; Hu, J.; Wang, N.; et al. One step diffusion-based super-resolution with time-aware distillation. IEEE Trans. Image Process. 2026, 35, 2928–2940.
Chen, Z.; Zou, Z.; Zhang, K.; Su, X.; Yuan, X.; Guo, Y.; Zhang, Y. DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution. arXiv 2025, arXiv:2505.16239.
Agustsson, E.; Timofte, R. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 126–135.
Bevilacqua, M.; Roumy, A.; Guillemot, C.; Alberi-Morel, M.L. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proceedings of the 23rd British Machine Vision Conference (BMVC), Surrey, UK, 3–7 September 2012.
Zeyde, R.; Elad, M.; Protter, M. On single image scale-up using sparse-representations. In Proceedings of the International Conference on Curves and Surfaces; Springer: Berlin/Heidelberg, Germany, 2010; pp. 711–730.
Martin, D.; Fowlkes, C.; Tal, D.; Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the IEEE International Conference on Computer Vision (ICCV); IEEE: Piscataway, NJ, USA, 2001; Volume 2, pp. 416–423.
Huang, J.B.; Singh, A.; Ahuja, N. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 5197–5206.
Cai, J.; Zeng, H.; Yong, H.; Cao, Z.; Zhang, L. Toward real-world single image super-resolution: A new benchmark and a new model. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27–28 October 2019; pp. 3086–3095.
Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2009; pp. 248–255.
Saharia, C.; Ho, J.; Chan, W.; Salimans, T.; Fleet, D.J.; Norouzi, M. Image super-resolution via iterative refinement. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 2022, 45, 4713–4726.
Sahak, H.; Watson, D.; Saharia, C.; Fleet, D. Denoising Diffusion Probabilistic Models for Robust Image Super-Resolution in the Wild. arXiv 2023, arXiv:2302.07864.
Gao, S.; Liu, X.; Zeng, B.; Xu, S.; Li, Y.; Luo, X.; Liu, J.; Zhen, X.; Zhang, B. Implicit diffusion models for continuous superresolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 10021–10030.
Kawar, B.; Elad, M.; Ermon, S.; Song, J. Denoising diffusion restoration models. Adv. Neural Inf. Process. Syst. (NeurIPS) 2022, 35, 23593–23606.
Wang, Y.; Yu, J.; Zhang, J. Zero-shot image restoration using denoising diffusion null-space model. arXiv 2022, arXiv:2212.00490.
Wang, J.; Yue, Z.; Zhou, S.; Chan, K.C.; Loy, C.C. Exploiting diffusion prior for real-world image super-resolution. Int. J. Comput. Vis. (IJCV) 2024, 132, 5929–5949.

Figure 1. Overview of the proposed DTKD framework. We conduct KD with DWT loss on the output level.

Figure 2. Qualitative results of knowledge distillation using the L_{1} loss. A noticeable shift in image color tones can be observed.

Figure 3. Analysis of the per-band-group differences between the teacher and student output images, example 1.

Figure 4. Analysis of the per-band-group differences between the teacher and student output images, example 2.

Figure 5. Qualitative comparison on the RealSR dataset. The proposed DWT-based distillation better preserves color consistency and restores fine repetitive diagonal structures compared to pixel-wise L1 distillation, which exhibits noticeable color shifts. Zoomed regions (red box) highlight improved texture sharpness and structural fidelity.

Figure 6. Qualitative comparison on the ImageNet-test dataset. The proposed method produces clearer boundary reconstruction and sharper high-frequency details than the baseline SwinIR-small and L1-based KD. The improvements are consistent with the quantitative gains in MUSIQ and LPIPS reported in Table 1. The red boxes indicate the zoomed regions.

Table 1. Quantitative results of the proposed method and other output-level distillation methods on the RealSR and ImageNet-test datasets. Boldface indicates the best results and underlining indicates the second-best results.

Dataset	Model	KD Loss	PSNR↑	LPIPS↓	MUSIQ↑
RealSR	Teacher	–	27.75	0.365	61.563
	SwinIR-s	–	27.64	0.440	29.262
	Student	$L_{1}$	27.66	0.400	33.123
		AugKD	27.84	0.453	29.817
		DWT (Ours)	27.88 ± 0.02	0.381 ± 0.003	46.231 ± 0.233
ImageNet-test	Teacher	–	29.60	0.275	50.799
	SwinIR-s	–	28.70	0.389	39.882
	Student	$L_{1}$	28.12	0.394	39.618
		AugKD	28.79	0.405	34.278
		DWT (Ours)	28.81 ± 0.01	0.342 ± 0.011	40.590 ± 0.178

Table 2. Comparison of model complexity and inference latency on the RealSR dataset.

Model	Sampling Step	Params	FLOPs	Inference Time
Teacher	15	118.59 M	95.736 G	371.34 ms
ResShift-1 step	1	118.59 M	48.562 G	83.68 ms
SwinIR-small	-	0.930 M	10.087 G	19.09 ms
Student w/KD	-	0.930 M	10.087 G	19.09 ms
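
The relative cost implied by Table 2 can be worked out directly from its rows. The sketch below only divides the tabulated values by those of the distilled student; the dictionary layout and the helper name `ratios_vs_student` are illustrative choices, not part of the original work.

```python
# Values copied from Table 2 (RealSR): parameters in M, FLOPs in G, latency in ms.
TABLE_2 = {
    "Teacher (15 steps)": {"params_m": 118.59, "flops_g": 95.736, "latency_ms": 371.34},
    "ResShift-1 step": {"params_m": 118.59, "flops_g": 48.562, "latency_ms": 83.68},
    "Student w/KD": {"params_m": 0.930, "flops_g": 10.087, "latency_ms": 19.09},
}


def ratios_vs_student(table, student="Student w/KD"):
    """Return, for every other model, how many times larger each cost is than the student's."""
    base = table[student]
    return {
        name: {key: row[key] / base[key] for key in base}
        for name, row in table.items()
        if name != student
    }


for name, ratio in ratios_vs_student(TABLE_2).items():
    print(
        f"{name}: {ratio['params_m']:.1f}x params, "
        f"{ratio['flops_g']:.1f}x FLOPs, {ratio['latency_ms']:.1f}x latency"
    )
```

Read this way, the student runs at roughly 19 times lower latency than the 15-step teacher and about 4.4 times lower latency than the one-step ResShift variant, with about 128 times fewer parameters than either.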

Table 3. Performance comparison between ResShift-1 step and the proposed method on the RealSR dataset. Here, bold indicates the best performance.

Dataset	Method	PSNR↑	LPIPS↓	MUSIQ↑
RealSR	ResShift-1 step	26.82	0.380	54.916
RealSR	DWT (Ours)	27.88	0.381	46.231

Table 4. Ablation study of the decomposition level of our DWT loss. Boldface indicates the best results and underlining indicates the second-best results.

Dataset	Model	DWT Level	PSNR↑	MUSIQ↑
RealSR	Teacher	–	27.75	61.563
	SwinIR-s	–	27.64	29.262
	Student	Level 1	27.80	41.152
		Level 2	27.82	45.538
		Level 3	27.20	40.323
		Level 4	26.51	39.275

Table 5. Comparison of frequency-based KD loss functions. Boldface indicates the best results and underlining indicates the second-best results.

Dataset	Model	KD	PSNR↑	MUSIQ↑
RealSR	Teacher	–	27.75	61.563
	SwinIR-s	–	27.64	29.262
	Student	DWT (Ours)	27.88	46.231
		Fourier	27.60	38.690
		DTCWT	27.71	40.125

Table 6. Ablation study of the band-group weights (LF:MF:HF). Boldface indicates the best results and underlining indicates the second-best results.

Dataset	Model	LF:MF:HF	PSNR↑	MUSIQ↑
RealSR	Teacher	–	27.75	61.563
	SwinIR-s	–	27.64	29.262
	Student	1:1:1	27.82	45.538
		2:3:2	27.83	45.583
		1:2:1	27.88	46.231
		2:5:2	27.82	45.561
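
The band-group ratios in Table 6 can be illustrated with a small sketch of a weighted two-level wavelet loss. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a Haar wavelet, single-channel images whose sides are divisible by 4, an L1 distance per sub-band, unnormalized weights, and a grouping in which LF is the level-2 approximation band, MF the level-2 detail bands, and HF the level-1 detail bands. The names `haar_dwt2` and `band_group_l1` are hypothetical.

```python
import numpy as np


def haar_dwt2(x):
    """One level of an orthonormal 2-D Haar DWT (height and width must be even)."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    approx = (a + b + c + d) / 2.0
    details = (
        (a - b + c - d) / 2.0,  # difference across columns
        (a + b - c - d) / 2.0,  # difference across rows
        (a - b - c + d) / 2.0,  # diagonal difference
    )
    return approx, details


def band_group_l1(student, teacher, weights=(1.0, 2.0, 1.0)):
    """Weighted L1 distance over LF/MF/HF groups of a two-level Haar DWT.

    Assumed grouping (not confirmed by the source): LF = level-2 approximation,
    MF = level-2 details, HF = level-1 details. Default weights follow the 1:2:1 row.
    """
    s_a1, s_d1 = haar_dwt2(np.asarray(student, dtype=np.float64))
    t_a1, t_d1 = haar_dwt2(np.asarray(teacher, dtype=np.float64))
    s_a2, s_d2 = haar_dwt2(s_a1)
    t_a2, t_d2 = haar_dwt2(t_a1)

    def l1(p, q):
        return float(np.mean(np.abs(p - q)))

    lf = l1(s_a2, t_a2)
    mf = sum(l1(p, q) for p, q in zip(s_d2, t_d2)) / 3.0
    hf = sum(l1(p, q) for p, q in zip(s_d1, t_d1)) / 3.0
    w_lf, w_mf, w_hf = weights
    return w_lf * lf + w_mf * mf + w_hf * hf


rng = np.random.default_rng(0)
teacher_out = rng.random((8, 8))
student_out = teacher_out + 0.05 * rng.standard_normal((8, 8))
print(band_group_l1(student_out, teacher_out))
```

With this grouping, a uniform brightness offset between the two outputs is penalized only through the LF term, while texture mismatches fall into the MF and HF terms, which is where the ratio in Table 6 changes the balance.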

Table 7. Ablation study of the linearly increasing KD weight schedule. Boldface indicates the best results and underlining indicates the second-best results.

Dataset	Model	Task Weight	KD Weight	PSNR↑	MUSIQ↑
RealSR	Teacher	–	–	27.75	61.563
	SwinIR-s	–	–	27.64	29.262
	Student	1	1	27.69	44.952
		1	0–0.5	27.80	45.538
		1	0–1	27.81	45.679
		1	0–2	27.88	46.231
		1	0–3	27.74	45.132
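
The schedules compared in Table 7 keep the task weight fixed at 1 and raise the KD weight linearly from 0 to a maximum value. A minimal sketch of such a schedule follows. It assumes the ramp is applied per training step and spans the whole run, which the table does not specify; the names `kd_weight` and `total_loss` are hypothetical.

```python
def kd_weight(step, total_steps, max_weight=2.0):
    """Linearly ramp the KD weight from 0 at the first step to max_weight at the last.

    Assumption: the ramp covers the entire training run, one update per step.
    """
    if total_steps <= 1:
        return max_weight
    fraction = min(max(step / (total_steps - 1), 0.0), 1.0)
    return max_weight * fraction


def total_loss(task_loss, kd_loss, step, total_steps, task_weight=1.0, max_kd_weight=2.0):
    """Combine the task loss and the KD loss with the scheduled KD weight."""
    return task_weight * task_loss + kd_weight(step, total_steps, max_kd_weight) * kd_loss


# Example: the "0–2" row of Table 7 over a hypothetical 101-step run.
for step in (0, 25, 50, 100):
    print(step, kd_weight(step, total_steps=101))
```

Setting `max_kd_weight` to 0.5, 1, 2, or 3 corresponds to the four scheduled rows of the table, while the first student row uses a constant KD weight of 1 instead of a ramp.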

Table 8. Quantitative results of the proposed method on the Urban100 dataset (synthetic benchmark). Bold indicates the best performance.

Dataset	Model	KD	PSNR↑	MUSIQ↑
Urban100	Teacher	–	23.65	71.203
	SwinIR-s	–	26.27	67.297
	Student	$L_{1}$	25.45	65.020
	Student	DWT (Ours)	26.31	67.560
