Kernel Adaptive Swin Transformer for Image Restoration

Zhen Ni; Jingyu Wang; Aniruddha Bhattacharjya; Le Yan

doi:10.3390/sym17122161

,

and

¹

School of Information Engineering, Nanjing XiaoZhuang University, Nanjing 211171, China

²

School of Information Engineering, Nanjing Audit University, Nanjing 211815, China

³

Department of Electronic Engineering, Tsinghua University, Beijing 100190, China

^*

Authors to whom correspondence should be addressed.

Symmetry2025, 17(12), 2161;https://doi.org/10.3390/sym17122161

This article belongs to the Section Computer

Version Notes

Order Reprints

Abstract

In this modern era, attention has been devoted to blind super-resolution design, which improves image restoration performance by combining self-attention networks and explicitly introducing degradation information. This paper proposes a novel model called Kernel Adaptive Swin Transformer (KAST) to address the ill-posedness in image super-resolution and the resulting irregular difficulties in restoration, including asymmetrical degradation problems. KAST introduces four key innovations: (1) local degradation-aware modeling, (2) parallel attention-based feature fusion, (3) log-space continuous position bias, and (4) comprehensive validation on diverse datasets. The model captures degraded information in different regions of low-resolution images, effectively encodes and distinguishes these degraded features using self-attention mechanisms, and accurately restores image details. The proposed approach innovatively integrates degraded features with image features through a parallel attention fusion strategy, enhancing the network’s ability to capture pixel relationships and achieving denoising, deblurring, and high-resolution image reconstruction. Experimental results demonstrate that our model performs well on multiple datasets, effectively verifying the effectiveness of the proposed method.

Keywords:

attention mechanism; blind super-resolution; denoising and deblurring; image restoration

1. Introduction

Single-image super-resolution is a fundamental task in low-level vision, aiming to reconstruct a high-resolution (HR) image from a single low-resolution (LR) image. It is widely used in cameras, movies, medical imaging, and video games. In recent years, deep learning-based methods [1,2,3,4] have achieved remarkable results by designing network structures for specific degraded paired HR–LR samples (e.g., bicubic). However, these methods usually fail when dealing with real-world images because there is a domain gap between synthesized training samples and real images. Real-world images are typically degraded by unknown blur kernels and noise, which differs from ideal degradation. This situation is usually called Blind Image Super-Resolution (BISR). To close the dataset domain gap, many works [5,6,7,8,9] have designed a degradation process to synthesize LR images, and then used the synthetic HR–LR pairs to train the model. The image degradation process can be explicitly modeled as:

{Y = (X \otimes K) ↓}_{s} + n

(1)

where X is the HR image, Y is the LR image, K is the blur kernel, ⊗ is the convolution operator,

↓_{s}

is the down-sampling operator with scaling factor s, and n is additive noise. For blind SR models, many alternative methods are available for solving these problems. Degradation features are prominent elements in solving this ill-posed problem because they contain spatial transformation information of the image. A performance comparison of the proposed method with SwinIR in ×4 blind super-resolution, with different blur widths (2, 3, 4), are shown in Figure 1.

Figure 1. Performance comparison of the proposed method with SwinIR in ×4 blind super-resolution, with different blur widths (2, 3, 4).

Modeling degradation features to help reconstruct the HR image is effective [10,11,12,13,14]. Zhang et al. have proposed SRMD [10] to handle multiple variations of blur and noise by projecting degradation features onto a space of dimension t and then stretching that into a tensor to add to the image feature map. Luo et al. [11] have adopted a dual-path group network structure to convolve image and degenerate features separately, using convolution to fuse features at the network’s end. Hui et al. have proposed AMNet [12] and applied a fully connected layer as a spatial feature transform to fuse two parts.

All these previous works have used complex strategies to fuse image features with degenerate features; most methods have considered only global degradation features and not local ones. Image features and image degradation features are entirely different semantic information. Directly adding or multiplying transformed features with the feature map is unreasonable. In the real world, the degradation process of images is very complex, including camera lens focus and CMOS performance limits, making degradation estimation complicated. Smooth regions in an LR image make this task challenging, since different blur kernels may produce similar smooth pixel values. Thus, accurately modeling the distribution of global blur kernels is difficult.

Here, we propose a novel SR framework for BISR, namely KAST, to investigate parallel processing of pixel-level degenerate features with the SR model. Based on the Swin Transformer, KAST introduces degraded attention to help the transformer network capture more relationships between pixels to reconstruct the HR image. KAST consists of four modules: (1) degradation estimation, (2) shallow feature extraction, (3) deep feature extraction, and (4) high-quality image reconstruction.The degradation estimation module learns local spatial degraded kernels and noise; shallow feature extraction is composed of residual connection blocks, preserving low-frequency information from LR images. Shallow features and degenerate features are transmitted to deep feature extraction, which mainly consists of modified Swin Transformer Blocks [15,16], containing Swin Transformer Layers and Kernel Fusion Layers (KFL) for local attention and cross-window interaction. KAST has used an attention approach to handle image features and degenerate features. Finally, for real-world SR, we have trained our model on synthetic datasets, referring to BSRNet [5] and KOALAnet [8].

The main contributions of this work are as follows:

We have proposed a multi-degrade super-resolution model called KAST by introducing degradation feature estimation, which has effectively encoded the image degradation by capturing degraded contextual information from different regions of the LR image to achieve better reconstruction performance.
Benefiting from the local self-attention mechanism and degradation estimation, we have introduced a new perspective to incorporate image degradation features into the SR model.
Extensive experiments on both paired training data and unpaired real-world data have demonstrated the effectiveness of KAST in image super-resolution.
We have introduced a log-space continuous position bias (Log-CPB) that has provided smoother and more fine-grained positional representations, enhancing the model’s ability to capture pixel relationships.

3. Proposed Method

3.1. Motivation

Swin Transformer [15] has shown an excellent performance in image super-resolution [16]. However, there are still limits for real-world images. By studying existing blind super-resolution models, we have found that introducing a degradation estimation module in the SR task is an effective way to model degradation; thus, how to fully use degradation features in the super-resolution process is a problem which is worthy of research. The essence of super-resolution is an ill-posed problem. Especially for smooth image patches, different blur kernels generate the same blur patches, making reconstruction hard. Most methods assume the image blur kernel is globally consistent and directly stitch degenerate features and image features to alleviate this problem, ignoring that different frequency parts in the image will have different results after blurring.

To solve the bottleneck of the Swin model, we have considered degradation to be globally inconsistent. To reach this conclusion, we have followed [8] to train a degradation estimation module.

As shown in Figure 2, the overall network is consist of four parts: degradation estimation, shallow feature extraction, deep feature extraction, and high-quality image reconstruction modules.

Figure 2. The overall architecture of KAST and the structure of KFTB.

The estimation result of the model is the mean of the full feature map; the intermediate feature map contains the degradation estimate result for each point. To visualize degradation features, we have used PCA to reduce the dimensionality of degenerated features and calculated cosine similarity with a fixed vector to get the final visualization result.

As shown in Figure 3, the pre-trained model can express the focal length relationship in the image clearly, referred to as different levels of blur. Furthermore, pre-trained models are focused more on structural information than content in images. During training, the estimation module only considers globally consistent degradation, but the predictive difficulty of smooth and texture-rich regions is not the same; this difficulty-level information helps connect pixels from the perspective of image degradation. Moreover, the feature itself contains local spatial transformation information. Based on this idea, we have proposed KAST for blind SR tasks. We have chosen Swin Transformer as the backbone network because, unlike convolution, self-attention easily processes features in parallel. Second, considering that explicitly introducing degenerate features during super-resolution may limit performance, we have introduced degradation features via pre-training.

Figure 3. After fine-tuning with the SR task, visualization results of degradation features in the SR model. The second line is not pre-trained with degradation; the third line is pre-trained with degradation.

3.2. Overall Pipeline

As shown in Figure 2, our proposed model consists of four parts: degradation estimation, shallow feature extraction, deep feature extraction, and high-quality image reconstruction modules. Specifically, for a given input

I_{L R} \in R^{H \times W \times C_{i n}}

, KAST first applies a U-shaped hierarchical network

H_{D E}

to extract degenerate features:

F_{D E} = H_{D E} (I_{L R}) .

(2)

where

F_{D E}

denotes pixel-level degenerate features. Then, deep feature extraction

H_{D F} (\cdot)

extracts deep features:

F_{D F} = H_{D F} (I_{L R}, F_{D E}) .

(3)

where

F_{D F} \in R^{H \times W \times C}

represents the fusion feature, and

H_{D F}

denotes deep feature extraction.

H_{D F} (\cdot)

contains N KFTBs; each KFTB contains two

1 \times 1

convolutional layers

H_{C o n v 1} (\cdot)

and one

3 \times 3

convolutional layer

H_{C o n v 3} (\cdot)

, progressively processing intermediate features as:

\begin{matrix} F_{0} & = H_{C o n v 3} (I_{L R}), \end{matrix}

(4)

\begin{matrix} F_{D E}^{'} & = H_{C o n v 1} (R e l u (H_{C o n v 1} (F_{D E}))), \end{matrix}

(5)

\begin{matrix} F_{i} & = H_{K F T B} (F_{i - 1}, F_{D E}^{'}), i = 1, 2, \dots, N, \end{matrix}

(6)

\begin{matrix} F_{D F} & = H_{C o n v 3} (F_{N}) . \end{matrix}

(7)

where

K F T B_{i} (\cdot)

represents the i-th KFTB in KAST, and

F_{i}

represents the output feature of the i-th KFTB. Here,

F_{N}

denotes the output of the final KFTB. To fuse

F_{D E}

with

F_{i}

,

F_{D E}

is projected onto

F_{i}^{'} \in R^{H \times W \times C_{i n}}

space via

H_{C o n v 1} (\cdot)

. To ensure these two features are computed independently in self-attention, we have designed a parallel structure. Following [16], we have introduced a convolutional layer at the tail to fuse shallow and deep features. The high-resolution result is reconstructed via:

I_{S R} = H_{R E} (F_{0} + F_{D F}) .

(8)

where

H_{R E}

denotes the reconstruction module. We have adopted the pixel-shuffle method [23] for up-sampling and we have optimized with

L_{1}

loss, following common practice in SR.

3.3. Degradation Estimation

Accurate degradation features can help the SR model generate more realistic images, but estimating ground-truth degradation features is hard and time-consuming. Actually, our method does not need ground truth; we only need to reflect differences in different regions. Thus, our model has learned degenerate representations of images rather than explicitly estimating ground truth. The estimation network architecture is shown in Figure 4.

Figure 4. The degradation estimation network architecture in our module; orange blocks are MaxPool modules, purple blocks are transposed convolution modules.

Previous works usually assume degenerate features in the same image are globally consistent. However, camera lens focus, CMOS performance limits, and compression information loss lead to diverse degradation in raw images. This process can be expressed as:

I_{H R} = D (I_{R A W}, K_{1}, n) .

(9)

where

K_{1} \in R^{H \times W \times k^{2}}

represents the camera blur kernel,

I_{R A W}

represents the RAW image, n represents noise, and k is filter size. The HR image is degraded by the degradation model:

I_{L R} = D (I_{H R}, K_{1} ＊ K_{2}, n_{1} + n_{2}) .

(10)

where

K_{2} \in R^{H \times W \times k^{2}}

represents the artificial blur kernel. While

K_{2}

can be globally consistent,

K_{1}

is unknown, leading to inconsistent degenerate features in the image. This explains why degradation distributions differ within the same image. To estimate degradation features, our goal is to solve:

θ_{k} = arg min_{θ_{k}} ∥ I_{L R} - (I_{H R} ⊛ H_{D E} (I_{L R}, θ_{k})) ↓_{s} ∥ + ρ ∥ K_{2} - mean (H_{D E} (I_{L R}, θ_{k})) ∥ .

(11)

Here,

∥ \cdot ∥

denotes the

L_{2}

norm, and

{⊛ ↓}_{s}

represents dynamic filtering at each pixel location with stride s, following KOALAnet [8].

H_{D E}

is a degradation estimation module considering reconstruction loss and kernel loss. This allows the model to learn pixel-level information, helping the SR model reconstruct HR images.

Since high-quality real-world LR–HR pairs are difficult to collect, synthetic image pairs are crucial for model training. Following the previous work, the image degradation process is:

\{\begin{matrix} I_{L R c l e a n} = (I_{H R} \otimes k_{b}) ↓_{s} \\ I_{L R} = I_{L R c l e a n} + n \end{matrix}

(12)

where

I_{L R c l e a n}

denotes blurred, downscaled, noise-free images;

k_{b}

is the blur kernel;

↓_{s}

denotes downsampling by scale factor s; n is noise; and

I_{L R}

is the low-quality input for SR. Algorithm 1 has described the training degradation estimation.

Algorithm 1 Training Degradation Estimation

Require: HR images $I_{H}^{n}$ , total number is N
Ensure: Reconstructed SR images $I_{S}^{n R}$
1: Initialization: Initialize all parameters with standard normal distribution, set total iterations for each model to $T_{0}$
2: for $r \leftarrow 1$ to $T_{0}$ do
3: $i_{H} \leftarrow RandomCrop (I_{H}^{n})$
4: {Randomly degrade image patches:}
5: $(i_{d}, D_{g t}) \leftarrow Degrade (i_{L}, n_{p r o b}, j_{p r o b})$
6; {Estimate degenerate features:}
7: $D_{e w} \leftarrow Estimation (i_{d})$ $D_{e w} \in K^{2} \times H \times W$
8: $i_{u n f o l d} \leftarrow Unfold (i_{H})$ $I_{u n f o l d} \in C \times K^{2} \times H \times W$
9: $i_{e l r} \leftarrow repeat (D_{e w}, C) \times i_{u n f o l d}$
10: $D_{e} \leftarrow Mean (D_{e w})$ {Get kernel mean}
11: $L_{l o s s} \leftarrow L 1_{l o s s} (D_{e}, D_{g t}) + L 1_{l o s s} (i_{e l r}, i_{d})$
12: end for

Blurring is a common form of degradation, usually from camera lens or CMOS limitations. Unlike deblurring tasks, SR tasks generally do not handle motion blur. We have chosen isotropic and anisotropic Gaussian blur kernels. Noise is ubiquitous, so noise injection is crucial. We have added Gaussian noise and JPEG compression; Gaussian noise is conservative when noise information is unavailable, and JPEG is widely used for compression. To enhance robustness, we have used random shuffling degradation as in BSRGAN [5].

Ground-truth degradation is hard to estimate, and overly strong constraint loss may not suit SR. Therefore, we have pre-trained our degradation estimation model as in Equation (11), enabling the network to learn effective degradation information early in up-sampling training.

3.4. Theoretical Rationale for U-Net-Based Estimator

Our degradation estimator has used a U-Net architecture for its ability to capture multi-scale features and preserve spatial details. Compared to global estimators, U-Net adaptively has predicted per-pixel degradation maps, allowing for local adaptation to varying blur and noise levels. This aligns with the observation that degradation is often spatially variant in real images.

3.5. Kernel Fusion Transformer Block

As mentioned in [38], SR networks are sensitive to image degradation types and degrees. Thus, fully utilizing degradation properties is crucial for blind SR tasks. SRMD [10] have shown the effectiveness of taking degradation features as input: (1) they have improved network capacity and (2) they have contained warping information, enabling spatial modeling.

The most common fusion method is projecting degradation features into latent space and appending or multiplying them channel-wise with image features. DNNs have then used nonlinear mapping to learn mapping to visual images. Algorithm 2 has described the training-blind SR model.

Algorithm 2 Training-Blind Super-Resolution Model

Require: LR image and its corresponding HR image pairs $P = {(I_{L}, I_{H})}^{n}$ , total number is N
Ensure: Reconstructed SR images $I_{S R}^{n}$
Load pre-trained estimation module parameters
2: Initialization: Initialize SR model parameters with standard normal distribution, set total iterations to $T_{1}$
for $r \leftarrow 1$ to $T_{1}$ do
4: $(i_{L}, i_{H}) \leftarrow RandomCrop (I_{L}^{n}, I_{H}^{n})$
{Random degrade image patches:}
6: $(i_{d}, D_{g t}) \leftarrow Degrade (i_{L}, n_{p r o b}, j_{p r o b})$
{Estimate degenerate features:}
8: $D_{e w} \leftarrow Conv (Estimation (i_{d}))$
$X \leftarrow i_{L}$
10: for $K S T B_{i}$ in $D E F$ do
for $K F B_{j}$ in $K S T B_{i}$ do
12: if $j = 0$ then
$(X, X_{d}) \leftarrow K F B_{j} (X, D_{e w}) + repeat (X, 2)$
14: else
$(X, X_{d}) \leftarrow K F B_{j} (X, X_{d}) + concat (X, X_{d})$
16: end if
end for
18: $X \leftarrow Conv (concat (X, X_{d}))$
end for
20: $i_{S R} \leftarrow PixelShuffle (X)$
end for

Like video tasks [38], this has introduced separate optical flow-aligned frames at pixel level, which has helped network performance. We have used degradation features to guide reconstruction by introducing a separate local degradation estimation model and a novel fusion strategy.

We have considered degenerate information at pixel level, avoiding the above problems. To fully use degradation features, we have proposed a novel fusion strategy: KAST takes local spatial degenerate features as input into the deep feature extractor. Instead of learning feature separation via convolutions, we have introduced self-attention to manually separate features. Specifically, we have proposed Kernel Fusion Layers (KFL), based on transformers to capture long-range dependencies. KFL has calculated attention separately between image features and degenerated features. For image features X and degradation information D, KFL process is:

\begin{matrix} X_{N} & = L N (X), \end{matrix}

(13)

\begin{matrix} D_{N} & = H_{C o n v 1} (R e l u (H_{C o n v 1} (D))), \end{matrix}

(14)

\begin{matrix} X_{M} & = (S) W - K M S A (X_{N}, D_{N}) + Shortcut, \end{matrix}

(15)

\begin{matrix} Y & = M L P (L N (X_{M})) + X_{N} . \end{matrix}

(16)

where

X_{N}

,

D_{N}

,

X_{M}

denote intermediate features; Y is KFL output;

L N

is LayerNorm;

M L P

is multi-layer perceptron; and

(S) W - K M S A

is the shifting-window kernel multi-head self-attention. Note that

M L P

maps

X_{M}

to

X_{N}

dimension. Residual connections stabilize training; shortcut values are concatenated by repeating

X_{N}

or

X_{N}

,

D_{N}

.

For the input feature size

H \times W \times C

in

(S) W - K M S A

, it is partitioned into

\frac{H \times W}{M^{2}}

windows, then standard self-attention is computed per window. For local window feature

X \in R^{M^{2} \times C}

, the attention matrix is:

Attention (Q_{I}, K_{I}, Q_{D}, K_{D}, V_{I}) = SoftMax (\frac{Q_{I} K_{I}^{T}}{\sqrt{d}} + B) V_{I} + SoftMax (\frac{Q_{D} K_{D}^{T}}{\sqrt{d}} + B) V_{I} .

(17)

where B is the position encoding generated by log-space continuous position bias (Log-CPB); d is the feature dimension;

Q_{I}

,

K_{I}

are the query and key for image features;

Q_{D}

,

K_{D}

for degradation features; and

V_{I}

is the value for image features. This approach allows the model learn to reconstruct SR images with degradation and image features, connecting features from different representation subspaces. For low-level vision tasks, pixels are fundamental; pixel geometric relationships are critical. Smooth relative position bias helps transformer models learn better mapping; following [38], our B is generated by Log-CPB, producing bias values more suitable for SR models.

3.6. Parallel Attention Fusion

Parallel attention fusion represents a key innovation in KAST, allowing image and degradation features to interact without direct concatenation or multiplication, thereby preserving their semantic independence while enabling effective cross-feature attention. This strategy captures complex dependencies that are typically lost in direct fusion methods, leading to improved reconstruction fidelity.

The parallel attention mechanism operates through two separate self-attention pathways: one for image features and another for degradation features. For image features X and degradation information D, the attention computation is performed as:

{Attention}_{I} = SoftMax (\frac{Q_{I} K_{I}^{T}}{\sqrt{d}} + B) V_{I}, {Attention}_{D} = SoftMax (\frac{Q_{D} K_{D}^{T}}{\sqrt{d}} + B) V_{I}

(18)

where

Q_{I}, K_{I}, V_{I}

and

Q_{D}, K_{D}

are query, key, and value projections for image and degradation features, respectively, and B is the position encoding generated by Log-CPB. The outputs from both attention pathways are then combined through a gating mechanism that adaptively weighs their contributions based on local context.

This design offers several advantages: (1) it maintains the distinct semantic nature of image content and degradation information, preventing information contamination; (2) it enables the model to selectively focus on relevant degradation cues based on local image characteristics; and (3) it facilitates better gradient flow during training by avoiding the vanishing gradient problems associated with deep concatenation operations. Experimental results demonstrate that this parallel fusion strategy contributes significantly to KAST’s ability to handle spatially variant degradations effectively.

4. Experiments

4.1. Experimental Setup

We have used 3450 2K HR images from the DF2K dataset (DIV2K + Flickr2K) as training data, synthesizing corresponding LR images using Equation (12). For images with ground truth, we have evaluated using PSNR and SSIM; otherwise, we have provided visual comparisons. First, we have evaluated KAST on isotropic Gaussian kernels: kernel size fixed at

21 \times 21

; isotropic Gaussian kernel width randomly selected from [0.2, 4]. During training, patches are randomly degraded by JPEG (quality [40, 95]) and additive Gaussian noise ([0, 40]), then blurred by random isotropic Gaussian kernel and downsampled with bicubic kernel. Testing degradation combinations: {bic, b1.0, b2.0, b3.0, b4.0, n20, n40, j60}.

Second, experiments on anisotropic Gaussian kernels follow [8]: kernel size

21 \times 21

; anisotropic Gaussian kernel width [0.6, 5], rotation [

- π, π

]. We have used the DIV2KRK dataset [6] for evaluation. We also have evaluated on NTIRE 2020 Real-World Super-Resolution Challenge Track 1.

Third, we have extended experiments to real-world datasets: RealSR and DRealSR, with quantitative results added.

4.1.1. Implementation Details

KFTB and KFL numbers: 6; channels: 180; attention heads: 6; window size: 8. Degradation estimation module pre-trained on synthetic data. LR patch size:

64 \times 64

. Adam optimizer (

β_{1} = 0.9

,

β_{2} = 0.99

). Training:

5 \times 10^{5}

iterations on 4 RTX3090 GPUs. Initial learning rate:

2 \times 10^{- 4}

, halved at [250,000, 400,000, 450,000, 475,000] iterations. Data augmentation: random horizontal flips, channel shuffles, rotations.

4.1.2. PSNR and SSIM Definitions

PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index) are defined as:

PSNR = 10 {log}_{10} (\frac{{MAX}^{2}}{MSE}), SSIM (x, y) = \frac{(2 μ_{x} μ_{y} + c_{1}) (2 σ_{x y} + c_{2})}{(μ_{x}^{2} + μ_{y}^{2} + c_{1}) (σ_{x}^{2} + σ_{y}^{2} + c_{2})},

where MAX is the maximum pixel value, MSE is the mean squared error,

μ

denotes the mean,

σ

denotes variance, and

c_{1}, c_{2}

are constants.

4.2. Ablation Study and Discussion

4.2.1. Effectiveness of Log-CPB

We have conducted experiments to demonstrate Log-CPB effectiveness. Quantitative performance on Set5 and Set14 for ×2 SR is shown in Table 2. Log-CPB brings a gain of 0.05–0.1 dB.

Table 2. Quantitative comparison with/without Log-CPB.

Benefiting from log-space learnable position bias, the network has learned fine-grained position representations, clearly expressing pixel position relationships. SR models use position prior information to connect pixels. As in Figure 5, Log-CPB learns smoother, more precise positional relationships maintained in deep layers, providing better interpretability.

Figure 5. Visualization of learnable position bias matrices using KAST and SwinIR (one head in blocks 3 and 23).

Quantitative comparisons on different datasets with different degradations are in Table 3 and Table 4.

Table 3. Quantitative comparison on BSD100 datasets with different degradation.

Table 4. Quantitative comparison on Urban100 datasets with different degradation.

As image features transmit to deep layers, semantic information differentiates, causing position bias to optimize in different directions; linear CPB cannot provide precise location information effectively because it is no longer pure location information.

4.2.2. Effectiveness of Pre-Trained Degradation Estimation Model

As discussed, the degradation estimation model needs pre-training before SR training, enabling the SR model to learn to reconstruct using degenerate features early. A comparative experiment has shown degradation information validity: one with pre-trained degradation estimation, one without, both trained on blurred images. As shown in Table 5, pre-trained model has obtained better performance, showing degenerate features help SR.

Table 5. Quantitative comparison with/without pre-trained degradation estimation.

As mentioned in [38], certain architectures or training methods help SR models learn different degenerate features, activating specified filters per degradation. We have investigated this in transformer backbones, shown in two ways. First, Figure 6 and Second Figure 7.

Figure 6. Projected feature representations with different blur width extracted from pre-trained KAST by using t-SNE. The x-axis represents t-SNE Dimension 1 and the y-axis represents t-SNE Dimension 2. (a) Blur features; (b) image features.

Figure 7. Projected feature representations with different blur width extracted from no pre-trained KAST by using t-SNE. The x-axis represents t-SNE Dimension 1 and the y-axis represents t-SNE Dimension 2. (a) Blur features; (b) image features.

Models without pre-training have no blur discriminative power, focusing only on image content; pre-trained models show blur discriminative ability, differentiating blur levels. Pure transformer backbones struggle to learn degradation information without additional methods. Thus, we have introduced pre-trained degradation estimation; even after fine-tuning, the model distinguishes degenerate features.

In addition to blur, we have performed experiments with noise levels. Throughout training, we never explicitly have introduced noise modeling, but pre-trained models distinguish noise levels as shown in (Figure 8). Pre-trained models have shown manifold relationships in high-dimensional space across noise levels as shown in (Figure 9).

Figure 8. Projected feature representations with different noise level extracted from pre-trained KAST by using t-SNE.The x-axis represents t-SNE Dimension 1 and the y-axis represents t-SNE Dimension 2. (a) Noise features; (b) image features.

Figure 9. Projected feature representations with different noise level extracted from no pre-trained KAST by using t-SNE.The x-axis represents t-SNE Dimension 1 and the y-axis represents t-SNE Dimension 2. (a) Noise features; (b) image features.

This has suggested different tasks limit the model’s ability to capture degradation information; it relates to loss function. Our estimation loss is similar to GAN loss [38]; both enable SR models to distinguish degradation types/degrees, even if not explicitly introduced during training. Different optimization procedures help models capture different image features.

4.2.3. Effectiveness of Kernel Fusion Blocks

Deep feature extraction uses KFL to fuse image and degraded features. Two considerations: (1) degenerate features help at pixel scale; (2) they lack image content but aid reconstruction. We have introduced KFL blocks to help reconstruct HR images. To reveal mechanisms, we have used LAM, an attribution method for SR. LAM comparisons between SwinIR and KAST are shown in Figure 10.

Figure 10. LAM result comparisons between SwinIR and KAST. (a) SwinIR results; (b) KAST results.

LAM has shown which pixels contribute most to the selected region; red points in LAM results represent pixels used to reconstruct patch marked with red box in HR image; Diffusion Index (DI) reflects range of involved pixels. Higher DI means wider pixel range used. Generally, more information leads to better performance. As shown in Figure 10, KAST has outperformed SwinIR in DI; KAST LAM attribution extends to nearly complete images, benefiting from a U-Net-based degradation estimation network, which aggregates features efficiently. KFB has helped transformers use more pixels without window size change, which pure transformers ignore. KFB has extracted degenerate features to help reconstruct HR images.

4.3. Comparison with Other Methods

4.3.1. Quantitative Results with Baseline Methods

We have compared KAST with baselines under different degradations. Quantitative results for multiple degradations are in Table 3 and Table 4. KAST has provided satisfactory PSNR/SSIM on multi-degraded datasets vs. baselines. KAST has outperformed baselines on most benchmark datasets, surpassing SwinIR by 0.2–0.4 dB on serious degradations (e.g., n40, j60, b4). BSD100 contains natural pictures degraded by noise and blur; our degradation model’s degenerate features do not work as expected, possibly due to excessive denoising or hard-to-estimate degradation. Compared to baselines, KAST has coped better with various degradations, especially blur width changes; as degradation intensifies, the gap widens. Quantitative results have verified KAST handles degradation types better than baselines.

4.3.2. Quantitative Results with Other Methods

To further verify robustness across degradations, we have compared with other methods under different degradation models. For fair comparison, we have fine-tuned models under BSRGAN setting [5]. Compare KAST with other blind SR methods.

First, the best performances come from transformer or mixed transformer models, meaning larger windows and long-range dependencies help reconstruction. Second, non-blind SR method RCAN has achieved higher performance than some blind SR methods, showing channel attention also improves performance. Finally, our approach has achieved highest average under all degradation situations; vs. reference methods, KAST has provided competitive SR performance under BSRGAN setting. Unlike previous results, we have achieved best performance on BSD100 dataset. KAST has surpassed SwinIR by 0.24–0.84 dB on blur degradation, proving effectiveness for BSRGAN setting; additional degenerate features help reconstruction. Visual comparison for ×4 SR in Figure 11.

Figure 11. Visual comparison for ×4 SR. The patches for comparison are marked with red and yellow boxes in the original images. PSNR/SSIM is calculated based on the whole image.

Specifically, KAST surpasses others by 0.3 dB on average on Urban100. Under severe degradation, our model has outperformed others by 0.4 dB, confirming KAST handles severe down-sampling degradation better. Blind SR models have higher PSNR on blurred images than bicubic images, contrary to common sense; similar to long-tail classification, performance drops when dealing with out-of-domain data. We have provided visual comparisons on Urban100 (img 4, 11, 78) and DIV2K (img 0802, 0828) as shown in Table 6.

Table 6. Quantitative comparison on BSD100 and Urban100. The best and the second-best values are highlighted in red and blue.

Compared to models without degradation training, only degradation-trained models have strong deblurring/denoising ability. BSRNet and BSRGAN have achieved good visual results but produce overly smooth images, losing high-frequency information. SwinIR and KAST have recovered better details, but KAST has stronger denoising/deblurring ability. Degradation estimation has helped KAST restore image structure and texture.

We have further evaluated on anisotropic Gaussian kernels, more general and challenging following this [6] setting. We have compared with ZSSR [32], EDSR [2], RCAN [4], DBPN, KernelGAN [6], KOALAnet [8]. Quantitative results are in Table 7. ZSSR is unsupervised, works well with degradation estimation. Our method, like KOALAnet, has used two-stage solution but handled degenerate features differently. Our model has handled anisotropic Gaussian kernels better.

Table 7. Quantitative comparison of the proposed method with other methods. The best and the second-best values are highlighted in red and blue.

Although KAST is not the best in the DIV2KRK dataset, it has achieved good results under multiple degradation models.

4.3.3. Visual Results with Real-World Images

Besides multi-degraded datasets, we have compared KAST with others on real-world datasets to demonstrate effectiveness. Some images lack ground truth, so we have provided visual comparisons. For better comparison, we have combined ZSSR and DnCNN, since ZSSR lacks denoising ability. Visual comparisons are shown in Figure 12.

Figure 12. Visual comparison for ×4 real-world images. (a) Historical images (source: DIV2K dataset [58]); (b) NTIRE 2020 Real-World Super-Resolution Challenge Track 1 images (source: NTIRE 2020 challenge dataset [53]). The patches for comparison are marked with yellow boxes in the original images.

ZSSR suffers from artifacts and blurring; BSRNet has produced overly smooth results with insufficient detail fidelity. KAST has provided sharper edges and finer textures. By considering degradation differences, our model effectively deals with blur, recovering details without over-sharpening the background. When facing noise and blur combinations, KAST effectively denoises while recovering more details.

4.3.4. Limitations and Failure Cases

While KAST demonstrates strong performance across various degradation scenarios, certain limitations warrant discussion. The method may underperform in cases where degradation patterns are extremely irregular or when the degradation estimation module fails to capture subtle local variations. Specifically, we observed reduced effectiveness in images with extremely low-light conditions combined with heavy noise, where degradation estimation becomes particularly challenging. Additionally, KAST may struggle with non-uniform motion blur that varies unpredictably across the image, as well as mixed degradation types that interact in complex ways not well-represented in the training data.

These failure cases highlight the need for more robust degradation modeling and suggest directions for future work, including the development of more adaptive estimation mechanisms and expanded training data coverage.

5. Conclusions

This paper has presented KAST, a novel Kernel Adaptive Swin Transformer designed for blind image super-resolution. The proposed method addresses the limitations of existing approaches by introducing three key innovations: local degradation-aware modeling, parallel attention-based feature fusion, and log-space continuous position bias (Log-CPB).

Our contributions significantly advance the field of image restoration in several ways. First, the local degradation-aware modeling enables KAST to capture spatially variant degradation patterns that are prevalent in real-world images, overcoming the global consistency assumption that limits many existing methods. Quantitative results demonstrate that this approach improves PSNR by 0.2–0.4 dB on severe degradation scenarios compared to state-of-the-art methods like SwinIR.

Second, the parallel attention fusion strategy preserves the semantic independence of image and degradation features while enabling effective cross-feature interaction. This design choice prevents information contamination and allows for adaptive weighting based on local context, leading to improved reconstruction fidelity. Ablation studies confirm that this fusion mechanism contributes to a 0.05–0.1 dB performance gain.

Third, Log-CPB provides smoother and more fine-grained positional representations, enhancing the model’s ability to capture pixel relationships. This innovation proves particularly valuable in maintaining positional consistency across deep layers, where conventional position encoding methods often fail.

The experimental validation across multiple datasets demonstrates that KAST consistently outperforms existing methods in handling diverse degradation types, including blur, noise, and their combinations. On the BSD100 dataset, KAST achieves an average PSNR of 25.61 dB, surpassing SwinIR-GD by 0.06 dB. More significantly, on challenging urban scenes from Urban100, KAST achieves 24.44 dB, outperforming SwinIR-GD by 0.28 dB. These improvements translate to visibly better reconstruction quality, with sharper edges and more natural textures.

Despite parallel processing not significantly increasing inference time, integrating degradation features have increased parameters. While we can reduce degradation estimation input dimension, we did not investigate its impact on performance. Future work will focus on optimizing the network architecture for efficiency without compromising performance. Additionally, exploring the application of KAST’s principles to other image restoration tasks such as denoising, dehazing, and inpainting represents a promising research direction.

Author Contributions

Conceptualization, Z.N., J.W., A.B. and L.Y.; methodology, Z.N., J.W., A.B. and L.Y.; software, Z.N., J.W., A.B. and L.Y.; validation, Z.N., J.W., A.B. and L.Y.; formal analysis, Z.N., J.W., A.B. and L.Y.; investigation, Z.N., J.W., A.B. and L.Y.; resources, Z.N., J.W., A.B. and L.Y.; data curation, Z.N., J.W., A.B. and L.Y.; writing—original draft preparation, Z.N., J.W., A.B. and L.Y.; writing—review and editing, Z.N., J.W., A.B. and L.Y.; visualization, Z.N., J.W., A.B. and L.Y.; project administration Z.N., J.W., A.B. and L.Y.; funding acquisition, Z.N., J.W., A.B. and L.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the Major Natural Science Research Project of Higher Education Institutions in Jiangsu Province; Jiangsu Postgraduate Training Innovation Project under Grants No. KYCX22-2223.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Acknowledgments

The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307. [Google Scholar] [CrossRef] [PubMed]
Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144. [Google Scholar]
Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 286–301. [Google Scholar]
Zhang, K.; Liang, J.; Van Gool, L.; Timofte, R. Designing a practical degradation model for deep blind image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 4791–4800. [Google Scholar]
Bell-Kligler, S.; Shocher, A.; Irani, M. Blind super-resolution kernel estimation using an internal-gan. Adv. Neural Inf. Process. Syst. 2019, 32, 284–293. [Google Scholar]
Huang, Y.; Li, S.; Wang, L.; Tan, T. Unfolding the alternating optimization for blind super resolution. Adv. Neural Inf. Process. Syst. 2020, 33, 5632–5643. [Google Scholar]
Kim, S.Y.; Sim, H.; Kim, M. Koalanet: Blind super-resolution using kernel-oriented adaptive local adjustment. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 10606–10615. [Google Scholar]
Huang, M.; Wang, T.; Cai, Y.; Fan, H.; Li, Z. StainGAN: Learning a structural preserving translation for white blood cell images. J. Biophotonics 2023, 16, e202300196. [Google Scholar] [CrossRef] [PubMed]
Zhang, K.; Zuo, W.; Zhang, L. Learning a single convolutional super-resolution network for multiple degradations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3262–3271. [Google Scholar]
Luo, Z.; Huang, H.; Yu, L.; Li, Y.; Fan, H.; Liu, S. Deep constrained least squares for blind image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–22 June 2022; pp. 17642–17652. [Google Scholar]
Hui, Z.; Li, J.; Wang, X.; Gao, X. Learning the non-differentiable optimization for blind super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 2093–2102. [Google Scholar]
Lian, W.; Peng, S. Kernel-aware raw burst blind super-resolution. arXiv 2021, arXiv:2112.07315. [Google Scholar]
Liang, J.; Zeng, H.; Zhang, L. Efficient and degradation-adaptive network for real-world image super-resolution. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 574–591. [Google Scholar]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 1833–1844. [Google Scholar]
Gu, S.; Sang, N.; Ma, F. Fast image super resolution via local regression. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), Tsukuba, Japan, 11–15 November 2012; pp. 3128–3131. [Google Scholar]
Michaeli, T.; Irani, M. Nonparametric blind super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 945–952. [Google Scholar]
Timofte, R.; De Smet, V.; Van Gool, L. Anchored neighborhood regression for fast example-based super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 1920–1927. [Google Scholar]
Riegler, G.; Schulter, S.; Ruther, M.; Bischof, H. Conditioned regression models for non-blind single image super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 522–530. [Google Scholar]
Xie, L.; Wang, X.; Dong, C.; Qi, Z.; Shan, Y. Finding discriminative filters for specific degradations in blind super-resolution. Adv. Neural Inf. Process. Syst. 2021, 34, 51–61. [Google Scholar]
Liang, J.; Zhang, K.; Gu, S.; Van Gool, L.; Timofte, R. Flow-based kernel prior with application to blind super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10601–10610. [Google Scholar]
Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar]
Kong, X.; Zhao, H.; Qiao, Y.; Dong, C. Classsr: A general framework to accelerate super-resolution networks by data characteristic. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 12016–12025. [Google Scholar]
Muqeet, A.; Iqbal, M.T.B.; Bae, S.H. HRAN: Hybrid residual attention network for single image super-resolution. IEEE Access 2019, 7, 137020–137029. [Google Scholar] [CrossRef]
Pesavento, M.; Volino, M.; Hilton, A. Attention-based multi-reference learning for image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 14697–14706. [Google Scholar]
Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Shi, W. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690. [Google Scholar]
Johnson, J.; Alahi, A.; Li, F.-F. Perceptual losses for real-time style transfer and super-resolution. In Computer Vision–ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 694–711. [Google Scholar]
Mechrez, R.; Talmi, I.; Shama, F.; Zelnik-Manor, L. Maintaining natural image statistics with the contextual loss. arXiv 2019, arXiv:1803.04626. [Google Scholar]
Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C. Esrgan: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Glasgow, UK, 23–28 August 2019; pp. 63–79. [Google Scholar]
Efrat, N.; Glasner, D.; Apartsin, A.; Nadler, B.; Levin, A. Accurate blur models vs. image priors in single image super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 2832–2839. [Google Scholar]
Shocher, A.; Cohen, N.; Irani, M. “zero-shot” super-resolution using deep internal learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3118–3126. [Google Scholar]
Ji, X.; Cao, Y.; Tai, Y.; Wang, C.; Li, J.; Huang, F. Real-world super-resolution via kernel estimation and noise injection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 466–467. [Google Scholar]
Gu, J.; Lu, H.; Zuo, W.; Dong, C. Blind super-resolution with iterative kernel correction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1604–1613. [Google Scholar]
Vaswani, A. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11. [Google Scholar]
Dosovitskiy, A.; Beyer, L. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.S.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 6877–6886. [Google Scholar]
Zhang, K.; Zuo, W.; Chen, Y.; Meng, D.; Zhang, L. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Trans. Image Process. 2017, 26, 3142–3155. [Google Scholar] [CrossRef] [PubMed]
Fan, J.; Weng, J.; Wang, K.; Yang, Y.; Qian, J.; Li, J.; Yang, J. Driving-video dehazing with non-aligned regularization for safety assistance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–24 June 2024; pp. 26109–26119. [Google Scholar]
Liu, Y.; Wang, X.; Hu, E.; Wang, A.; Shiri, B.; Lin, W. VNDHR: Variational single nighttime image Dehazing for enhancing visibility in intelligent transportation systems via hybrid regularization. IEEE Trans. Intell. Transp. Syst. 2025, 26, 10189–10203. [Google Scholar] [CrossRef]
Zhang, S.; Zhang, X.; Shen, L.; Wan, S.; Ren, W. Wavelet-based physically guided normalization network for real-time traffic dehazing. Pattern Recognit. 2025, 2025, 112451. [Google Scholar] [CrossRef]
Fan, J.; Wang, K.; Yan, Z.; Chen, X.; Gao, S.; Li, J.; Yang, J. Depth-centric dehazing and depth-estimation from real-world hazy driving video. Aaai Conf. Artif. Intell. 2025, 39, 2852–2860. [Google Scholar] [CrossRef]
Liu, Y.; Yan, Z.; Tan, J.; Li, Y. Multi-purpose oriented single nighttime image haze removal based on unified variational retinex model. IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 1643–1657. [Google Scholar] [CrossRef]
Fan, J.; Li, X.; Qian, J.; Li, J.; Yang, J. Non-aligned supervision for real image dehazing. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 10705–10715. [Google Scholar] [CrossRef]
Zhang, S.; Zhang, X.; Ren, W.; Zhao, L.; Fan, E.; Huang, F. Exploring Fuzzy Priors From Multi-Mapping GAN for Robust Image Dehazing. IEEE Trans. Fuzzy Syst. 2025, 33, 3946–3958. [Google Scholar] [CrossRef]
Mei, Y.; Fan, Y.; Zhou, Y. Image super-resolution with non-local sparse attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 3516–3525. [Google Scholar]
Jia, X.; De Brab, ere, B.; Tuytelaars, T.; Gool, L.V. Dynamic filter networks. Adv. Neural Inf. Process. Syst. 2016, 29, 1–9. [Google Scholar]
Liu, Y.; Liu, A.; Gu, J.; Zhang, Z.; Wu, W.; Qiao, Y.; Dong, C. Discovering “Semantics” in Super-Resolution Networks. arXiv 2021, arXiv:2108.00406. [Google Scholar]
Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, Quebec, Canada, 8–13 December 2014; pp. 568–576. [Google Scholar]
Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin transformer v2: Scaling up capacity and resolutionIn. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–22 June 2022; pp. 12009–12019. [Google Scholar]
Timofte, R.; Agustsson, E.; Van Gool, L.; Yang, M.H.; Zhang, L. Ntire 2017 challenge on single image super-resolution: Methods and results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 1110–1121. [Google Scholar]
Lugmayr, A.; Danelljan, M.; Timofte, R. Ntire 2020 challenge on real-world image super-resolution: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 2058–2076. [Google Scholar]
Gu, J.; Dong, C. Interpreting super-resolution networks with local attribution maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 9199–9208. [Google Scholar]
Zhang, W.; Shi, G.; Liu, Y.; Dong, C.; Wu, X.M. A closer look at blind super-resolution:Degradation models, baselines, and performance upper bounds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–22 June 2022; pp. 527–536. [Google Scholar]
Haris, M.; Shakhnarovich, G.; Ukita, N. Deep back-projection networks for super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1664–1673. [Google Scholar]
Hussein, S.A.; Tirer, T.; Giryes, R. Correction filter for single image super-resolution: Robustifying off-the-shelf deep super-resolvers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 1428–1437. [Google Scholar]
Gu, S.; Lugmayr, A.; Danelljan, M.; Fritsche, M.; Lamour, J.; Timofte, R. Div8k: Diverse 8k resolution image dataset. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 3512–3516. [Google Scholar]

Figure 1. Performance comparison of the proposed method with SwinIR in ×4 blind super-resolution, with different blur widths (2, 3, 4).

Figure 2. The overall architecture of KAST and the structure of KFTB.

Figure 3. After fine-tuning with the SR task, visualization results of degradation features in the SR model. The second line is not pre-trained with degradation; the third line is pre-trained with degradation.

Figure 4. The degradation estimation network architecture in our module; orange blocks are MaxPool modules, purple blocks are transposed convolution modules.

Figure 5. Visualization of learnable position bias matrices using KAST and SwinIR (one head in blocks 3 and 23).

Figure 6. Projected feature representations with different blur width extracted from pre-trained KAST by using t-SNE. The x-axis represents t-SNE Dimension 1 and the y-axis represents t-SNE Dimension 2. (a) Blur features; (b) image features.

Figure 7. Projected feature representations with different blur width extracted from no pre-trained KAST by using t-SNE. The x-axis represents t-SNE Dimension 1 and the y-axis represents t-SNE Dimension 2. (a) Blur features; (b) image features.

Figure 8. Projected feature representations with different noise level extracted from pre-trained KAST by using t-SNE.The x-axis represents t-SNE Dimension 1 and the y-axis represents t-SNE Dimension 2. (a) Noise features; (b) image features.

Figure 9. Projected feature representations with different noise level extracted from no pre-trained KAST by using t-SNE.The x-axis represents t-SNE Dimension 1 and the y-axis represents t-SNE Dimension 2. (a) Noise features; (b) image features.

Figure 10. LAM result comparisons between SwinIR and KAST. (a) SwinIR results; (b) KAST results.

Figure 11. Visual comparison for ×4 SR. The patches for comparison are marked with red and yellow boxes in the original images. PSNR/SSIM is calculated based on the whole image.

Figure 12. Visual comparison for ×4 real-world images. (a) Historical images (source: DIV2K dataset [58]); (b) NTIRE 2020 Real-World Super-Resolution Challenge Track 1 images (source: NTIRE 2020 challenge dataset [53]). The patches for comparison are marked with yellow boxes in the original images.

Table 1. Comparison of KAST with existing blind SR methods.

Method	Degradation Modeling	Fusion Strategy	Attention Mechanism
SRMD [10]	Global	Concatenation	CNN
KOALAnet [8]	Local (kernel-oriented)	Adaptive adjustment	CNN
SwinIR [16]	Non-degradation-aware	None	Swin Transformer
KAST (Ours)	Local degradation-aware	Parallel attention fusion	Swin Transformer + Log-CPB

Table 2. Quantitative comparison with/without Log-CPB.

Dataset	w/o Log-CPB		w/ Log-CPB
Dataset	PSNR	SSIM	PSNR	SSIM
Set5	35.95	0.944	36.05	0.945
Set14	31.84	0.897	31.91	0.898

Table 3. Quantitative comparison on BSD100 datasets with different degradation.

Dataset	Method	Degradation Types
		b1	b2	b1j60	b2j60	b1n20	b2n20	b1n20j60	b2n20j60	b2n40j60
Set5	SwinIR	37.63	36.77	36.35	35.68	34.69	33.87	33.96	33.34	31.03
Set5	KAST	37.82	36.95	36.49	35.80	34.80	33.93	34.11	33.57	31.24
Set14	SwinIR	33.47	33.25	32.84	32.73	31.48	31.18	30.89	30.82	28.95
Set14	KAST	33.60	33.48	32.98	32.96	31.56	31.35	31.01	31.03	29.12
Urban100	SwinIR	31.58	30.71	31.04	30.33	30.14	29.44	29.72	29.12	27.15
Urban100	KAST	31.58	30.84	31.01	30.41	30.17	29.61	29.74	29.23	27.36
BSD100	SwinIR	32.25	32.40	31.74	31.84	30.35	30.40	30.01	30.03	28.18
BSD100	KAST	32.22	32.34	31.73	31.79	30.33	30.37	29.99	30.07	28.16
Manga109	SwinIR	38.26	36.09	36.47	35.88	34.62	34.28	34.00	33.86	30.69
Manga109	KAST	38.33	36.24	36.56	36.09	34.71	34.46	34.12	33.97	30.83

Table 4. Quantitative comparison on Urban100 datasets with different degradation.

Dataset	Method	Degradation Types
		b3	b4	b3j60	b4j60	b3n20	b4n20	b3n20j60	b4n20j60	b4n40j60
Set5	SwinIR	31.49	29.54	31.36	29.39	30.80	28.64	30.54	28.73	27.43
Set5	KAST	31.71	29.90	31.51	29.73	31.00	28.95	30.74	28.89	27.81
Set14	SwinIR	28.67	28.05	28.50	27.95	28.13	27.56	27.99	27.54	26.61
Set14	KAST	28.85	28.23	28.70	28.19	28.32	27.74	28.18	27.76	26.88
Urban100	SwinIR	26.33	25.65	26.06	25.52	25.98	25.23	25.79	25.29	24.75
Urban100	KAST	26.49	25.83	26.19	25.66	26.09	25.48	25.87	25.40	24.85
BSD100	SwinIR	27.87	27.75	27.67	27.54	27.39	27.27	27.22	27.14	26.42
BSD100	KAST	27.86	27.68	27.66	27.51	27.38	27.26	27.23	27.12	26.46
Manga109	SwinIR	30.76	29.22	30.43	29.08	30.03	28.64	29.82	28.53	27.42
Manga109	KAST	30.96	29.48	30.66	29.32	30.27	28.91	30.03	28.83	27.71

Table 5. Quantitative comparison with/without pre-trained degradation estimation.

Dataset	w/o Pre-Train		w/ Pre-Train
Dataset	PSNR	SSIM	PSNR	SSIM
Set5	32.31	0.904	32.65	0.909
Set14	29.55	0.839	29.90	0.846

Table 6. Quantitative comparison on BSD100 and Urban100. The best and the second-best values are highlighted in red and blue.

Dataset	Methods	Degradation Types
Dataset	Methods	bic	b2	n20	j60	b2n20	b2j60	n20j60	b2n20j60	Average
BSD100	Bicubic	24.63	25.40	21.56	24.06	21.90	24.65	21.22	21.72	23.14
	RCAN [4]	25.65	26.77	24.63	25.16	24.39	25.36	24.36	24.15	25.06
	SRResNet-FAIG [53]	25.58	26.72	24.53	25.11	24.26	25.29	24.32	24.07	24.99
	RRDBNet [55]	25.62	26.76	24.58	25.13	24.33	25.32	24.34	24.11	25.02
	SwinIR [16]	25.84	27.05	24.77	25.27	24.48	25.44	24.44	24.18	25.18
	RRDBNet-GD [55]	26.25	27.31	25.31	25.23	24.95	25.32	24.38	24.07	25.35
	SwinIR-GD [16]	26.61	27.58	25.64	25.30	25.30	25.39	24.44	24.14	25.55
	KAST (Ours)	26.68	27.82	25.44	25.24	25.35	25.45	24.64	24.48	25.61
Urban100	Bicubic	21.89	22.54	20.00	21.50	20.36	22.02	19.74	20.20	21.03
	RCAN [4]	23.65	24.67	22.93	23.35	22.59	23.36	22.77	22.35	23.21
	SRResNet-FAIG [53]	23.54	24.42	22.88	23.26	22.42	23.16	22.73	22.19	23.08
	RRDBNet [55]	23.53	24.46	22.89	23.28	22.48	23.17	22.75	22.24	23.10
	SwinIR [16]	24.16	25.10	23.34	23.73	22.86	23.62	23.09	22.53	23.55
	RRDBNet-GD [55]	24.51	25.39	23.57	23.67	23.05	23.18	22.92	22.13	23.55
	SwinIR-GD [16]	25.55	26.12	24.40	24.11	23.83	23.56	23.26	22.42	24.16
	KAST (Ours)	25.85	26.96	24.41	24.16	24.01	23.76	23.56	22.89	24.44

Table 7. Quantitative comparison of the proposed method with other methods. The best and the second-best values are highlighted in red and blue.

Method	DIV2KRKx2
Method	PSNR	SSIM
Bicubic	28.73	0.8040
Bicubic+ZSSR [32]	29.10	0.8215
EDSR [2]	29.17	0.8216
RCAN [4]	29.20	0.8233
DBPN	29.13	0.8190
DBPN+Correction	30.38	0.8717
KernelGAN [6]+SRMD [10]	29.57	0.8564
KernelGAN [6]+ZSSR [32]	30.36	0.8669
KOALAnet [8]	31.89	0.8852
KAST	32.21	0.8957

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Kernel Adaptive Swin Transformer for Image Restoration

Abstract

1. Introduction

3. Proposed Method

3.1. Motivation

3.2. Overall Pipeline

3.3. Degradation Estimation

3.4. Theoretical Rationale for U-Net-Based Estimator

3.5. Kernel Fusion Transformer Block

3.6. Parallel Attention Fusion

4. Experiments

4.1. Experimental Setup

4.1.1. Implementation Details

4.1.2. PSNR and SSIM Definitions

4.2. Ablation Study and Discussion

4.2.1. Effectiveness of Log-CPB

4.2.2. Effectiveness of Pre-Trained Degradation Estimation Model

4.2.3. Effectiveness of Kernel Fusion Blocks

4.3. Comparison with Other Methods

4.3.1. Quantitative Results with Baseline Methods

4.3.2. Quantitative Results with Other Methods

4.3.3. Visual Results with Real-World Images

4.3.4. Limitations and Failure Cases

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics

Kernel Adaptive Swin Transformer for Image Restoration

Abstract

1. Introduction

2. Related Works

2.1. Single Degrade Super-Resolution

2.2. Blind Super-Resolution

2.3. Vision Transformer

2.4. Comparison with Existing Methods and Related Restoration Tasks

3. Proposed Method

3.1. Motivation

3.2. Overall Pipeline

3.3. Degradation Estimation

3.4. Theoretical Rationale for U-Net-Based Estimator

3.5. Kernel Fusion Transformer Block

3.6. Parallel Attention Fusion

4. Experiments

4.1. Experimental Setup

4.1.1. Implementation Details

4.1.2. PSNR and SSIM Definitions

4.2. Ablation Study and Discussion

4.2.1. Effectiveness of Log-CPB

4.2.2. Effectiveness of Pre-Trained Degradation Estimation Model

4.2.3. Effectiveness of Kernel Fusion Blocks

4.3. Comparison with Other Methods

4.3.1. Quantitative Results with Baseline Methods

4.3.2. Quantitative Results with Other Methods

4.3.3. Visual Results with Real-World Images

4.3.4. Limitations and Failure Cases

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics