
Adjustable Complexity Transformer Architecture for Image Denoising

1 Department of Electrical Engineering, National Chung Hsing University, Taichung 402202, Taiwan
2 General Education Center, Chaoyang University of Technology, Taichung 413310, Taiwan
* Author to whom correspondence should be addressed.
Signals 2026, 7(2), 33; https://doi.org/10.3390/signals7020033
Submission received: 13 February 2026 / Revised: 16 March 2026 / Accepted: 18 March 2026 / Published: 6 April 2026

Abstract

In recent years, image denoising has shifted from traditional non-local self-similarity methods like BM3D to deep learning-based approaches that use learnable convolutions and attention mechanisms. While pixel-level attention effectively captures long-range relationships, similar to non-local self-similarity methods, it incurs extremely high computational costs that scale quadratically with image resolution. As an alternative, channel-wise attention is resolution-independent and computationally efficient but may miss crucial spatial details. In this paper, an adjustable attention mechanism is introduced that bridges the gap between pixel and channel attention. In the proposed model, average pooling and variable-size convolutions are added before the attention calculation to adjust spatial resolution and thus allow dynamic adjustment of computational complexity. This adjustable attention is applied in a transformer-based U-Net architecture and achieves performance comparable to state-of-the-art methods in both real and blind Gaussian denoising tasks. Concretely, the proposed method achieves a Peak Signal-to-Noise Ratio of 39.65 dB and a Structural Similarity Index Measure of 0.913 on the Smartphone Image Denoising Dataset. The proposed method therefore demonstrates a balance between efficiency and denoising quality.

1. Introduction

Image denoising plays a crucial role in enhancing image quality before subsequent image processing and computer vision tasks [1,2,3]. Historically, non-local self-similarity (NSS) methods like Block-matching and 3D filtering (BM3D) were considered the gold standard [4,5,6]. After the groundbreaking introduction of DnCNN [7], subsequent deep learning-based denoisers significantly outperformed these traditional NSS methods and have gradually evolved from non-blind models to blind denoising architectures capable of handling complex, non-Gaussian real-world noise [8,9,10].
Two key building blocks contribute to these remarkable advances of neural networks: learnable convolutions [11,12] and attention mechanisms [13,14,15]. While convolutions are effective for local feature extraction, attention mechanisms excel at modeling long-range dependencies, similar to identifying structural similarities in traditional NSS methods. To further enhance adaptability, multi-path architectures are frequently employed to implement various restoration strategies [16,17,18,19,20,21,22]. Most notably, transformer architectures that rely on multi-head attention have become performance leaders in image denoising [23,24]. Despite these advances, a significant challenge remains regarding the computational complexity of attention mechanisms. Pixel-wise attention, utilized in models like SwinIR [25,26,27,28,29], effectively captures spatial relationships but suffers from quadratic complexity relative to image resolution. Conversely, channel-wise attention, as seen in Restormer [24,30], offers resolution-independent efficiency but may overlook critical spatial details.
In this paper, an adjustable attention mechanism is introduced to bridge the gap between these two architectural extremes. By incorporating variable-stride average pooling and variable-size convolutions before the attention calculation, the spatial resolution of the key and value matrices can be manipulated. This design allows for the dynamic adjustment of computational complexity based on specific hardware constraints or noise characteristics. The proposed mechanism is integrated into a transformer-based U-Net architecture, achieving denoising performance comparable to state-of-the-art models while maintaining a superior balance between efficiency and quality.
This paper is organized as follows: In Section 2, previous studies are reviewed. Section 3 introduces the proposed network architecture. Section 4 presents experimental results, and Section 5 concludes the paper.

2. Related Works

As mentioned in the previous section, the first breakthrough in deep learning-based denoising was DnCNN [7]. The early CNN-based denoisers that followed were either trained at a specific noise level or required the noise level as prior information [31]; they are therefore considered non-blind denoisers. The exception is DnCNN, which can also be trained under a blind setting, although with lower denoising performance than in the non-blind setting. Since non-blind denoising is severely limited in practical applications, the development of CNN denoising gradually moved toward blind denoising, which does not require prior noise knowledge.
The rapid expansion of deep learning research has led to a wealth of innovations in both blind and real-world image denoising. Because this paper focuses specifically on transformer-based techniques, before delving into those, a concise overview of other blind denoising methods will be provided. Given the diverse nature of noise, a common and effective strategy in denoising networks is to employ multiple paths with various aims [17,18,32,33,34,35,36,37,38]. One prominent approach within this strategy is to dedicate an independent path to estimate noise characteristics [17,18,36,37,38]. Alternatively, multi-path networks can apply different convolutional strategies across their paths [19,20,39,40,41], such as using various receptive fields or multiple resolutions.
Before the advent of transformer-based techniques and their multi-head attention mechanisms, neural networks employing single-head attention also delivered impressive denoising results [14,15,16,33,34,35]. Notable examples include the residual non-local attention network (RNAN) [14] and the collaborative attention network (COLA-Net) [35]. The RNAN integrated pathways both with and without attention, while the COLA-Net adaptively blended local and non-local attention mechanisms to achieve effective image restoration.
The first adoptions of transformer-based techniques from natural language processing (NLP) to image processing were IPT [42] and ViT [43]. However, tokenization designed for NLP can inadvertently discard crucial structural information in an image. To combat this potential information loss, XFormer [44] introduced an innovative solution: it separates tokens into distinct spatial and channel dimensions and enhances information fusion through the interaction of tokens from these different dimensions. Nevertheless, pixel-wise attention without tokenization remains better suited for image processing. To address the enormous computational demands of pixel attention, SwinIR [25] proposed a workaround that divides an image into smaller windows and performs attention operations within each window. This efficient approach paved the way for further advancements: Fan et al. applied SwinIR to non-blind denoising [26], Li et al. integrated it with wavelet transforms [27], and another contribution applied it to self-supervised denoising [29].
While dividing images into windows offers computational advantages for transformer-based models, it also restricts interaction between windows. Researchers have proposed numerous methods to address this limitation and improve global interaction. Uformer [45], for instance, integrated a multi-scale path into the transformer. Xiao et al. [46] introduced a strategy of randomly shuffling an image before window division. Zhang et al. [47] took a different approach, creating patches with interleaved pixels by sampling at fixed intervals rather than using direct window divisions. Other solutions involve adding new pathways or modifying attention mechanisms. Liu et al. [48] added a path specifically to calculate attention among corresponding positions of each window. Zhao et al. [49] adopted a two-stage approach, applying attention to superpixels first and pixel-wise attention second. Jiang et al. [50] incorporated the fast Fourier transform for global modeling. The Heterogeneous Window Transformer (HWformer) [51] shifts windows both horizontally and vertically to enrich spatial information. Zhang et al. [52] proposed "quadrangle attention", which dynamically determines the location, size, orientation, and shape of each window.
Taking an entirely different direction, Restormer [24,30] replaced pixel attention with channel attention. This architectural shift dramatically reduces computational complexity while maintaining impressive performance in image restoration tasks. More recently, Yao et al. [30] further enhanced the Restormer architecture by incorporating neural degradation representation, enabling an all-in-one approach to image restoration. A summary of these strategies is provided in Table 1. In this work, unlike these previous studies, the possibility of flexibly adjusting the complexity of the attention mechanism is explored. The following sections describe how this goal is accomplished.

3. Proposed Method

Image denoising under supervised learning can be described as estimating a clean image $X$ from a noisy input $Y$. The proposed neural network finds an estimate $\hat{X}$ from $Y$ with the goal of minimizing the $L_1$ loss $\|\hat{X} - X\|_1$. In the following, the neural network architecture designed to achieve this purpose is described.

3.1. Overall Architecture

As shown in Figure 1, the network employs a U-Net structure similar to Restormer [24] and NAFNet [23]. It begins with a convolution layer that extracts shallow features and then proceeds to a U-Net comprising three symmetric encoder and decoder stages with an intermediate stage connecting them. The encoder consists of three successive stages, each centered around a "Cascaded Transformer Block", which is a series of individual transformer blocks as depicted in Figure 2. To capture increasingly complex global dependencies, the three stages progressively increase the channel size while reducing spatial resolution. Downsampling between stages is performed using a convolution layer and pixel unshuffle operations. A fourth stage acts as the intermediate bottleneck connecting the encoder and decoder, with the highest channel size and lowest spatial resolution. The decoder mirrors the encoder stages but uses a convolution layer and pixel shuffle operations for upsampling to restore the original image dimensions. To preserve fine spatial details that might be lost during the hierarchical transitions, skip connections are utilized; they concatenate high-resolution features from the encoder directly into the corresponding decoder stages. Following the U-Net, an additional cascaded transformer block and a final convolution layer further refine the output to complete the restoration. The specific number of transformers and associated parameters within each cascaded transformer block are listed in Table 2.
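To make the data flow concrete, the following PyTorch sketch outlines this U-Net layout, with transformer counts, head counts, and channel sizes taken from Table 2. The `CascadedTransformerBlock` here is a runnable stand-in for the blocks of Section 3.2, and the fusion convolutions after each skip connection and the global residual are assumptions, not confirmed details of the authors' implementation.

```python
import torch
import torch.nn as nn

class CascadedTransformerBlock(nn.Sequential):
    # Stand-in for the cascaded transformer blocks of Figure 2; a chain of
    # depthwise convolutions keeps this sketch runnable. n_heads is recorded
    # for fidelity to Table 2 but unused by this placeholder.
    def __init__(self, c, n_transformers, n_heads):
        super().__init__(*[nn.Conv2d(c, c, 3, padding=1, groups=c)
                           for _ in range(n_transformers)])

class Downsample(nn.Module):
    # Conv halves the channels, pixel-unshuffle quarters the spatial area and
    # quadruples the channels: net effect is 2x channels at 1/2 resolution.
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(c, c // 2, 3, padding=1, bias=False),
                                  nn.PixelUnshuffle(2))
    def forward(self, x):
        return self.body(x)

class Upsample(nn.Module):
    # Mirror of Downsample: 1/2x channels at 2x resolution via pixel shuffle.
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(c, c * 2, 3, padding=1, bias=False),
                                  nn.PixelShuffle(2))
    def forward(self, x):
        return self.body(x)

class DenoiseUNet(nn.Module):
    def __init__(self, in_c=3, c=48):
        super().__init__()
        self.shallow = nn.Conv2d(in_c, c, 3, padding=1)
        self.enc1, self.down1 = CascadedTransformerBlock(c, 4, 1), Downsample(c)
        self.enc2, self.down2 = CascadedTransformerBlock(2 * c, 6, 2), Downsample(2 * c)
        self.enc3, self.down3 = CascadedTransformerBlock(4 * c, 6, 4), Downsample(4 * c)
        self.mid = CascadedTransformerBlock(8 * c, 8, 8)   # bottleneck stage
        self.up3, self.fuse3 = Upsample(8 * c), nn.Conv2d(8 * c, 4 * c, 1)
        self.dec3 = CascadedTransformerBlock(4 * c, 6, 4)
        self.up2, self.fuse2 = Upsample(4 * c), nn.Conv2d(4 * c, 2 * c, 1)
        self.dec2 = CascadedTransformerBlock(2 * c, 6, 2)
        self.up1, self.fuse1 = Upsample(2 * c), nn.Conv2d(2 * c, c, 1)
        self.dec1 = CascadedTransformerBlock(c, 4, 1)
        self.refine = CascadedTransformerBlock(c, 4, 1)
        self.out = nn.Conv2d(c, in_c, 3, padding=1)

    def forward(self, y):                 # y: (B, 3, H, W), H and W divisible by 8
        e1 = self.enc1(self.shallow(y))
        e2 = self.enc2(self.down1(e1))
        e3 = self.enc3(self.down2(e2))
        m = self.mid(self.down3(e3))
        # Skip connections concatenate encoder features into the decoder.
        d3 = self.dec3(self.fuse3(torch.cat([self.up3(m), e3], 1)))
        d2 = self.dec2(self.fuse2(torch.cat([self.up2(d3), e2], 1)))
        d1 = self.dec1(self.fuse1(torch.cat([self.up1(d2), e1], 1)))
        return self.out(self.refine(d1)) + y   # global residual (assumption)
```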

3.2. Attention with Adjustable Complexity and Transformer

The core innovation in the proposed architecture is the adjustable attention mechanism (Figure 3c) used in the transformer block. Before presenting the proposed adjustable attention, the mathematical foundation of conventional attention needs to be examined. Attention is a mathematical operation involving the multiplication of three matrices, $\mathrm{softmax}(QK^T)V$, where $Q$, $K$, and $V$ are the query, key, and value. Computational complexity is therefore directly related to the sizes of these three matrices. For pixel-wise attention, as shown in Figure 3a, the sizes of $Q$ and $K^T$ are $HW \times C$ and $C \times HW$, respectively.
Thus, computing the attention score $\mathrm{softmax}(QK^T)$ produces a matrix of size $HW \times HW$, which is the cause of the quadratic complexity of spatial attention. In contrast, as shown in Figure 3b, channel attention swaps the dimensions of $Q$ and $K^T$, making them $C \times HW$ and $HW \times C$, respectively. The attention score matrix then has size $C \times C$, which is significantly more efficient.
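The difference in attention-score size is easy to verify numerically. The snippet below is a minimal sketch with illustrative dimensions (a 64 × 64 feature map is used so that the pixel-attention matrix fits comfortably in memory):

```python
import torch

H, W, C = 64, 64, 48                      # small illustrative feature map

# Pixel-wise attention: rows of Q and K are pixels; the score is (HW x HW).
Q = torch.randn(H * W, C)
K = torch.randn(H * W, C)
print((Q @ K.T).shape)                    # torch.Size([4096, 4096])

# Channel-wise attention: rows are channels; the score is (C x C),
# independent of the image resolution.
Qc = torch.randn(C, H * W)
Kc = torch.randn(C, H * W)
print((Qc @ Kc.T).shape)                  # torch.Size([48, 48])
```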
The proposed adjustable attention differs from conventional attention in two ways. First, two average pooling layers with variable stride $F$ are added before the key and value are generated by convolutions; their kernel sizes are $k \times k$ for the key and $v \times v$ for the value. Second, variable-size convolutions are used to generate the key, query, and value instead of the fixed $1 \times 1$ or $3 \times 3$ convolutions typically used in conventional attention: a $k \times k$ convolution for the key, a $q \times q$ convolution for the query, and a $v \times v$ convolution for the value. The convolution kernel sizes for the key and value match those of the pooling layers to avoid introducing too many free parameters. The most important consequence of these changes is that the key and value matrices become $\frac{H}{F} \times \frac{W}{F} \times C$ instead of $H \times W \times C$ as in conventional attention. After reshaping the query matrix to $(HW) \times C$ and the key matrix to $\frac{HW}{F^2} \times C$, their product $QK^T$ yields an attention score matrix of size $(HW) \times \frac{HW}{F^2}$. The complexity of the attention mechanism can therefore be controlled flexibly through the stride $F$ of the average pooling layers. For example, with stride $F = 32$ and an input of size $256 \times 256 \times 48$, the key and value matrices become $8 \times 8 \times 48$ instead of $256 \times 256 \times 48$, and the attention score matrix becomes $65{,}536 \times 64$ instead of the $65{,}536 \times 65{,}536$ of conventional spatial attention.
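A single head of the adjustable attention can be sketched in PyTorch as follows. The pooling strides and convolution sizes follow the description above, while the padding scheme and the softmax temperature are assumptions made to keep the sketch concrete, not confirmed implementation details.

```python
import torch
import torch.nn as nn

class AdjustableAttention(nn.Module):
    def __init__(self, c, stride_f=32, k=7, q=7, v=5):
        super().__init__()
        self.scale = c ** -0.5                       # softmax temperature (assumption)
        # Average pooling with stride F reduces key/value maps to ~H/F x W/F.
        self.pool_k = nn.AvgPool2d(k, stride=stride_f, padding=k // 2)
        self.pool_v = nn.AvgPool2d(v, stride=stride_f, padding=v // 2)
        # Variable-size convolutions generate query, key, and value.
        self.to_q = nn.Conv2d(c, c, q, padding=q // 2)
        self.to_k = nn.Conv2d(c, c, k, padding=k // 2)
        self.to_v = nn.Conv2d(c, c, v, padding=v // 2)

    def forward(self, x):                            # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.to_q(x)                             # full resolution
        k = self.to_k(self.pool_k(x))                # reduced resolution
        v = self.to_v(self.pool_v(x))
        q = q.flatten(2).transpose(1, 2)             # (B, HW, C)
        k = k.flatten(2).transpose(1, 2)             # (B, ~HW/F^2, C)
        v = v.flatten(2).transpose(1, 2)
        attn = (q @ k.transpose(1, 2)) * self.scale  # (B, HW, ~HW/F^2)
        out = attn.softmax(dim=-1) @ v               # (B, HW, C)
        return out.transpose(1, 2).reshape(B, C, H, W)

x = torch.randn(1, 48, 256, 256)
print(AdjustableAttention(48, stride_f=32)(x).shape)  # torch.Size([1, 48, 256, 256])
```

With these numbers the pooled key and value are 8 × 8, so the score matrix matches the $65{,}536 \times 64$ shape worked out above.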
This adjustable attention mechanism is applied in the transformer block shown in Figure 2. The input $X$ first passes through a layer normalization. The channel dimension $C$ is then divided into $h$ heads, and each head passes through an adjustable attention module. The outputs from all $h$ heads are concatenated, passed through a $1 \times 1$ convolution, and added to the input via a skip connection to form the first residual block. A spatial gated feed-forward network (SGFN) follows the multi-head attention. The SGFN starts with a layer normalization, followed by two parallel paths of $1 \times 1$ and $3 \times 3$ convolutions. The $1 \times 1$ convolution on the left path expands the channel dimension by a factor $\gamma = 2.66$, and a Gaussian error linear unit (GELU) activation on the left path generates spatially varying weights. Features generated by the right path are multiplied with these spatially varying weights. Finally, the spatially weighted features pass through a $1 \times 1$ convolution and are added to the multi-head attention output via a skip connection to complete the second residual block.
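Continuing the sketch, the transformer block below combines per-head adjustable attention (reusing the `AdjustableAttention` class from the previous sketch) with the SGFN. Using `GroupNorm(1, C)` as a channel-wise layer normalization, and expanding both SGFN paths to the same hidden width so the element-wise gating is well-defined, are assumptions about details the text leaves open.

```python
import torch
import torch.nn as nn

class SGFN(nn.Module):
    def __init__(self, c, gamma=2.66):
        super().__init__()
        hidden = int(c * gamma)
        self.norm = nn.GroupNorm(1, c)               # channel-wise LayerNorm stand-in
        # Left path: 1x1 conv expands channels; GELU yields spatial gates.
        self.gate = nn.Sequential(nn.Conv2d(c, hidden, 1), nn.GELU())
        # Right path: 3x3 conv, expanded to the same width (assumption).
        self.feat = nn.Conv2d(c, hidden, 3, padding=1)
        self.proj = nn.Conv2d(hidden, c, 1)

    def forward(self, x):
        y = self.norm(x)
        return self.proj(self.gate(y) * self.feat(y))    # spatial gating

class TransformerBlock(nn.Module):
    def __init__(self, c, heads, stride_f=32, k=7, q=7, v=5):
        super().__init__()
        assert c % heads == 0
        self.heads = heads
        self.norm = nn.GroupNorm(1, c)
        # One adjustable-attention module per head, each on C/heads channels.
        self.attn = nn.ModuleList([AdjustableAttention(c // heads, stride_f, k, q, v)
                                   for _ in range(heads)])
        self.merge = nn.Conv2d(c, c, 1)
        self.ffn = SGFN(c)

    def forward(self, x):
        y = self.norm(x)
        parts = y.chunk(self.heads, dim=1)           # split channels into heads
        y = torch.cat([a(p) for a, p in zip(self.attn, parts)], dim=1)
        x = x + self.merge(y)                        # first residual connection
        return x + self.ffn(x)                       # second residual connection
```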
The following parameters are used for the two denoising tasks. For real denoising, the stride factor $F$ is 256, the kernel sizes $K$ and $Q$ are 11, and the kernel size $V$ is 9. For Gaussian denoising, the stride factor $F$ is 1024, the kernel sizes $K$ and $Q$ are 7, and the kernel size $V$ is 5. The rationale behind these settings is explained in the ablation studies of Section 4.

3.3. Distinctions from Other Efficiency-Focused Mechanisms

Existing efficiency-focused mechanisms include downsampled attention, low-rank approximation, and linear attention variants. Compared with downsampled attention, the adjustable complexity attention preserves the original spatial resolution of the query matrix, as shown in Figure 3c, whereas downsampled attention must reduce the spatial resolution of the input $X$ and thus loses critical details. Similarly, while low-rank approximations approximate the input via decomposition, the proposed method maintains a full-rank representation of the query matrix without any lossy approximation. Compared with linear attention variants, which achieve linear complexity by approximating the softmax of the attention scores through the associative property of matrix multiplication, the proposed approach targets spatial resolution flexibility without replacing the nonlinearity with a linear approximation. In summary, the proposed adjustable attention mechanism achieves efficiency through careful matrix manipulation rather than the approximation strategies used by other efficiency-focused mechanisms.

4. Experimental Results

4.1. Experimental Setup

The proposed network is trained for 300,000 iterations using the adaptive moment estimation with weight decay (AdamW) optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and weight decay $1 \times 10^{-4}$. The initial learning rate is $3 \times 10^{-4}$, and it gradually drops to $1 \times 10^{-6}$ through cosine annealing. For Gaussian denoising training, images in DIV2K [53], Flickr2K [54], Waterloo Exploration [55], and BSD400 [56] are used. For real denoising training, the SIDD medium dataset [9] is used. Image patches are randomly extracted from the training images using a progressive learning strategy [24]; i.e., the patch size starts at $96 \times 96$ and is updated to $120 \times 120$, $144 \times 144$, $192 \times 192$, $240 \times 240$, and $256 \times 256$ at iterations 92,000, 156,000, 204,000, 240,000, and 276,000, respectively. For blind color Gaussian denoising, zero-mean Gaussian noise is added with a standard deviation randomly selected in the range of 0 to 50. For data augmentation, random horizontal and vertical flips are applied. The hardware platform is an Intel i7-12700F CPU and an NVIDIA GeForce RTX 3060 GPU with 12 GB of memory. The software environment is Microsoft Windows 10, Python 3.8.8, PyTorch 1.10.0, and CUDA 11.3.
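As a sketch of this optimization setup (data loading, the progressive patch-size schedule, and flip augmentation are elided, and the model is a trivial stand-in to keep the snippet self-contained; substitute the network sketched in Section 3):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)        # stand-in; substitute DenoiseUNet
opt = torch.optim.AdamW(model.parameters(), lr=3e-4,
                        betas=(0.9, 0.999), weight_decay=1e-4)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=300_000,
                                                   eta_min=1e-6)

def add_blind_gaussian_noise(clean):
    # Blind training: sigma drawn uniformly from [0, 50] on a 0-255 scale.
    sigma = torch.rand(1).item() * 50.0
    return clean + torch.randn_like(clean) * (sigma / 255.0)

for step in range(300_000):
    clean = torch.rand(8, 3, 96, 96)         # placeholder for a training batch
    noisy = add_blind_gaussian_noise(clean)
    loss = (model(noisy) - clean).abs().mean()   # the L1 objective of Section 3
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()                             # cosine annealing per iteration
```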
Tests for blind Gaussian color denoising are performed on the CBSD68 [56], Kodak24 [57], McMaster [58], and Urban100 [59] datasets. Tests for real denoising are performed on the SIDD validation dataset. The proposed algorithm is compared with the following methods, listed by publication year: DnCNN [7], FFDNet [31], BUIFD [32], SwinIR [25], DAGL [60], DeamNet [61], MPRNet [62], Restormer [24], CasaPuNet [63], GrencNet [17], DCANet [64], DRANet [20], TSP-RDANet [65], VIRNet [66], AKDT [67], and DCBDNet [68]. Except for SwinIR, all of these methods can be used for either real denoising or blind Gaussian denoising. SwinIR is trained only for non-blind Gaussian denoising, but it is included in the comparison because of its close relationship with the proposed approach. It is worth noting that non-blind denoising is inherently advantaged over blind denoising, since the noise level is supplied as a prior. Peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) [69] are used for quantitative comparison. To ensure a fair comparison, instead of directly quoting the numbers reported in the original papers, all quantitative measures are calculated from images generated with the pre-trained models provided by the authors.
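For reference, PSNR and SSIM can be computed with scikit-image as in the hedged sketch below; the exact evaluation scripts used in the paper may differ (e.g., in border handling or color conversion).

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(clean, restored):
    """clean, restored: uint8 RGB images of shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(clean, restored, data_range=255)
    ssim = structural_similarity(clean, restored, data_range=255,
                                 channel_axis=-1)   # requires skimage >= 0.19
    return psnr, ssim

clean = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
noisy = np.clip(clean + np.random.normal(0, 25, clean.shape), 0, 255).astype(np.uint8)
print(evaluate(clean, noisy))
```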

4.2. Denoising Performance

Table 3 shows the quantitative comparison on the SIDD dataset. From the table, it can be seen that the proposed algorithm is comparable to the state-of-the-art approaches proposed within the past two years. Due to hardware constraints, it was not possible to train on the larger patch size of 384 × 384 used by Restormer. Since transformer-based architectures rely on capturing long-range dependencies, larger training patches typically provide a performance advantage by allowing the model to learn from a broader spatial context. Consequently, the comparison presented here is inherently stricter for the proposed method, as it was evaluated under more constrained training conditions than its primary competitor, Restormer. Despite this disadvantage, the adjustable attention mechanism demonstrates highly competitive PSNR and SSIM results, suggesting that the proposed architecture is exceptionally efficient at capturing relevant features even when trained on smaller spatial regions.
Figure 4 shows the qualitative comparison for one of the images in the SIDD dataset. The image includes both flat areas and strong edges. While image quality in the flat areas is generally the same for all methods, differences at the edges are pronounced. It can be seen from the figure that both the proposed method and Restormer clearly restore the line structures, while the other methods show some defects in line restoration.
Table 4 shows PSNR, and Table 5 shows SSIM of the color Gaussian denoising results. Similar to real denoising, it can be seen that the proposed algorithm achieves performance comparable to Restormer and outperforms other state-of-the-art approaches. Figure 5 shows the qualitative comparison of one of the images in Urban100 dataset, which is a challenging dataset with rich texture and structure. Differences in image quality are especially easy to distinguish in the cropped images. It can be seen that both the proposed method and Restormer clearly restore the line structures with outstanding noise removal capability in the flat areas, while other methods show either defects in the lines or artifacts in the flat areas.

4.3. Complexity Analysis

Figure 6 shows the number of operations and the number of parameters vs. the adjustable variables, measured with an input of $256 \times 256 \times 3$. The mathematical relationship established in Section 3.2, in which computational complexity is inversely proportional to the square of the stride factor $F$, is empirically validated in Figure 6a. On the other hand, the number of parameters remains the same for various values of $F$ because the parameter count depends on the convolution kernel sizes, not on the stride factor. Figure 6b shows the influence of the three convolution kernel sizes on computational complexity. Their influence is significantly smaller than that of the stride factor, and the kernel size of $K$ or $Q$ has more influence on complexity than that of $V$. For kernel $K$ or $Q$, as the kernel size increases from 3 to 11, the number of operations increases by 10% while the number of parameters increases by 8%. For kernel $V$, as the kernel size increases from 3 to 11, the number of operations increases by only 0.2% while the number of parameters increases by 4%. In summary, the complexity of the proposed attention mechanism behaves as predicted and can be effectively controlled by the stride factor $F$ without resorting to window division or channel adjustment, an advantage that other attention mechanisms lack, as the quick check below illustrates.
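The inverse-square dependence on $F$ can be confirmed with a back-of-the-envelope count of the multiply-accumulate operations in the attention-score product $QK^T$ alone; this sketch deliberately ignores the convolutions, softmax, and the rest of the network.

```python
H = W = 256
C = 48
for F in (8, 16, 32, 64, 128):
    n_kv = (H // F) * (W // F)        # key/value positions after pooling
    macs = (H * W) * n_kv * C         # (HW x C) @ (C x HW/F^2)
    print(f"F={F:4d}: {macs / 1e9:.4f} GMACs for the QK^T product")
```

Doubling $F$ quarters the key/value positions, and the printed counts fall by 4x accordingly.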
Table 6 shows the comparison of complexity with the tested algorithms. While the proposed model possesses a relatively high parameter count of 22.866 million, its total number of operations (120.536 G FLOPs) is notably lower than that of other top-performing models such as Restormer (155.894 G) and MPRNet (1393.831 G). This distinction is critical: although the parameter count implies a higher training cost, the lower number of operations keeps the model computationally efficient at inference time.

4.4. Ablation Study

To find the optimal parameters for denoising, Table 7 and Table 8 show quantitative measurements vs. the variable parameters in the proposed adjustable attention. Average PSNR and SSIM across the tested σ values and datasets are used for the selection process. From the tables, it can be seen that the stride factor F has a much larger influence on performance than the kernel sizes. Performance flattens out beyond a certain stride factor F. This phenomenon occurs because the average pooling operation essentially performs a global averaging once the stride factor F matches or exceeds the spatial dimensions of the feature map. At that point, the key and value matrices are reduced to a single global representation per channel. Consequently, further increases in F do not result in additional spatial resolution reduction or changes to the attention scores, as reflected in the identical metrics for F = 1024 and F = 2048 in Table 7 and, similarly, for F = 512 and F = 1024 in Table 8.
From these experiments, the chosen parameters are stride F = 1024 , kernel sizes K and Q = 7 , and kernel size V = 5 for Gaussian denoising, and stride F = 256 , kernel sizes K and Q = 11 , and kernel size V = 9 for real denoising. The reason that a larger stride factor of F = 1024 is selected for synthetic Gaussian noise can be attributed to the spatially independent and uniform nature of Gaussian noise, which allows the attention mechanism to capture global context through aggressive spatial downsampling without compromising the ability to distinguish signal from noise. Conversely, real-world noise is spatially correlated and signal-dependent. Consequently, a smaller stride factor of F = 256 is required to maintain a higher spatial resolution in the key and value spaces and, in turn, let the model effectively differentiate complex, non-homogeneous noise patterns from the fine structural details of the image.
Table 9 shows quantitative measurements with different configurations of the proposed adjustable attention mechanism. Three different configurations are tested: (1) removing the pooling operations, (2) replacing average pooling with max pooling, and (3) performing average pooling after convolutions. These configurations are compared with the final configuration: performing average pooling before convolutions. Experimental results show that the final configuration performs best among all the configurations.

5. Conclusions

In this paper, a novel adjustable attention mechanism designed to address the high computational cost of pixel-wise attention is introduced. By incorporating variable-stride average pooling and variable-size convolutions to generate the key and value matrices, the complexity of the attention mechanism can be adjusted dynamically. The proposed approach effectively balances the trade-off between capturing long-range dependencies and computational efficiency, offering a tunable middle ground between the spatial attention used in models like SwinIR and the channel attention in Restormer. Through extensive experiments, it is established that the proposed adjustable attention mechanism achieves performance comparable to state-of-the-art methods with the additional benefit of adjustable complexity.

Author Contributions

Conceptualization, J.-R.L. and L.-W.C.; methodology, J.-R.L. and W.L.; software, J.-R.L. and W.L.; validation, J.-R.L. and W.L.; formal analysis, J.-R.L.; investigation, W.L.; resources, J.-R.L. and W.L.; data curation, J.-R.L. and W.L.; writing—original draft preparation, J.-R.L. and W.L.; writing—review and editing, J.-R.L. and L.-W.C.; visualization, J.-R.L. and W.L.; supervision, J.-R.L. and L.-W.C.; project administration, J.-R.L. and L.-W.C.; funding acquisition, J.-R.L. and L.-W.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All data used in the paper are publicly available.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Elad, M.; Kawar, B.; Vaksman, G. Image denoising: The deep learning revolution and beyond—A survey paper. SIAM J. Imaging Sci. 2023, 16, 1594–1654.
2. Jiang, B.; Li, J.; Lu, Y.; Cai, Q.; Song, H.; Lu, G. Efficient image denoising using deep learning: A brief survey. Inf. Fusion 2025, 118, 103013.
3. Mao, J.; Sun, L.; Chen, J.; Yu, S. Overview of research on digital image denoising methods. Sensors 2025, 25, 2615.
4. Buades, A.; Coll, B.; Morel, J.M. A review of image denoising algorithms, with a new one. Multiscale Model. Simul. 2005, 4, 490–530.
5. Dabov, K.; Foi, A.; Katkovnik, V.; Egiazarian, K. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Trans. Image Process. 2007, 16, 2080–2095.
6. Chang, L.W.; Liao, J.R. Improving non-local means image denoising by correlation correction. Multidimens. Syst. Signal Process. 2023, 34, 147–162.
7. Zhang, K.; Zuo, W.; Chen, Y.; Meng, D.; Zhang, L. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Trans. Image Process. 2017, 26, 3142–3155.
8. Plotz, T.; Roth, S. Benchmarking denoising algorithms with real photographs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1586–1595.
9. Abdelhamed, A.; Lin, S.; Brown, M.S. A high-quality denoising dataset for smartphone cameras. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1692–1700.
10. Abdelhamed, A.; Timofte, R.; Brown, M.S.; Yu, S.; Park, B.; Jeong, J.; Jung, S.-W.; Kim, D.-W.; Chung, J.-R.; Liu, J.; et al. NTIRE 2019 challenge on real image denoising: Methods and results. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–17 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–14.
11. Mao, X.; Shen, C.; Yang, Y.B. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In 30th Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Curran Associates Inc.: Red Hook, NY, USA, 2016; pp. 2802–2810.
12. Chen, Y.; Pock, T. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1256–1272.
13. Plotz, T.; Roth, S. Neural nearest neighbors networks. In 32nd Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; Curran Associates Inc.: Red Hook, NY, USA, 2018; pp. 1095–1106.
14. Zhang, Y.; Li, K.; Li, K.; Zhong, B.; Fu, Y. Residual non-local attention networks for image restoration. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019; pp. 1–18.
15. Tian, C.; Xu, Y.; Li, Z.; Zuo, W.; Fei, L.; Liu, H. Attention-guided CNN for image denoising. Neural Netw. 2020, 124, 117–129.
16. Mei, Y.; Fan, Y.; Zhang, Y.; Yu, J.; Zhou, Y.; Liu, D.; Fu, Y.; Huang, T.S.; Shi, H. Pyramid attention network for image restoration. Int. J. Comput. Vis. 2023, 131, 3207–3225.
17. Pan, Y.; Ren, C.; Wu, X.; Huang, J.; He, X. Real image denoising via guided residual estimation and noise correction. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 1994–2000.
18. Qin, M.; Ren, C.; Yang, H.; He, X.; Wang, Z. Blind image denoising via deep unfolding network with degradation information guidance. IEEE Trans. Circuits Syst. II Express Briefs 2023, 70, 3179–3183.
19. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H.; Shao, L. Learning enriched features for fast image restoration and enhancement. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 1934–1948.
20. Wu, W.; Liu, S.; Xia, Y.; Zhan, Y. Dual residual attention network for image denoising. Pattern Recognit. 2024, 149, 110291.
21. Zafar, A.; Aftab, D.; Qureshi, R.; Fan, X.; Chen, P.; Wu, J. Single stage adaptive multi-attention network for image restoration. IEEE Trans. Image Process. 2024, 33, 2924–2935.
22. Ge, X.; Zhu, Y.; Qi, L.; Hu, Y.; Sun, J.; Zhang, Y. Enhancing border learning for better image denoising. Mathematics 2025, 13, 1119.
23. Chen, L.; Chu, X.; Zhang, X.; Sun, J. Simple baselines for image restoration. In 17th European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 17–33.
24. Zamir, S.W.; Arora, A.; Khan, S.H.; Hayat, M.; Khan, F.S.; Yang, M.H. Restormer: Efficient transformer for high-resolution image restoration. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA; IEEE: Piscataway, NJ, USA, 2022; pp. 5728–5739.
25. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image restoration using swin transformer. In IEEE/CVF International Conference on Computer Vision Workshops, Montreal, QC, Canada, 11–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1833–1844.
26. Fan, C.M.; Liu, T.J.; Liu, K.H. SUNet: Swin transformer UNet for image denoising. In IEEE International Symposium on Circuits and Systems, Austin, TX, USA, 28 May–1 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 2333–2337.
27. Li, J.; Cheng, B.; Chen, Y.; Gao, G.; Shi, J.; Zeng, T. EWT: Efficient wavelet-transformer for single image denoising. Neural Netw. 2024, 177, 106378.
28. Song, M.; Wang, W.; Zhao, Y. A dynamic network with transformer for image denoising. Electronics 2024, 13, 1676.
29. Li, J.; Zhang, Z.; Zuo, W. Rethinking transformer-based blind-spot network for self-supervised image denoising. Proc. AAAI Conf. Artif. Intell. 2025, 39, 4788–4796.
30. Yao, M.; Xu, R.; Guan, Y.; Huang, J.; Xiong, Z. Neural degradation representation learning for all-in-one image restoration. IEEE Trans. Image Process. 2024, 33, 5408–5423.
31. Zhang, K.; Zuo, W.; Zhang, L. FFDNet: Toward a fast and flexible solution for CNN based image denoising. IEEE Trans. Image Process. 2018, 27, 4608–4622.
32. El Helou, M.; Susstrunk, S. Blind universal Bayesian image denoising with Gaussian noise level learning. IEEE Trans. Image Process. 2020, 29, 4885–4897.
33. Du, Y.; Han, G.; Tan, Y.; Xiao, C.; He, S. Blind image denoising via dynamic dual learning. IEEE Trans. Multimed. 2021, 23, 2139–2152.
34. Zhang, Y.; Li, K.; Li, K.; Sun, G.; Kong, Y.; Fu, Y. Accurate and fast image denoising via attention guided scaling. IEEE Trans. Image Process. 2021, 30, 6255–6265.
35. Mou, C.; Zhang, J.; Fan, X.; Liu, H.; Wang, R. COLA-Net: Collaborative attention network for image restoration. IEEE Trans. Multimed. 2022, 24, 1366–1377.
36. Guo, S.; Yan, Z.; Zhang, K.; Zuo, W.; Zhang, L. Toward convolutional blind denoising of real photographs. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1712–1722.
37. Yue, Z.; Yong, H.; Zhao, Q.; Meng, D.; Ozair, S.; Zhang, L. Variational denoising network: Toward blind noise modeling and removal. In 33rd Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates Inc.: Red Hook, NY, USA, 2019; pp. 1690–1701.
38. Zhou, Y.; Jiao, J.; Huang, H.; Wang, Y. When AWGN-based denoiser meets real noises. Proc. AAAI Conf. Artif. Intell. 2020, 34, 13074–13081.
39. Tian, C.; Xu, Y.; Zuo, W. Image denoising using deep CNN with batch renormalization. Neural Netw. 2020, 121, 461–473.
40. Jang, Y.I.; Kim, Y.; Cho, N.I. Dual path denoising network for real photographic noise. IEEE Signal Process. Lett. 2020, 27, 860–864.
41. Zamir, S.W.; Arora, A.; Khan, S.H.; Munawar, H.; Khan, F.S.; Yang, M.H.; Shao, L. Learning enriched features for real image restoration and enhancement. In European Conference on Computer Vision—ECVA; Springer: Cham, Switzerland, 2020; pp. 492–511.
42. Chen, H.; Wang, Y.; Guo, T.; Xu, C.; Deng, Y.; Liu, Z.; Ma, S.; Xu, C.; Xu, C.; Gao, W. Pre-trained image processing transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 12299–12310.
43. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021; pp. 1–21.
44. Zhang, J.; Zhang, Y.; Gu, J.; Dong, J.; Kong, L.; Yang, X. Xformer: Hybrid X-shaped transformer for image denoising. In Proceedings of the 12th International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024; pp. 1–13.
45. Wang, Z.; Cun, X.; Bao, J.; Zhou, W.; Liu, J.; Li, H. Uformer: A general U-shaped transformer for image restoration. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 17662–17672.
46. Xiao, J.; Fu, X.; Zhou, M.; Liu, H.; Zha, Z.J. Random shuffle transformer for image restoration. In 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; PMLR: Cambridge, MA, USA, 2023; pp. 38039–38058.
47. Zhang, J.; Zhang, Y.; Gu, J.; Zhang, Y.; Kong, L.; Yuan, X. Accurate image restoration with attention retractable transformer. In Proceedings of the 11th International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023; pp. 1–13.
48. Liu, K.; Du, X.; Liu, S.; Zheng, Y.; Wu, X.; Jin, C. DDT: Dual-branch deformable transformer for image denoising. In IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 2765–2770.
49. Zhao, H.; Gou, Y.; Li, B.; Peng, D.; Lv, J.; Peng, X. Comprehensive and delicate: An efficient transformer for image restoration. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 14122–14132.
50. Jiang, X.; Zhang, X.; Gao, N.; Deng, Y. When fast Fourier transform meets transformer for image restoration. In 18th European Conference on Computer Vision (ECCV), Milano, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2024; pp. 381–402.
51. Tian, C.; Zheng, M.; Lin, C.W.; Li, Z.; Zhang, D. Heterogeneous window transformer for image denoising. IEEE Trans. Syst. Man Cybern. Syst. 2024, 54, 6621–6632.
52. Zhang, Q.; Zhang, J.; Xu, Y.; Tao, D. Vision transformer with quadrangle attention. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 3608–3624.
53. Timofte, R.; Agustsson, E.; Gool, L.V.; Yang, M.H.; Zhang, L.; Lim, B.; Son, S. NTIRE 2017 challenge on single image super-resolution: Methods and results. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1110–1121.
54. Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M. Enhanced deep residual networks for single image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1132–1140.
55. Ma, K.; Duanmu, Z.; Wu, Q.; Wang, Z.; Yong, H.; Li, H.; Zhang, L. Waterloo Exploration Database: New challenges for image quality assessment models. IEEE Trans. Image Process. 2017, 26, 1004–1016.
56. Martin, D.; Fowlkes, C.; Tal, D.; Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In 8th IEEE International Conference on Computer Vision (ICCV), Vancouver, BC, Canada, 7–14 July 2001; IEEE: Piscataway, NJ, USA, 2001; Volume 2, pp. 416–423.
57. Franzen, R. Kodak Lossless True Color Image Suite. Available online: http://r0k.us/graphics/kodak/ (accessed on 26 July 2025).
58. Zhang, L.; Wu, X.; Buades, A.; Li, X. Color demosaicking by local directional interpolation and nonlocal adaptive thresholding. J. Electron. Imaging 2011, 20, 023016.
59. Huang, J.B.; Singh, A.; Ahuja, N. Single image super-resolution from transformed self-exemplars. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 5197–5206.
60. Mou, C.; Zhang, J.; Wu, Z. Dynamic attentive graph learning for image restoration. In IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 4308–4317.
61. Ren, C.; He, X.; Wang, C.; Zhao, Z. Adaptive consistency prior based deep network for image denoising. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 8592–8602.
62. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H.; Shao, L. Multi-stage progressive image restoration. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 14821–14831.
63. Huang, J.; Liu, X.; Pan, Y.; He, X.; Ren, C. CasaPuNet: Channel affine self-attention-based progressively updated network for real image denoising. IEEE Trans. Ind. Inform. 2023, 19, 9145–9156.
64. Wu, W.; Lv, G.; Duan, Y.; Liang, P.; Zhang, Y.; Xia, Y. Dual convolutional neural network with attention for image blind denoising. Multimed. Syst. 2024, 30, 263.
65. Wu, W.; Ge, A.; Lv, G.; Xia, Y.; Zhang, Y.; Xiong, W. Two-stage progressive residual dense attention network for image denoising. arXiv 2024.
66. Yue, Z.; Yong, H.; Zhao, Q.; Zhang, L.; Meng, D.; Wong, K.Y.K. Deep variational network toward blind image restoration. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 7011–7026.
67. Brateanu, A.; Balmez, R.; Avram, A.; Orhei, C. AKDT: Adaptive kernel dilation transformer for effective image denoising. In Proceedings of the International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISAPP), Porto, Portugal, 26–28 February 2025; pp. 418–425.
68. Wu, W.; Liao, S.; Lv, G.; Liang, P.; Zhang, Y. Image blind denoising using dual convolutional neural network with skip connection. Signal Process. Image Commun. 2025, 138, 117365.
69. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
Figure 1. Architecture of the proposed denoising network.
Figure 2. Block diagram of a transformer block.
Figure 3. Block diagrams of pixel attention, channel attention, and adjustable attention: (a) is the pixel attention used in SwinIR, (b) is the channel attention used in Restormer, and (c) is the adjustable attention proposed in this paper.
Figure 4. Comparison of the denoised outputs for image 11, block 16 in the SIDD validation dataset. The third and fourth rows show the enlarged region marked by the red rectangle in the original image. First and third rows, from left to right: ground truth, noisy input, DAGL (31.47/0.758), and CasaPuNet (32.95/0.765). Second and fourth rows, from left to right: TSP-RDANet (32.70/0.766), AKDT (32.61/0.766), Restormer (32.92/0.770), and the proposed method (33.20/0.767). Numbers in brackets are PSNR in dB and SSIM for each method.
Figure 5. Comparison of the denoised outputs for image 81 in the Urban100 dataset under σ = 50 Gaussian noise. The third and fourth rows show the enlarged region marked by the red rectangle in the original image. First and third rows, from left to right: ground truth, noisy input, DnCNN (30.25/0.820), and BUIFD (30.55/0.816). Second and fourth rows, from left to right: DCBDNet (33.08/0.889), DCANet (33.29/0.894), Restormer (35.41/0.936), and the proposed method (34.70/0.921). Numbers in brackets are PSNR in dB and SSIM for each method.
Figure 6. Relationship between computational complexity and attention parameters. Blue lines indicate FLOPs and correspond to the left y-axis; red lines indicate the number of parameters and correspond to the right y-axis: (a) complexity vs. stride factor F and (b) complexity vs. kernel size.
Table 1. Categorical summary of image denoising strategies in the recent literature.

| Category | Strategy | Method Examples |
|---|---|---|
| Traditional | NSS-based | BM3D [5] |
| Early CNN | Basic residual network | DnCNN [7] |
| Early CNN | Noise-level map assistance | FFDNet [31] |
| Multi-path | Noise estimation | GrencNet [17], BDUNet [18], BUIFD [32], CBDNet [36], VDN [37], PD [38] |
| Multi-path | Convolutional strategies | MIRNet-v2 [19], BRDNet [39], DPN [40], MIRNet [41] |
| Attention | Residual non-local | RNAN [14] |
| Attention | Attention-guided | ADNet [15] |
| Attention | Pyramid attention | PANet [16] |
| Attention | Collaborative attention | COLA-Net [35] |
| Attention | Dynamic dual learning | DualBDNet [33] |
| Attention | Attention-guided scaling | AGS [34] |
| Transformer | Pixel-wise window attention | SwinIR [25,26] |
| Transformer | Channel attention | Restormer [24] |
| Transformer | Wavelet | EWT [27] |
| Transformer | Neural degradation representation | NDR [30] |
| Transformer | Multi-scale U-Net | Uformer [45] |
| Transformer | Random shuffling | ShuffleFormer [46] |
| Transformer | Interleaved pixels | ART [47] |
| Transformer | Dual-branch deformable | DDT [48] |
| Transformer | Superpixels | CODE [49] |
| Transformer | Fast Fourier transform | SFHformer [50] |
| Transformer | Heterogeneous window | HWformer [51] |
| Transformer | Quadrangle attention | QFormer [52] |
Table 2. Parameters for all the cascaded transformer blocks.

| Block Name | # of Transformers | # of Heads (in Each Transformer) | # of Channels (in Each Transformer) |
|---|---|---|---|
| Cascaded Transformer Block 1 | 4 | 1 | 48 |
| Cascaded Transformer Block 2 | 6 | 2 | 96 |
| Cascaded Transformer Block 3 | 6 | 4 | 192 |
| Cascaded Transformer Block 4 | 8 | 8 | 384 |
| Cascaded Transformer Block 5 | 6 | 4 | 192 |
| Cascaded Transformer Block 6 | 6 | 2 | 96 |
| Cascaded Transformer Block 7 | 4 | 1 | 48 |
| Cascaded Transformer Block 8 | 4 | 1 | 48 |
Table 3. PSNR (dB) and SSIM for real denoising of the SIDD validation dataset. Best and second-best values are indicated by bold and underline, respectively.

| Method | PSNR | SSIM |
|---|---|---|
| DAGL (2021) [60] | 38.87 | 0.906 |
| DeamNet (2021) [61] | 38.87 | 0.906 |
| GrencNet (2023) [17] | 39.45 | 0.912 |
| CasaPuNet (2023) [63] | 39.50 | 0.910 |
| TSP-RDANet (2024) [65] | 39.51 | 0.911 |
| VIRNet (2024) [66] | 39.62 | 0.912 |
| AKDT (2025) [67] | 39.62 | 0.912 |
| MPRNet (2021) [62] | 39.63 | 0.913 |
| Restormer (2022) [24] | **39.93** | **0.915** |
| Ours | 39.65 | 0.913 |
Table 4. PSNR (dB) of denoising algorithms under Gaussian noise of different σ. Best and second-best values are indicated by bold and underline, respectively. Because SwinIR is a non-blind algorithm, it is listed as a reference and not ranked.

| Method | CBSD68 σ=15 | CBSD68 σ=25 | CBSD68 σ=50 | Kodak24 σ=15 | Kodak24 σ=25 | Kodak24 σ=50 |
|---|---|---|---|---|---|---|
| DnCNN (2017) [7] | 33.89 | 31.24 | 27.94 | 34.60 | 32.14 | 28.95 |
| FFDNet (2017) [31] | 33.88 | 31.22 | 27.97 | 34.75 | 32.25 | 29.11 |
| BUIFD (2020) [32] | 33.89 | 31.24 | 27.98 | 34.69 | 32.22 | 29.10 |
| DCBDNet (2025) [68] | 34.01 | 31.40 | 28.21 | 34.89 | 32.47 | 29.41 |
| DCANet (2024) [64] | 34.05 | 31.44 | 28.27 | 34.98 | 32.56 | 29.51 |
| TSP-RDANet (2024) [65] | 34.14 | 31.51 | 28.32 | 35.10 | 32.66 | 29.60 |
| DRANet (2024) [20] | 34.17 | 31.55 | 28.35 | 35.15 | 32.71 | 29.65 |
| Restormer (2022) [24] | **34.39** | **31.78** | **28.60** | **35.44** | **33.02** | **30.00** |
| Ours | 34.29 | 31.66 | 28.46 | 35.30 | 32.86 | 29.81 |
| SwinIR (2021) [25] | 34.41 | 31.78 | 28.56 | 35.46 | 33.01 | 29.95 |

| Method | McMaster σ=15 | McMaster σ=25 | McMaster σ=50 | Urban100 σ=15 | Urban100 σ=25 | Urban100 σ=50 |
|---|---|---|---|---|---|---|
| DnCNN (2017) [7] | 33.45 | 31.52 | 28.61 | 32.98 | 30.81 | 27.59 |
| FFDNet (2017) [31] | 34.65 | 32.35 | 29.18 | 33.83 | 31.40 | 28.05 |
| BUIFD (2020) [32] | 34.18 | 32.03 | 29.00 | 33.67 | 31.31 | 28.00 |
| DCBDNet (2025) [68] | 34.76 | 32.56 | 29.54 | 34.05 | 31.77 | 28.53 |
| DCANet (2024) [64] | 34.83 | 32.62 | 29.59 | 34.17 | 31.90 | 28.76 |
| TSP-RDANet (2024) [65] | 35.06 | 32.80 | 29.73 | 34.43 | 32.14 | 28.99 |
| DRANet (2024) [20] | 35.09 | 32.84 | 29.77 | 34.49 | 32.23 | 29.09 |
| Restormer (2022) [24] | **35.55** | **33.31** | **30.29** | **35.06** | **32.91** | **30.16** |
| Ours | 35.34 | 33.08 | 30.01 | 34.60 | 32.45 | 29.42 |
| SwinIR (2021) [25] | 35.61 | 33.32 | 30.20 | 35.16 | 32.93 | 29.86 |
Table 5. SSIM of denoising algorithms under Gaussian noise of different σ. Bold indicates the top value. Underline indicates the second value from the top. Because SwinIR is a non-blind algorithm, it is listed as a reference and not ranked.

| Method | CBSD68 σ=15 | CBSD68 σ=25 | CBSD68 σ=50 | Kodak24 σ=15 | Kodak24 σ=25 | Kodak24 σ=50 |
|---|---|---|---|---|---|---|
| DnCNN (2017) [7] | 0.932 | 0.887 | 0.793 | 0.923 | 0.879 | 0.790 |
| FFDNet (2017) [31] | 0.932 | 0.886 | 0.792 | 0.924 | 0.881 | 0.794 |
| BUIFD (2020) [32] | 0.931 | 0.886 | 0.793 | 0.922 | 0.879 | 0.793 |
| DCBDNet (2025) [68] | 0.934 | 0.891 | 0.804 | 0.927 | 0.887 | 0.807 |
| DCANet (2024) [64] | 0.935 | 0.892 | 0.807 | 0.928 | 0.888 | 0.812 |
| TSP-RDANet (2024) [65] | 0.935 | 0.893 | 0.808 | 0.929 | 0.890 | 0.813 |
| DRANet (2024) [20] | 0.936 | 0.893 | 0.808 | 0.929 | 0.891 | 0.814 |
| Restormer (2022) [24] | **0.938** | **0.898** | **0.817** | **0.933** | **0.896** | **0.824** |
| Ours | 0.937 | 0.896 | 0.813 | 0.931 | 0.894 | 0.820 |
| SwinIR (2021) [25] | 0.939 | 0.898 | 0.816 | 0.933 | 0.896 | 0.823 |

| Method | McMaster σ=15 | McMaster σ=25 | McMaster σ=50 | Urban100 σ=15 | Urban100 σ=25 | Urban100 σ=50 |
|---|---|---|---|---|---|---|
| DnCNN (2017) [7] | 0.907 | 0.873 | 0.799 | 0.934 | 0.904 | 0.835 |
| FFDNet (2017) [31] | 0.925 | 0.889 | 0.816 | 0.944 | 0.914 | 0.850 |
| BUIFD (2020) [32] | 0.918 | 0.882 | 0.811 | 0.940 | 0.910 | 0.843 |
| DCBDNet (2025) [68] | 0.926 | 0.895 | 0.831 | 0.946 | 0.920 | 0.862 |
| DCANet (2024) [64] | 0.927 | 0.896 | 0.833 | 0.947 | 0.922 | 0.867 |
| TSP-RDANet (2024) [65] | 0.931 | 0.900 | 0.838 | 0.949 | 0.925 | 0.872 |
| DRANet (2024) [20] | 0.930 | 0.900 | 0.838 | 0.949 | 0.925 | 0.873 |
| Restormer (2022) [24] | **0.937** | **0.909** | **0.854** | **0.954** | **0.933** | **0.892** |
| Ours | 0.935 | 0.905 | 0.847 | 0.951 | 0.929 | 0.881 |
| SwinIR (2021) [25] | 0.937 | 0.909 | 0.851 | 0.954 | 0.933 | 0.888 |
Table 6. Number of operations and number of parameters for tested algorithms. Bold indicates the proposed approach.

| Model | FLOPs | Parameters |
|---|---|---|
| FFDNet (2017) [31] | 13.978 G | 0.852 M |
| DnCNN (2017) [7] | 43.873 G | 0.668 M |
| DCBDNet (2025) [68] | 49.173 G | 1.013 M |
| AKDT (2025) [67] | 60.152 G | 11.424 M |
| DCANet (2024) [64] | 75.543 G | 1.389 M |
| BUIFD (2020) [32] | 78.428 G | 1.196 M |
| **Ours** | **120.536 G** | **22.866 M** |
| DeamNet (2021) [61] | 146.369 G | 1.876 M |
| Restormer (2022) [24] | 155.894 G | 26.097 M |
| VIRNet (2024) [66] | 159.488 G | 15.404 M |
| CasaPuNet (2023) [63] | 240.635 G | 2.411 M |
| DAGL (2021) [60] | 273.386 G | 5.717 M |
| DRANet (2024) [20] | 589.428 G | 1.617 M |
| GrencNet (2023) [17] | 611.020 G | 2.019 M |
| TSP-RDANet (2024) [65] | 678.123 G | 2.848 M |
| SwinIR (2021) [25] | 808.369 G | 11.456 M |
| MPRNet (2021) [62] | 1393.831 G | 15.741 M |
Table 7. Average PSNR and SSIM with respect to stride factor F and kernel sizes Q, K, and V for Gaussian denoising. Bold indicates the best value.

Kernel K = Q = 7, V = 5:

| Stride Factor F | 128 | 256 | 512 | 1024 | 2048 |
|---|---|---|---|---|---|
| PSNR | 26.62 | 29.08 | 30.10 | **30.39** | **30.39** |
| SSIM | 0.718 | 0.810 | 0.841 | **0.848** | **0.848** |

Stride F = 1024, Kernel V = 5:

| Kernel Size K and Q | 3 | 5 | 7 | 9 | 11 |
|---|---|---|---|---|---|
| PSNR | 30.38 | 30.38 | **30.39** | 30.38 | 30.38 |
| SSIM | 0.847 | 0.843 | **0.848** | 0.847 | 0.847 |

Stride F = 1024, Kernel K = Q = 7:

| Kernel Size V | 3 | 5 | 7 | 9 | 11 |
|---|---|---|---|---|---|
| PSNR | 30.34 | **30.39** | 30.38 | 30.38 | 30.38 |
| SSIM | 0.847 | **0.848** | 0.847 | 0.847 | 0.847 |
Table 8. Average PSNR and SSIM with respect to stride factor F and kernel sizes Q, K, and V for real denoising. Bold indicates the best value.

Kernel K = Q = 11, V = 9:

| Stride Factor F | 64 | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|---|
| PSNR | 26.71 | 36.35 | **38.84** | **38.84** | **38.84** |
| SSIM | 0.850 | 0.937 | **0.953** | **0.953** | **0.953** |

Stride F = 256, Kernel V = 9:

| Kernel Size K and Q | 5 | 7 | 9 | 11 | 13 |
|---|---|---|---|---|---|
| PSNR | 38.81 | 38.82 | 38.82 | **38.84** | 38.8 |
| SSIM | 0.952 | 0.952 | 0.952 | **0.953** | 0.952 |

Stride F = 256, Kernel K = Q = 11:

| Kernel Size V | 3 | 5 | 7 | 9 | 11 |
|---|---|---|---|---|---|
| PSNR | 38.76 | 38.81 | 38.80 | **38.84** | 38.79 |
| SSIM | 0.952 | 0.952 | 0.952 | **0.953** | 0.952 |
Table 9. Average PSNR and SSIM with different model configurations. Bold indicates the best value.

| Configuration | Gaussian PSNR | Gaussian SSIM | Real PSNR | Real SSIM |
|---|---|---|---|---|
| Without Pooling | 30.13 | 0.836 | 38.26 | 0.948 |
| Max Pooling | 30.28 | 0.845 | 38.64 | 0.951 |
| Avg Pooling After Conv | 30.30 | 0.845 | 38.63 | 0.951 |
| Final (Avg Pooling Before Conv) | **30.39** | **0.848** | **38.84** | **0.953** |