Article

Efficient Deep Image Prior with Spatial-Channel Attention Transformer

1
School of Big Data and Artificial Intelligence, Fujian Polytechnic Normal University, Fuzhou 350300, China
2
School of Surveying and Information Engineering, West Yunnan University of Applied Sciences, Dali 671002, China
3
School of Informatics, Xiamen University, Xiamen 361005, China
*
Author to whom correspondence should be addressed.
Mathematics 2026, 14(7), 1185; https://doi.org/10.3390/math14071185
Submission received: 9 February 2026 / Revised: 19 March 2026 / Accepted: 27 March 2026 / Published: 1 April 2026

Abstract

The deep image prior (DIP) shows that a randomly initialized network with a suitable architecture can solve inverse imaging problems simply by optimizing its parameters to reconstruct a single degraded image. However, the prior exploited by vanilla DIP relies on basic local convolutions, which limits performance on inverse imaging tasks to the generative capacity of the model. Image information depends not only on neighboring pixels but also on global color features and spatial distribution, so simple local convolutions cannot capture precise fine-grained details. Moreover, although DIP is unsupervised, it requires many iterations to solve each inverse problem, which consumes computational power and limits the adoption of global attention. To address these problems, this article explores an efficient global prior module, a tri-directional multi-head self-attention mechanism that learns pixel-wise correlations along three directions: horizontal, vertical, and channel-wise. We observe that global learning effectively enhances the detail information of edge pixels, making images more vivid and textures clearer. We further demonstrate that tri-directional multi-head self-attention can replace the global perception ability of pixel-level self-attention while alleviating its computational redundancy, thus achieving efficient and high-quality inverse imaging. The principle of this method lies in global feature capture and efficient attention modeling, striking a balance between detail fidelity and computational practicality.

1. Introduction

Recent years have witnessed the widespread application and remarkable success of deep neural networks in numerous computer vision and communication tasks. However, within the domain of image denoising, traditional methods such as BM3D [1] historically outperformed early deep learning-based approaches until the advent of methods like DnCNN [2]. DnCNN surpassed traditional methods in handling synthetic Gaussian noise, albeit at the cost of requiring a large dataset of clean–noisy image pairs for training [3,4,5,6,7,8].
In contrast, the deep image prior (DIP) [7] framework represents a paradigm shift, demonstrating that a randomly initialized convolutional network (often with an hourglass structure) can serve as a powerful implicit prior for various image inverse problems, including denoising, super-resolution [9,10,11,12,13], and general image restoration [14,15,16,17], without requiring any pre-collected clean–noisy image pairs. Despite its strengths, DIP exhibits notable limitations. First, its performance on image denoising tasks, even under synthetic Gaussian noise settings, is often significantly inferior to that of traditional methods such as BM3D unless meticulous early stopping is applied, which requires access to clean ground truth images. While subsequent variants such as the deep decoder attempted to mitigate this through stronger structural regularization, they suffered from lower model complexity and consequently compromised denoising performance.
A deeper, more structural limitation underpins these challenges. Fundamentally, the existing DIP and its variants predominantly rely on convolutional neural networks (CNNs) as their backbone [18,19,20,21]. While effective in smoothing out noise, this choice leads to two interconnected core problems:
Recovery Quality Limitation. The inherently local receptive field of standard convolution operations limits CNNs’ ability to effectively capture long-range dependencies and model complex, non-local image features. This often results in the suboptimal restoration of fine edge details and high-frequency textures, which are crucial for perceptual image quality.
Efficiency Constraint in the DIP Framework. The DIP process is inherently optimization-based, involving thousands of forward/backward passes during iterative reconstruction. The iterative nature of DIP means that any increase in the per-iteration computational overhead of the backbone network compounds significantly into high overall time and resource costs. Therefore, improving the representational power (or “generative capability”) of the DIP network cannot simply come at the cost of drastically increased model complexity or computation, as this would be prohibitively expensive.
In essence, the core challenge we address is: How can we design a DIP network backbone that achieves better recovery quality (particularly for edge and texture details) while maintaining, or even improving, its computational efficiency to remain practical for the iterative DIP process?
Concurrently, the field has seen the rapid development of Vision Transformers, which offer superior long-range modeling through self-attention mechanisms. However, their standard global self-attention has a cost that is quadratic in the number of pixels, making them computationally heavy [22]. Recent efforts have produced efficient transformer variants. For instance, Restormer [23,24,25] introduced a transposed attention mechanism that computes self-attention across channels rather than pixels to lower the computational overhead for restoration tasks. Building upon this progress, we hypothesize that an efficient, attention-based architecture can be designed to be compatible with the strict efficiency demands of DIP, potentially overcoming the locality constraints of CNNs.
To overcome the dual limitations of CNNs in the DIP framework, this paper introduces TMTA (a novel Triple Multi-Head Transposed Attention module) and integrates it within the DIP backbone. Instead of pixel-level global attention, our approach is inspired by channel attention mechanisms such as those in Restormer but extends them into a decomposed spatial-channel attention strategy. TMTA efficiently learns self-attention across both the channel and spatial dimensions (specifically, along the horizontal and vertical directions), capturing richer global context while avoiding the computational redundancy of full pixel attention. This design allows our method to achieve significant performance gains without incurring higher computational complexity.
The main contributions of this work are threefold:
First, we propose an enhanced, unsupervised DIP method based on our novel TMTA module. Our optimization is applied directly to the original DIP backbone, offering a seamless and effective plug-in component for existing and future DIP-based approaches.
Second, we pioneer the integration of an efficient self-attention mechanism into the DIP framework, striking a competitive balance with CNN-based DIP in terms of efficiency while substantially improving restoration performance, as evidenced in our experiments.
Third, the proposed approach yields measurable improvements in both reconstruction performance and iteration time, as shown in our experimental results. It can be regarded as a performant and efficient variant for the broader DIP algorithm family.

2. Proposed Method

This section introduces an efficient DIP (TM-DIP) based on Triple Multi-Head Transposed Attention (TMTA). Our main goal is to introduce global self-attention into the DIP task. Unlike previous DIP methods, TM-DIP counters the global smoothing of the convolution process while remaining computationally efficient, and it enhances edge and texture details through global self-attention learning. As a result, our method balances efficiency and performance. In the following subsections, we first review DIP and its backbone and give an overview of the proposed TM-DIP. We then elaborate on the core component, TMTA, and finally analyze its computational efficiency.

2.1. Deep Image Prior (DIP)

Let a noisy image $y \in \mathbb{R}^N$ be modeled as:
$$y = x + n, \tag{1}$$
where $x \in \mathbb{R}^N$ is the noiseless image that one would like to recover and $n \in \mathbb{R}^N$ is i.i.d. Gaussian noise such that $n \sim \mathcal{N}(0, \sigma^2 I)$, where $I$ is the identity matrix. Denoising can be formulated as the problem of predicting the unknown $x$ from the known noisy observation $y$. Ulyanov et al. [7] argued that the network architecture itself naturally encourages the restoration of the original image from a degraded image $y$ and named this effect the deep image prior (DIP). Specifically, DIP optimizes a convolutional neural network $h$ with parameters $\theta$ under a simple least-squares loss $L$:
$$\hat{\theta} = \arg\min_{\theta} L(h(\hat{n}; \theta), y), \tag{2}$$
where $\hat{n}$ is a random input that is independent of $y$. If $h(\cdot)$ has enough capacity (i.e., a sufficiently large number of parameters or architecture size) to fit the noisy image $y$, the output of the model $h(\hat{n}; \theta)$ will eventually equal $y$, which is not desirable. DIP therefore uses early stopping, selecting the result with the best PSNR with respect to the clean image.
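As a minimal sketch of the noise model in this subsection and of the PSNR-based early stopping described above, the NumPy snippet below may help. The function names (`add_gaussian_noise`, `psnr`, `best_iterate`), the [0, 1] intensity range, and the three hand-made "iterates" are illustrative assumptions and are not taken from the original method; no network is trained here.

```python
import numpy as np


def add_gaussian_noise(x, sigma, rng):
    """Return y = x + n with n ~ N(0, sigma^2 I), i.i.d. per pixel."""
    return x + rng.normal(0.0, sigma, size=x.shape)


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB; `peak` is the assumed maximum intensity."""
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))


def best_iterate(clean, outputs):
    """Index of the output with the highest PSNR against the clean image.

    This mimics oracle early stopping: it needs the clean image, which is
    the limitation of DIP noted in the Introduction.
    """
    scores = [psnr(clean, out) for out in outputs]
    return int(np.argmax(scores)), scores


rng = np.random.default_rng(0)
x = np.full((8, 8), 0.5)                      # toy "clean" image (assumption)
y = add_gaussian_noise(x, sigma=0.1, rng=rng)

# Hypothetical stand-ins for h(n_hat; theta_t) at three stages:
# far from the image, close to the clean image, and fully fitted to y.
outputs = [np.zeros_like(x), x + 0.2 * (y - x), y]
idx, scores = best_iterate(x, outputs)
print(idx, [round(s, 1) for s in scores])
```

In this toy setting the middle iterate is selected, because the last one has reproduced the noise in `y`. In practice the clean image is unavailable, so this selection rule serves only as a reference.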

2.2. Backbone of Deep Image Prior (DIP)

The prior defined by Equation (2) is implicit and does not define a proper probability distribution in image space. However, it is possible to draw “samples” (in a loose sense) from this prior by taking random values of the parameters $\theta$ and looking at the generated image $h(\hat{n}; \theta)$. In other words, we can visualize the starting points of the optimization in Equation (2) before the parameters are fitted to the noisy image. DIP analyzes such “samples” from the deep priors captured by different hourglass-type architectures. The generative network chosen for the classical DIP, shown in Figure 1, is still used in subsequent studies and is therefore the most popular choice of generative ConvNet [26,27,28,29,30]. In this work, we optimize this generative network at its root so that the improvement can be applied to all DIP-based research efforts.

2.3. Overview of TM-DIP

As shown in Figure 1, TM-DIP has four main components: the downsampling module, the upsampling module, the skip connection module, and our proposed spatial-channel self-attention module. The downsampling, upsampling, and skip connection modules follow the structure of DIP, while the added spatial-channel self-attention module can be embedded into the generative network of DIP as a plug-and-play module. The red dashed box in Figure 1 shows how the upsampling and spatial-channel self-attention modules are combined. Although ConvNets can reconstruct the original image with high fidelity, local convolutions over-smooth the noise and treat global information homogeneously, mainly because most regions of an image are smooth and unvarying. As shown in Figure 2, the defects of DIP at edge and texture regions are obvious, while TM-DIP improves these regions significantly. TM-DIP adds spatial-channel global attention on top of the convolutional smoothing of noise, which recovers much of the important edge and texture content of the noisy image. At the same time, convolutional smoothing limits the recovery ability of TM-DIP: it cannot reproduce the degradation itself, because the degradation is uncorrelated with the image content. This characteristic places TM-DIP between recovering edge and texture information and recovering degradation information, so that it restores image detail but not the degradation, which is the desired behavior for an image recovery task.

2.4. Triple Multi-Head Transposed Attention

TMTA also addresses a significant issue: the main limitation of transformers in image restoration is the huge computational cost of computing correlations between all pairs of pixels at high resolution. As shown in Figure 1, the intermediate features have been scaled down by ×16, so the computation of conventional full-pixel self-attention on them would already be reduced by ×256. Even so, for higher-resolution images the problem of high computational complexity remains. To this end, considering the information redundancy of full-pixel self-attention, TM-DIP adopts Triple Multi-Head Transposed Attention (TMTA), which decomposes the attention over feature pixels into three cooperating directions of self-attention: horizontal, vertical, and channel.
As shown in Figure 3, the input features first pass through the “Layer Norm + PDConv” layer to generate the locally enriched query (Q), key (K), and value (V). Layer Norm (LN) denotes standard layer normalization, and PDConv denotes the combination of Pointwise Convolution (PWConv) and Depthwise Convolution (DWConv). The query and key are then reshaped along three directions, yielding the horizontal pair ($Q_H$, $K_H$), the vertical pair ($Q_W$, $K_W$), and the channel pair ($Q_C$, $K_C$). Matrix multiplication of each pair generates three transposed attention matrices (Figure 4) of sizes $\mathbb{R}^{H \times H}$, $\mathbb{R}^{W \times W}$, and $\mathbb{R}^{C \times C}$, instead of the regular attention matrix $\mathbb{R}^{HW \times HW}$ over feature pixels [29,31]. Note that all three branches are derived from the same query (Q) and key (K) and therefore work synergistically. The TMTA process is defined as follows:
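$$X' = W_P \, \mathrm{Attention}(Q_s, K_s, V_s) + X,$$
$$\mathrm{Attention}(Q_s, K_s, V_s) = \mathrm{Concat}(A_H, A_W, A_C),$$
$$A_H = V_H \times \mathrm{Softmax}(K_H \times Q_H / a_H),$$
$$A_W = V_W \times \mathrm{Softmax}(K_W \times Q_W / a_W), \text{ and}$$
$$A_C = V_C \times \mathrm{Softmax}(K_C \times Q_C / a_C),$$
where $X$ and $X'$ denote the input and output features; $Q_i \in (\mathbb{R}^{WC \times H}, \mathbb{R}^{HC \times W}, \mathbb{R}^{HW \times C})$, $K_i \in (\mathbb{R}^{H \times WC}, \mathbb{R}^{W \times HC}, \mathbb{R}^{C \times HW})$, and $V_i \in (\mathbb{R}^{WC \times H}, \mathbb{R}^{HC \times W}, \mathbb{R}^{HW \times C})$ denote the horizontal, vertical, and channel reshapings of the generated query (Q), key (K), and value (V), respectively; and $a_i$ denotes a learnable scaling parameter that controls the magnitude of the dot product of $Q_i$ and $K_i$ before the softmax is applied. In the above expressions, $i \in \{H, W, C\}$.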
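The shapes in these equations can be checked with a small NumPy sketch. It is written under several assumptions that are not from the paper: a single head, a single (C, H, W) feature map, Q = K = V = X (the Layer Norm + PDConv projections are omitted), softmax taken over the last axis, fixed scales instead of learnable ones, and $W_P$ modeled as a 1×1 projection given by a (C, 3C) matrix. The helper names are hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along `axis`."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def directional_attention(q, k, v, axis, scale=1.0):
    """Transposed attention along one axis of a (C, H, W) tensor.

    With n the size of `axis` and m the product of the other two sizes,
    Q_i and V_i are (m, n), K_i is (n, m), and the attention map is (n, n).
    """
    n = q.shape[axis]

    def flatten(t):
        # Move the attended axis last and merge the other two: (m, n).
        return np.moveaxis(t, axis, -1).reshape(-1, n)

    q_i, k_i, v_i = flatten(q), flatten(k).T, flatten(v)
    attn = softmax(k_i @ q_i / scale, axis=-1)    # (n, n)
    out = v_i @ attn                               # (m, n)
    moved_shape = np.moveaxis(q, axis, -1).shape
    return np.moveaxis(out.reshape(moved_shape), -1, axis), attn


def tmta(x, w_p, scales=(1.0, 1.0, 1.0)):
    """X' = W_P Concat(A_H, A_W, A_C) + X for one (C, H, W) feature map."""
    a_h, _ = directional_attention(x, x, x, axis=1, scale=scales[0])
    a_w, _ = directional_attention(x, x, x, axis=2, scale=scales[1])
    a_c, _ = directional_attention(x, x, x, axis=0, scale=scales[2])
    concat = np.concatenate([a_h, a_w, a_c], axis=0)       # (3C, H, W)
    return np.einsum("oc,chw->ohw", w_p, concat) + x       # residual add


rng = np.random.default_rng(0)
C, H, W = 4, 6, 5
x = rng.standard_normal((C, H, W))
w_p = np.concatenate([np.eye(C)] * 3, axis=1) / 3.0   # placeholder projection
out = tmta(x, w_p)
_, attn_h = directional_attention(x, x, x, axis=1)
print(out.shape, attn_h.shape)
```

The largest intermediate matrix is H×H, W×W, or C×C; no HW×HW attention map is ever formed, which is the point of the decomposition.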

2.5. Efficiency of TMTA

In this subsection, we analyze why the proposed TMTA is efficient. We assume that the input features have shape (B, C, H, W), that the convolution kernel has size (K, K, C, C), and that the number of attention heads is set to 1 for ease of computation (the number of heads does not affect the comparison). The traditional ViT computes the correlation between every pair of pixels, so its cost grows with the number of pixels, while image processing requirements are steadily moving from low resolution to high resolution. Its self-attention is computed as the matrix product between the query of shape (B, H × W, C) and the key of shape (B, C, H × W), giving an attention map of shape (B, H × W, H × W). The computation of this part is:
$$\mathrm{OP}(\mathrm{ViT}) = B \times H \times W \times H \times W \times C.$$
In TM-DIP, the channel branch computes the matrix product between the query of shape (B, C, H × W) and the key of shape (B, H × W, C), giving an attention map of shape (B, C, C); the horizontal and vertical branches are computed analogously and give attention maps of shape (B, H, H) and (B, W, W). The corresponding total computation is:
$$\mathrm{OP}(\mathrm{TMTA}) = B \times (C^2 HW + H^2 WC + W^2 HC).$$
Dividing the first expression by the second gives $HW / (C + H + W)$, so the difference between the two lies in H × W versus C + H + W. For an image of common resolution such as 256 × 256, the number of channels C rarely exceeds 512, so C + H + W is much smaller than H × W. Therefore, in terms of computation and memory, TM-DIP is far more efficient than the full-pixel self-attention of ViT. In terms of performance, it loses little by discarding the redundant computation, mainly because images contain an extremely high percentage of homogeneous (background) pixels.
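As a worked instance of the two operation counts above, the following sketch evaluates them for the setting mentioned in the text (a 256 × 256 feature map with C = 512). The function names are hypothetical, and B = 1 is an assumption made only to keep the numbers readable.

```python
def ops_full_attention(b, c, h, w):
    """OP(ViT) = B * (H*W) * (H*W) * C for the (HW x HW) attention map."""
    return b * (h * w) * (h * w) * c


def ops_tmta(b, c, h, w):
    """OP(TMTA) = B * (C^2*H*W + H^2*W*C + W^2*H*C)."""
    return b * (c * c * h * w + h * h * w * c + w * w * h * c)


b, c, h, w = 1, 512, 256, 256
full = ops_full_attention(b, c, h, w)
tri = ops_tmta(b, c, h, w)
ratio = full / tri   # algebraically H*W / (C + H + W)
print(f"full-pixel: {full:.3e}  TMTA: {tri:.3e}  ratio: {ratio:.1f}")
```

For these values the ratio is 65536 / 1024 = 64, i.e., the tri-directional attention maps need 64 times fewer multiply-accumulate operations than the full-pixel attention map under this counting.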

3. Experiments

3.1. Experimental Setup

Implementation Details. To ensure a fair comparison, our method and the classical DIP methods are evaluated on the same classic test data for denoising, super-resolution, flash–no flash reconstruction, and inpainting [32]. Consistent with the original DIP, we use an encoder–decoder (“hourglass”) architecture (possibly with skip connections) for the generator in all experiments unless otherwise noted (Figure 1), with only small variations in the hyperparameters.
Evaluation Metrics. We report the peak signal-to-noise ratio (PSNR) and the time cost in seconds. PSNR is widely used in the denoising literature [21,22], although it has recently been argued that it is not an ideal metric because it favors oversmoothed results [33]. We use the publicly available pre-trained weights based on AlexNet by the authors. We additionally report the peak PSNR reached during the optimization of our method as a reference (denoted as “Ours*”).

3.2. Comparison with DIP on Denoising and Generic Reconstruction

Our approach is consistent with the original DIP in that it does not model the image degradation process that it must invert. It can therefore be applied in a “plug-and-play” manner to DIP-based image restoration tasks in which the degradation process is complex and/or unknown and real data for supervised training are difficult to obtain. The qualitative example in Figure 5 shows that TM-DIP is effective and outperforms DIP in detailed areas. As can be seen in the figure, TM-DIP quickly focuses on learning detailed regions at each stage of the iteration. Comparing the iterative generation stages of DIP and TM-DIP, the convolutional neural network focuses on learning globally homogeneous content, while the introduction of self-attention effectively enhances detail and clarity. In other words, self-attention focuses on approximating edge details, while the convolutional neural network focuses on smoothing for overall accuracy.
Figure 6 likewise demonstrates the learning efficiency of TM-DIP for detail information. TM-DIP pays greater attention to detail regions, whereas convolution-based DIP learns by global averaging, so its final result is blurred in the edge and texture regions.

3.3. Comparison with DIP on Super-Resolution

We similarly use the center crop of the generated image to compute the PSNR (Tables 1–4). Our method outperforms DIP in terms of accuracy but still falls short of learning-based methods. However, learning-based methods require a large amount of training time, which is a common problem with supervised learning. As can be seen from the tables, TM-DIP improves clearly over DIP.

3.4. Comparison with DIP on Inpainting

We also compare the results on inpainting. For text inpainting, the results of the two methods are visually similar (Figure 7). In Figure 8, we compare the deep priors corresponding to architectures of different depths, showing the visual differences between TM-DIP and DIP at each depth. The results recovered by TM-DIP still focus on non-smooth regions, while those recovered by DIP focus more on smooth regions.

3.5. Comparison with DIP on Flash–No Flash Reconstruction

This subsection presents the comparison on flash–no flash reconstruction (Figure 9). The results of TM-DIP are closer to the original image than those of DIP in terms of color tone and detail. The results of DIP are diffuse and lack proper contrast, with an overly uniform contrast across the image. In contrast, TM-DIP enhances specific areas while also adapting the overall contrast. In addition, because DIP learns by global averaging, it reproduces the overall color tone of the image less faithfully than TM-DIP.

3.6. Comparison with DIP on Time Cost

As shown in Figure 1, TM-DIP implements global self-attention learning by replacing the convolutional structure in DIP. The convolutional structure of DIP contains multiple basic operations rather than a single convolution, while the structure of the proposed TMTA is simple and its computational complexity approximates that of a single convolution. In particular, TMTA has a significant advantage over the conventional transformer in processing high-resolution images. This is because the cost of TMTA grows much more slowly with image resolution than that of the conventional transformer, whose cost grows with the square of the number of pixels, and this gap widens as the resolution increases. As shown in Figure 1, the computational efficiency of TM-DIP is even better than that of DIP, which validates the computational efficiency of our proposed method.
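To make the scaling argument concrete, the operation counts from Section 2.5 can be specialized to a square feature map with $H = W = s$. This specialization is an assumption introduced here only for illustration:

```latex
\begin{aligned}
\mathrm{OP}(\mathrm{ViT})  &= B\,C\,s^{4},\\
\mathrm{OP}(\mathrm{TMTA}) &= B\,\bigl(C^{2}s^{2} + 2\,C\,s^{3}\bigr),\\
\frac{\mathrm{OP}(\mathrm{ViT})}{\mathrm{OP}(\mathrm{TMTA})} &= \frac{s^{2}}{C + 2s}.
\end{aligned}
```

Under this reading, the channel term of TMTA is linear in the pixel count $s^2$ and the two spatial terms grow as $s^3$, whereas full-pixel attention grows as $s^4$. The ratio therefore keeps increasing with $s$; for $s = 256$ and $C = 512$ it equals 64.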

4. Conclusions

In this paper, we propose an efficient image denoising baseline called TM-DIP, which introduces the Triple Multi-Head Transposed Attention (TMTA) mechanism into DIP. TMTA decomposes the traditional full-pixel self-attention computation into horizontal, vertical, and channel self-attention. The horizontal and vertical branches capture the spatial information of the features, so TMTA can learn feature information across the spatial and channel dimensions. This multi-directional decomposition also makes the computation fast: TM-DIP improves on the time efficiency of DIP while significantly improving the visual quality of detail edges. Moreover, TMTA can be applied to other architectures as a standalone module with a certain guarantee of computational efficiency, and adding TMTA to subsequent DIP methods can further strengthen their results. The principles of global learning and efficient structured attention presented in this work can also facilitate the reconstruction and analysis of complex structured data from corrupted inputs.

Author Contributions

Conceptualization, W.L.; methodology, W.L. and Z.Z.; software, J.L.; validation, Z.Z. and Y.Y.; formal analysis, Y.Y.; investigation, Y.Y.; data curation, J.L.; writing—original draft, W.L. and Z.Z.; visualization, J.L.; funding acquisition, W.L., Z.Z. and Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Natural Science Foundation of Fujian Province, China (No. 2023J011119); the Natural Science Foundation of Fujian Province, China (No. 2023J011118); and the 2025 Yunnan Provincial Department of Education Scientific Research Fund Project, China (2025J1100).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Burger, H.C.; Schuler, C.J.; Harmeling, S. Image denoising: Can plain neural networks compete with BM3D? In 2012 IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2012; pp. 2392–2399.
  2. Zhang, K.; Zuo, W.; Chen, Y.; Meng, D.; Zhang, L. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Trans. Image Process. 2017, 26, 3142–3155.
  3. Chen, Y.; Pock, T. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1256–1272.
  4. Cheng, S.; Wang, Y.; Huang, H.; Liu, D.; Fan, H.; Liu, S. NBNet: Noise basis learning for image denoising with subspace projection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 4896–4906.
  5. Mao, X.; Shen, C.; Yang, Y.-B. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In Advances in Neural Information Processing Systems 29; NeurIPS: San Diego, CA, USA, 2016.
  6. Tai, Y.; Yang, J.; Liu, X.; Xu, C. MemNet: A persistent memory network for image restoration. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2017; pp. 4539–4547.
  7. Ulyanov, D.; Vedaldi, A.; Lempitsky, V. Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 9446–9454.
  8. Zhou, Y.; Jiao, J.; Huang, H.; Wang, Y.; Wang, J.; Shi, H.; Huang, T. When AWGN-based denoiser meets real noises. Proc. AAAI Conf. Artif. Intell. 2020, 34, 13074–13081.
  9. Anwar, S.; Khan, S.; Barnes, N. A deep journey into super-resolution: A survey. ACM Comput. Surv. (CSUR) 2020, 53, 60.
  10. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 295–307.
  11. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25; NeurIPS: San Diego, CA, USA, 2012.
  12. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2017; pp. 4681–4690.
  13. Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2017; pp. 136–144.
  14. Xia, B.; Hang, Y.; Tian, Y.; Yang, W.; Liao, Q.; Zhou, J. Efficient non-local contrastive attention for image super-resolution. Proc. AAAI Conf. Artif. Intell. 2022, 36, 2759–2767.
  15. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV); IEEE: Piscataway, NJ, USA, 2018; pp. 286–301.
  16. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2018; pp. 2472–2481.
  17. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual dense network for image restoration. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 2480–2495.
  18. Asperti, A.; Tonelli, V. Comparing the latent space of generative models. Neural Comput. Appl. 2023, 35, 3155–3172.
  19. Dosovitskiy, A.; Brox, T. Inverting convolutional networks with convolutional networks. arXiv 2015, arXiv:1506.02753.
  20. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144.
  21. Khawar, F.; Poon, L.; Zhang, N.L. Learning the structure of auto-encoding recommenders. In Proceedings of The Web Conference 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 519–529.
  22. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30; NeurIPS: San Diego, CA, USA, 2017.
  23. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.-H. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 5728–5739.
  24. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.-H.; Shao, L. Learning enriched features for real image restoration and enhancement. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part XXV 16; Springer: Cham, Switzerland, 2020; pp. 492–511.
  25. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.-H.; Shao, L. Multi-stage progressive image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 14821–14831.
  26. Ding, X.; Fan, H.; Gong, J. Towards generating network of bikeways from Mapillary data. Comput. Environ. Urban Syst. 2021, 88, 101632.
  27. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-style ConvNets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 13733–13742.
  28. Guo, S.; Yan, Z.; Zhang, K.; Zuo, W.; Zhang, L. Toward convolutional blind denoising of real photographs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2019; pp. 1712–1722.
  29. Wen, S.; Liu, W.; Yang, Y.; Huang, T.; Zeng, Z. Generating realistic videos from keyframes with concatenated GANs. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 2337–2348.
  30. Wu, B.; Dai, X.; Zhang, P.; Wang, Y.; Sun, F.; Wu, Y.; Tian, Y.; Vajda, P.; Jia, Y.; Keutzer, K. FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2019; pp. 10734–10742.
  31. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
  32. Abdelhamed, A.; Lin, S.; Brown, M.S. A high-quality denoising dataset for smartphone cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 1692–1700.
  33. Liu, Y.; Qin, Z.; Anwar, S.; Ji, P.; Kim, D.; Caldwell, S.; Gedeon, T. Invertible denoising network: A light solution for real noise removal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 13365–13374.
  34. Mahendran, A.; Vedaldi, A. Understanding deep image representations by inverting them. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2015; pp. 5188–5196.
  35. Glasner, D.; Bagon, S.; Irani, M. Super-resolution from a single image. In 2009 IEEE 12th International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2009; pp. 349–356.
  36. Lai, W.-S.; Huang, J.-B.; Ahuja, N.; Yang, M.-H. Deep Laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2017; pp. 624–632.
  37. Shocher, A.; Cohen, N.; Irani, M. “Zero-shot” super-resolution using deep internal learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 3118–3126.
  38. Petschnigg, G.; Szeliski, R.; Agrawala, M.; Cohen, M.; Hoppe, H.; Toyama, K. Digital photography with flash and no-flash image pairs. ACM Trans. Graph. (TOG) 2004, 23, 664–672.
Figure 1. Illustration of our proposed TM-DIP. We use the same “hourglass” (also known as “decoder-encoder”) architecture as the classical DIP. We sometimes add skip connections (yellow arrows). We use the first letter of each functional module to represent the corresponding functional unit. For example, upsampling is denoted by u, downsampling is denoted by d, skip connections are denoted by s, and the non-local self-attention module is denoted by t. ui, di, si, and ti correspond to the number of filters at depth i for the upsampling, downsampling, skip connections and non-local self-attention module respectively. nu[i], nd[i], and ns[i] correspond to the number of filters at depth i for the upsampling, downsampling and skip connections respectively. The values ku[i], kd[i], and ks[i] correspond to the respective kernel sizes.
Figure 2. Time cost and visual comparisons of our method with DIP on 4× image super-resolution. The x-axis shows the computational cost in iterations, i.e., the number of optimization steps the network performs to reconstruct a clean image from a degraded input in deep image prior (DIP).
Figure 3. Illustration of the Triple Multi-Head Transposed Attention module (TMTA). The attention of characteristic pixels is decomposed into three directions of self-attention for cooperative computation: horizontal self-attention, vertical self-attention, and channel self-attention. The “*” in the figure indicates a repeated block structure, where the corresponding components are duplicated to streamline the illustration.
Figure 4. Illustration of the attention module of Triple Multi-Head Transposed Attention. TMTA’s attention consists of a LayerNorm module, three depthwise separable convolutions (PDConv), six matrix multiplications, and three point-wise convolutions.
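To make the three-direction decomposition in Figures 3 and 4 concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single head, uses the input itself as query, key, and value (the LayerNorm, depthwise separable, and point-wise convolutions named in the caption are omitted), and averages the three branch outputs; the function names `axis_attention` and `tri_directional_attention` and the averaging step are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def axis_attention(x, axis):
    """Self-attention whose tokens are the slices of `x` along `axis`.

    Assumption: query = key = value = the flattened slices, with no
    learned projections. Each branch uses two matrix multiplications.
    """
    tokens = np.moveaxis(x, axis, 0)           # put the attended axis first
    n = tokens.shape[0]
    flat = tokens.reshape(n, -1)               # (n, d): one row per slice
    d = flat.shape[1]
    attn = softmax(flat @ flat.T / np.sqrt(d)) # (n, n) attention map
    out = (attn @ flat).reshape(tokens.shape)
    return np.moveaxis(out, 0, axis)           # restore the original layout


def tri_directional_attention(x):
    """Combine channel, vertical, and horizontal attention on a (C, H, W) map.

    Assumption: the three branches are fused by a plain average.
    """
    branches = [axis_attention(x, axis) for axis in range(3)]
    return sum(branches) / 3.0


if __name__ == "__main__":
    feature_map = np.random.default_rng(0).normal(size=(4, 8, 6))
    print(tri_directional_attention(feature_map).shape)
```

Under these assumptions, a feature map with C = 4, H = 8, and W = 6 needs attention maps with 4² + 8² + 6² = 116 entries in total, whereas pixel-level self-attention over the 48 spatial positions needs a single map with 48² = 2304 entries.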
Figure 5. Blind image denoising. The deep image prior is successful at recovering both man-made and natural patterns. TM-DIP progressively recovers detailed information as the iterations proceed, whereas DIP only approximates the result through slow global averaging, which leaves important detail areas deficient.
Figure 6. Blind restoration of a JPEG-compressed image (electronic zoom-in recommended). Our approach can restore an image with complex degradation (JPEG compression in this case).
Figure 7. Comparison of text inpainting.
Figure 8. Inpainting using different depths.
Figure 9. Reconstruction based on flash and no-flash image pair. The deep image prior allows us to obtain low-noise reconstruction with the lighting very close to the no-flash image. It is more successful at avoiding “leaks” of the lighting patterns from the flash pair than joint bilateral filtering [38] (cf. blue inset).
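Table 1. Detailed super-resolution PSNR (dB) comparison on Set14 at 4× scale.

| Method | Baboon | Barbara | Bridge | Coastguard | Comic | Face | Flowers | Foreman | Lenna | Man | Monarch | Pepper | Ppt3 | Zebra | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No prior | 22.24 | 24.89 | 23.94 | 24.62 | 21.06 | 29.99 | 23.75 | 29.01 | 28.23 | 24.84 | 25.76 | 28.71 | 20.26 | 21.69 | 24.93 |
| Bicubic | 22.44 | 24.15 | 24.47 | 25.53 | 21.59 | 31.34 | 25.33 | 29.45 | 29.84 | 25.7 | 27.45 | 30.63 | 21.78 | 24.01 | 26.05 |
| TV prior [34] | 22.34 | 24.78 | 24.46 | 25.78 | 21.95 | 31.34 | 25.91 | 30.63 | 29.76 | 25.94 | 28.46 | 31.32 | 22.75 | 24.52 | 26.42 |
| Glasner et al. [35] | 22.44 | 25.38 | 24.73 | 25.38 | 21.98 | 31.09 | 25.54 | 30.4 | 30.48 | 26.33 | 28.22 | 32.02 | 22.16 | 24.34 | 26.46 |
| DIP | 22.29 | 25.53 | 24.38 | 25.81 | 22.18 | 31.02 | 26.14 | 31.66 | 30.83 | 26.09 | 29.98 | 32.08 | 24.38 | 25.71 | 27.00 |
| Ours | 22.31 | 25.63 | 24.45 | 25.94 | 22.29 | 31.17 | 26.28 | 31.73 | 30.99 | 26.14 | 30.12 | 32.21 | 24.43 | 25.85 | 27.11 |
| SRResNet-MSE [12] | 23.00 | 26.08 | 25.52 | 26.31 | 23.44 | 32.71 | 28.13 | 33.8 | 32.42 | 27.43 | 32.82 | 34.28 | 26.56 | 26.95 | 28.53 |
| LapSRN [36] | 22.83 | 25.69 | 25.36 | 26.21 | 22.9 | 32.62 | 27.54 | 33.59 | 31.98 | 27.27 | 31.62 | 33.88 | 25.36 | 26.98 | 28.13 |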
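The tables in this section report PSNR per image. As a reminder of how such a value is obtained, here is a minimal pure-Python sketch using the standard definition PSNR = 10·log10(peak² / MSE). It assumes 8-bit images with a peak value of 255 passed as flat pixel sequences; the exact evaluation protocol behind the reported numbers (color space, border cropping) is not stated here, so the function name `psnr` and its defaults are illustrative assumptions only.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    Assumption: pixels are given as flat sequences of numbers and `peak`
    is the maximum possible pixel value (255 for 8-bit images).
    """
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    # Two pixels, each off by 10 gray levels: MSE = 100.
    print(round(psnr([100, 100], [110, 90]), 2))
```

Because PSNR is logarithmic in the mean squared error, a gain of a few tenths of a dB corresponds to a reduction in MSE of only a few percent.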
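Table 2. Detailed super-resolution PSNR (dB) comparison on Set14 at 8× scale.

| Method | Baboon | Barbara | Bridge | Coastguard | Comic | Face | Flowers | Foreman | Lenna | Man | Monarch | Pepper | Ppt3 | Zebra | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No prior | 21.09 | 23.04 | 21.78 | 23.63 | 18.65 | 27.84 | 21.05 | 25.62 | 25.42 | 22.54 | 22.91 | 25.34 | 18.15 | 18.85 | 22.56 |
| Bicubic | 21.28 | 23.44 | 22.24 | 23.65 | 19.25 | 28.79 | 22.06 | 25.37 | 26.27 | 23.06 | 23.18 | 26.55 | 18.62 | 19.59 | 23.09 |
| TV prior [34] | 21.3 | 23.72 | 22.3 | 23.82 | 19.5 | 28.84 | 22.5 | 26.07 | 26.74 | 23.53 | 23.71 | 27.56 | 19.34 | 19.89 | 23.48 |
| SelfExSR [37] | 21.37 | 23.9 | 22.28 | 24.17 | 19.79 | 29.48 | 22.93 | 27.01 | 27.72 | 23.83 | 24.02 | 28.63 | 20.09 | 20.25 | 23.96 |
| DIP | 21.38 | 23.94 | 22.2 | 24.21 | 19.86 | 29.52 | 22.86 | 27.87 | 27.93 | 23.57 | 24.86 | 29.18 | 20.12 | 20.62 | 24.15 |
| Ours | 21.49 | 24.07 | 22.31 | 24.42 | 19.97 | 29.71 | 22.95 | 27.94 | 28.06 | 23.75 | 24.98 | 29.31 | 20.15 | 20.71 | 24.37 |
| LapSRN [36] | 21.51 | 24.21 | 22.77 | 24.10 | 20.06 | 29.85 | 23.31 | 28.13 | 28.22 | 24.20 | 24.97 | 29.22 | 20.13 | 20.28 | 24.35 |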
Table 3. Detailed super-resolution PSNR (dB) comparison on Set5 at 4× scale.

| Method | Baby | Bird | Butterfly | Head | Woman | Avg |
|---|---|---|---|---|---|---|
| No prior | 30.16 | 27.67 | 19.82 | 29.98 | 25.18 | 26.56 |
| Bicubic | 31.78 | 30.2 | 22.13 | 31.34 | 26.75 | 28.44 |
| TV prior [34] | 31.21 | 30.43 | 24.38 | 31.34 | 26.93 | 28.85 |
| SelfExSR [37] | 32.24 | 31.1 | 22.36 | 31.69 | 26.85 | 28.84 |
| DIP | 31.49 | 31.8 | 26.23 | 31.04 | 28.93 | 29.89 |
| Ours | 32.25 | 31.95 | 26.45 | 31.17 | 29.21 | 30.21 |
| LapSRN [36] | 33.55 | 33.76 | 27.28 | 32.62 | 30.72 | 31.58 |
| SRResNet-MSE [12] | 33.66 | 35.1 | 28.41 | 32.73 | 30.6 | 32.1 |
Table 4. Detailed super-resolution PSNR (dB) comparison on Set5 at 8× scale.

| Method | Baby | Bird | Butterfly | Head | Woman | Avg |
|---|---|---|---|---|---|---|
| No prior | 26.28 | 24.03 | 17.64 | 27.94 | 21.37 | 23.45 |
| Bicubic | 27.28 | 25.28 | 17.74 | 28.82 | 22.74 | 24.37 |
| TV prior [34] | 27.93 | 25.82 | 18.4 | 28.87 | 23.36 | 24.87 |
| SelfExSR [37] | 28.45 | 26.48 | 18.8 | 29.36 | 24.05 | 25.42 |
| DIP | 28.28 | 27.09 | 20.02 | 29.55 | 24.5 | 25.88 |
| Ours | 28.41 | 27.22 | 20.13 | 29.71 | 24.67 | 26.05 |
| LapSRN [36] | 28.88 | 27.1 | 19.97 | 29.76 | 24.79 | 26.1 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Lin, W.; Zhang, Z.; Lin, J.; You, Y. Efficient Deep Image Prior with Spatial-Channel Attention Transformer. Mathematics 2026, 14, 1185. https://doi.org/10.3390/math14071185
