Efficient Deep Image Prior with Spatial-Channel Attention Transformer

Lin, Weiwei; Zhang, Zeqing; Lin, Jin; You, Ying

doi:10.3390/math14071185


¹ School of Big Data and Artificial Intelligence, Fujian Polytechnic Normal University, Fuzhou 350300, China

² School of Surveying and Information Engineering, West Yunnan University of Applied Sciences, Dali 671002, China

³ School of Informatics, Xiamen University, Xiamen 361005, China

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(7), 1185; https://doi.org/10.3390/math14071185

Submission received: 9 February 2026 / Revised: 19 March 2026 / Accepted: 27 March 2026 / Published: 1 April 2026

(This article belongs to the Special Issue Securing Software Through Mathematics and Domain-Specific Knowledge: Innovations and Applications)


Abstract

The deep image prior (DIP) shows that a randomly initialized network with a suitable architecture can solve inverse imaging problems simply by optimizing its parameters to reconstruct a single degraded image. However, the prior exploited by vanilla DIP relies on basic local convolutions, which ties the quality of the reconstruction to the limited generative capacity of the model. Image content depends not only on neighboring pixels but also on global color features and spatial distribution, so purely local convolutions cannot recover precise fine-grained details. Moreover, although DIP is unsupervised, it requires many optimization iterations per image; this consumes considerable computation and makes costly global attention hard to adopt. To address these problems, this article explores an efficient global prior module, a tri-directional multi-head self-attention mechanism, which learns pixel-wise correlations along three directions: horizontal, vertical, and channel-wise. We observe that global learning enhances the detail of edge pixels, yielding more vivid images and clearer textures, and thereby improves reconstruction quality on inverse imaging problems. We also show that tri-directional multi-head self-attention can stand in for the global perception of pixel-level self-attention while alleviating its computational redundancy, achieving efficient and high-quality inverse imaging. The method rests on global feature capture and efficient attention modeling, striking a balance between detail fidelity and computational practicality.

Keywords: image degradation; efficient deep image prior; spatial-channel attention transformer

MSC: 68T07

1. Introduction

Recent years have witnessed the widespread application and remarkable success of deep neural networks in numerous computer vision and communication tasks. However, within the domain of image denoising, traditional methods such as BM3D [1] historically outperformed early deep learning-based approaches until the advent of methods like DnCNN [2]. DnCNN surpassed traditional methods in handling synthetic Gaussian noise, albeit at the cost of requiring a large dataset of clean–noisy image pairs for training [3,4,5,6,7,8].

In contrast, the deep image prior (DIP) [7] framework represents a paradigm shift: it demonstrates that a randomly initialized convolutional network (often with an hourglass structure) can serve as a powerful implicit prior for various inverse imaging problems, including denoising, super-resolution [9,10,11,12,13], and general image restoration [14,15,16,17], without requiring any pre-collected clean–noisy image pairs. Despite its strengths, DIP exhibits notable limitations. Its performance on image denoising, even under synthetic Gaussian noise, is often significantly inferior to that of traditional methods such as BM3D unless meticulous early stopping is applied, which in turn requires access to clean ground-truth images. Subsequent variants such as the deep decoder attempted to mitigate this through stronger structural regularization, but their lower model complexity compromised denoising performance.

A deeper, more structural limitation underpins these challenges. Fundamentally, the existing DIP and its variants predominantly rely on convolutional neural networks (CNNs) as their backbone [18,19,20,21]. While effective in smoothing out noise, this choice leads to two interconnected core problems:

Recovery Quality Limitation. The inherently local receptive field of standard convolution operations limits CNNs’ ability to effectively capture long-range dependencies and model complex, non-local image features. This often results in the suboptimal restoration of fine edge details and high-frequency textures, which are crucial for perceptual image quality.

Efficiency Constraint in the DIP Framework. The DIP process is inherently optimization-based, involving thousands of forward/backward passes during iterative reconstruction. The iterative nature of DIP means that any increase in the per-iteration computational overhead of the backbone network compounds significantly into high overall time and resource costs. Therefore, improving the representational power (or “generative capability”) of the DIP network cannot simply come at the cost of drastically increased model complexity or computation, as this would be prohibitively expensive.

In essence, the core challenge we address is: How can we design a DIP network backbone that achieves better recovery quality (particularly for edge and texture details) while maintaining, or even improving, its computational efficiency to remain practical for the iterative DIP process?

Concurrently, the field has seen the rapid development of Vision Transformers, which offer superior long-range modeling through self-attention. However, the cost of standard global self-attention grows quadratically with the number of tokens (here, pixels), making it computationally heavy [22]. Recent efforts have produced efficient transformer variants. For instance, Restormer [23,24,25] computes self-attention across channels rather than across pixels to lower the computational overhead of restoration tasks. Building on this progress, we hypothesize that an efficient attention-based architecture can be made compatible with the strict efficiency demands of DIP, potentially overcoming the locality constraints of CNNs.

To overcome the dual limitations of CNNs in the DIP framework, this paper introduces Triple Multi-Head Transposed Attention (TMTA), a novel attention module, and integrates it into the DIP backbone. Instead of pixel-level global attention, our approach takes the channel attention of Restormer as a starting point and extends it into a decomposed spatial-channel attention strategy. TMTA learns self-attention across the channel dimension and the two spatial dimensions (horizontal and vertical), capturing richer global context while avoiding the computational redundancy of full pixel attention. This design allows our method to achieve significant performance gains without incurring higher computational complexity.

The main contributions of this work are threefold:

First, we propose an enhanced, unsupervised DIP method based on our novel TMTA module. Our optimization is applied directly to the original DIP backbone, offering a seamless and effective plug-in component for existing and future DIP-based approaches.

Second, we pioneer the integration of an efficient self-attention mechanism into the DIP framework, striking a competitive balance with CNN-based DIP in terms of efficiency while substantially improving restoration performance, as evidenced in our experiments.

Third, our experimental results show that the proposed approach improves both reconstruction performance and iteration time. It can therefore be regarded as an efficient, higher-performing variant within the broader DIP algorithm family.

2. Proposed Method

This section introduces an efficient DIP (TM-DIP) based on Triple Multi-Head Transposed Attention (TMTA). Our main goal is to bring global self-attention into the DIP task. Unlike previous DIP methods, TM-DIP counteracts the global smoothing of convolution while remaining computationally efficient, and it enhances edge and texture details through global self-attention learning. As a result, our method combines efficiency and performance. In the following subsections, we first review DIP and its backbone, then give an overview of the proposed TM-DIP, elaborate on its core component TMTA, and finally analyze the computational efficiency of TMTA.

2.1. Deep Image Prior (DIP)

Let a noisy image $y \in \mathbb{R}^N$ be modeled as:

$$y = x + n, \tag{1}$$

where $x \in \mathbb{R}^N$ is the noiseless image that one would like to recover and $n \in \mathbb{R}^N$ is i.i.d. Gaussian noise such that $n \sim \mathcal{N}(0, \sigma^2 I)$, where $I$ is the identity matrix. Denoising can be formulated as the problem of predicting the unknown $x$ from the known noisy observation $y$. Ulyanov et al. [7] argued that a network architecture by itself naturally encourages the restoration of the original image from a degraded image $y$, and named this effect the deep image prior (DIP). Specifically, DIP optimizes a convolutional neural network $h$ with parameters $\theta$ under a simple least-squares loss $L$:

$$\hat{\theta} = \arg\min_{\theta} L\big(h(\hat{n}; \theta),\, y\big), \tag{2}$$

where $\hat{n}$ is a random input that is independent of $y$. If $h(\cdot)$ has enough capacity (i.e., a sufficiently large number of parameters or architecture size) to fit the noisy image $y$, the output $h(\hat{n}; \theta)$ eventually equals $y$, noise included, which is not desirable. DIP therefore relies on early stopping, and the best results are selected by PSNR against the clean image.
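The optimization in Equation (2) and the PSNR-based early stopping can be illustrated with a deliberately small sketch. It does not use a ConvNet: the "generator" `h` below is a hypothetical stand-in (a fixed circular blur applied to the parameter vector), chosen only so that the loop runs with NumPy alone. The names `fit_with_early_stopping`, `KERNEL`, the 1-D test signal, and the step size are all assumptions made for illustration.

```python
import numpy as np

def psnr(ref, est, peak=1.0):
    """Peak signal-to-noise ratio in dB between a reference and an estimate."""
    mse = float(np.mean((ref - est) ** 2))
    return float("inf") if mse == 0.0 else 10.0 * np.log10(peak ** 2 / mse)

KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0  # symmetric, sums to 1

def h(theta):
    """Toy generator: circular blur of the parameter vector (NOT a ConvNet)."""
    return sum(w * np.roll(theta, s) for w, s in zip(KERNEL, range(-2, 3)))

def fit_with_early_stopping(y, x_clean, num_iters=300, lr=1.0):
    """Gradient descent on 0.5 * ||h(theta) - y||^2, keeping the best-PSNR output."""
    theta = np.zeros_like(y)
    best = {"psnr": -np.inf, "iter": -1, "output": None}
    losses = []
    for it in range(num_iters):
        residual = h(theta) - y
        losses.append(0.5 * float(np.sum(residual ** 2)))
        # The blur is symmetric, so its transpose is itself: grad = h(residual).
        theta = theta - lr * h(residual)
        out = h(theta)
        score = psnr(x_clean, out)
        if score > best["psnr"]:  # early stopping needs the clean image
            best = {"psnr": score, "iter": it, "output": out.copy()}
    return best, losses, psnr(x_clean, h(theta))

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 256, endpoint=False)
x = 0.5 + 0.4 * np.sin(2 * np.pi * 3 * t)   # clean 1-D "image"
y = x + rng.normal(0.0, 0.1, size=x.shape)  # Eq. (1): y = x + n
best, losses, final_psnr = fit_with_early_stopping(y, x)
print(best["iter"], best["psnr"], final_psnr)
```

The sketch makes the dependence explicit: selecting the stopping iteration by PSNR requires `x_clean`, which is exactly the practical limitation of early stopping noted in the Introduction.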

2.2. Backbone of Deep Image Prior (DIP)

The prior defined by Equation (2) is implicit and does not define a proper probability distribution in image space. However, it is possible to draw “samples” (in a loose sense) from this prior by taking random values of the parameters $\theta$ and looking at the generated image $h(\hat{n}; \theta)$. In other words, we can visualize the starting point of the optimization process of Equation (2) before fitting the parameters to the noisy image. DIP analyzes such “samples” from the deep priors captured by different hourglass-type architectures. The generative network chosen for the classical DIP, shown in Figure 1, is still used in subsequent studies and is therefore the most popular choice of generative ConvNet [26,27,28,29,30]. We instead optimize the generative network of DIP at its root, so that the improvement carries over to all DIP-based research efforts.

2.3. Overview of TM-DIP

As shown in Figure 1, TM-DIP has four main components: the downsampling module, the upsampling module, the skip connection module, and our proposed spatial-channel self-attention module. The downsampling, upsampling, and skip connection modules follow the structure of DIP, while the added spatial-channel self-attention module is embedded into the generative network of DIP as a plug-and-play module. The red dashed box in Figure 1 shows how the upsampling and spatial-channel self-attention modules are combined. Although ConvNets can reconstruct the original image with high fidelity, local convolutions oversmooth the noise and treat global information homogeneously, mainly because most regions of an image are smooth and unvarying. As shown in Figure 2, DIP has obvious defects at edges and textures, where TM-DIP improves significantly. By adding spatial-channel global attention on top of the convolutional smoothing of noise, TM-DIP recovers much of the important edge and texture information in the noisy image. At the same time, the remaining convolutional smoothing limits the recovery ability of TM-DIP: it cannot reproduce degradation patterns, which are independent of and uncorrelated with the image content. This characteristic places TM-DIP between recovering edge and texture information and recovering degradation information, so that it restores image detail but not the degradation, which is the desired behavior for image recovery.

2.4. Triple Multi-Head Transposed Attention

In addition, TMTA addresses a significant issue: the main limitation of transformers in image restoration is the huge computational cost of computing correlations between all pairs of pixels at high resolution. As shown in Figure 1, the intermediate features are downscaled by ×16, which already reduces the cost of conventional full-pixel self-attention by ×256. Even so, higher-resolution images still incur high computational complexity. To this end, and considering the information redundancy of full-pixel self-attention, TM-DIP adopts Triple Multi-Head Transposed Attention (TMTA), which decomposes the attention over feature pixels into three cooperating directions: horizontal self-attention, vertical self-attention, and channel self-attention.
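The quadratic relation between the ×16 and ×256 factors can be checked with a few lines of arithmetic. The sketch below assumes that ×16 refers to the pixel count (for example, 4× per side) and uses a 256 × 256 grid purely as an illustrative size; the helper name `full_attention_entries` is hypothetical.

```python
def full_attention_entries(height, width):
    """Entries in a full pixel-to-pixel attention map: (H*W) x (H*W)."""
    pixels = height * width
    return pixels * pixels

full_res = full_attention_entries(256, 256)  # attention on the original grid
reduced = full_attention_entries(64, 64)     # 16x fewer pixels (4x per side)
print(full_res // reduced)                   # 256
```

Because the attention map has one entry per pixel pair, any reduction in pixel count is squared, which is also why the cost climbs again so quickly once the resolution increases.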

As shown in Figure 3, the input features first pass through the “Layer Norm + PDConv” layer to generate the locally enriched query (Q), key (K), and value (V). Layer Norm (LN) denotes regular layer normalization, and PDConv denotes the combination of pointwise convolution (PWConv) and depthwise convolution (DWConv). The query and key are then reshaped along three directions, giving the horizontal $Q_H$ and $K_H$, the vertical $Q_W$ and $K_W$, and the channel $Q_C$ and $K_C$. Matrix multiplication of each pair produces three transposed attention matrices (Figure 4) in $\mathbb{R}^{H \times H}$, $\mathbb{R}^{W \times W}$, and $\mathbb{R}^{C \times C}$, instead of the regular attention matrix in $\mathbb{R}^{HW \times HW}$ over feature pixels [29,31]. All three are derived from the same query (Q) and key (K) and are therefore closely related to one another. The overall TMTA process is defined as follows:

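$$
\begin{aligned}
X' &= W_P\,\mathrm{Attention}(Q_s, K_s, V_s) + X,\\
\mathrm{Attention}(Q_s, K_s, V_s) &= \mathrm{Concat}(A_H, A_W, A_C),\\
A_H &= V_H \times \mathrm{Softmax}(K_H \times Q_H / a_H),\\
A_W &= V_W \times \mathrm{Softmax}(K_W \times Q_W / a_W),\\
A_C &= V_C \times \mathrm{Softmax}(K_C \times Q_C / a_C),
\end{aligned}
\tag{3}
$$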

where $X$ and $X'$ denote the input and output features; $Q_i \in (\mathbb{R}^{WC \times H}, \mathbb{R}^{HC \times W}, \mathbb{R}^{HW \times C})$, $K_i \in (\mathbb{R}^{H \times WC}, \mathbb{R}^{W \times HC}, \mathbb{R}^{C \times HW})$, and $V_i \in (\mathbb{R}^{WC \times H}, \mathbb{R}^{HC \times W}, \mathbb{R}^{HW \times C})$ denote the horizontal, vertical, and channel reshapings of the generated query (Q), key (K), and value (V), respectively; and $a_i$ denotes a learnable scaling parameter that controls the magnitude of the product of $K_i$ and $Q_i$ before the softmax function is applied. In the above expressions, $i \in \{H, W, C\}$.
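To make the shapes in Equation (3) concrete, the following is a minimal single-head NumPy sketch, not the authors' implementation. Several simplifications are assumptions: Q, K, and V are taken to be the input itself (the Layer Norm + PDConv projections are omitted), the scales $a_i$ are fixed constants rather than learned, the softmax is normalized over the last axis (the text does not state the axis), and $W_P$ is modeled as a random pointwise projection from 3C back to C channels. The function names are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along one axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def directional_attention(q, k, v, axis, scale):
    """One branch of Eq. (3): A_i = V_i x Softmax(K_i x Q_i / a_i).

    q, k, v have shape (C, H, W); `axis` selects the attended direction
    (0 = channel, 1 = H, 2 = W). The other two axes are flattened.
    """
    n = q.shape[axis]
    moved_shape = np.moveaxis(q, axis, -1).shape
    q_i = np.moveaxis(q, axis, -1).reshape(-1, n)    # (rest, n)
    k_i = np.moveaxis(k, axis, -1).reshape(-1, n).T  # (n, rest)
    v_i = np.moveaxis(v, axis, -1).reshape(-1, n)    # (rest, n)
    attn = softmax(k_i @ q_i / scale, axis=-1)       # (n, n) attention map
    out = (v_i @ attn).reshape(moved_shape)
    return np.moveaxis(out, -1, axis)                # back to (C, H, W)

def tmta(x, w_p, scales=(1.0, 1.0, 1.0)):
    """Single-head sketch of X' = W_P Concat(A_H, A_W, A_C) + X."""
    q = k = v = x  # assumption: identity projections instead of LN + PDConv
    a_h = directional_attention(q, k, v, axis=1, scale=scales[0])
    a_w = directional_attention(q, k, v, axis=2, scale=scales[1])
    a_c = directional_attention(q, k, v, axis=0, scale=scales[2])
    concat = np.concatenate([a_h, a_w, a_c], axis=0)   # (3C, H, W)
    projected = np.einsum("oc,chw->ohw", w_p, concat)  # pointwise projection
    return projected + x                               # residual connection

rng = np.random.default_rng(0)
C, H, W = 4, 6, 5
x = rng.normal(size=(C, H, W))
w_p = rng.normal(size=(C, 3 * C)) * 0.1
y = tmta(x, w_p)
print(y.shape)  # (4, 6, 5)
```

The largest intermediate in this sketch is an n × n map with n equal to H, W, or C, never an HW × HW map, which is the point of the decomposition.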

2.5. Efficiency of TMTA

In this subsection, we analyze why the proposed TMTA is efficient. We assume that the input features have shape (B, C, H, W), that the convolution kernel has size (K, K, C, C), and that the number of attention heads h is 1 for ease of computation (the value of h does not affect the comparison). The traditional ViT computes the correlation between every pair of pixels, so its cost grows with the number of pixels in the image, and this matters more and more as image processing moves from low to high resolution. Its self-attention is computed as the matrix product between the query of shape (B, H × W, C) and the key of shape (B, C, H × W), giving an attention map of shape (B, H × W, H × W). The cost of this step is:

$$\mathrm{OP}(\mathrm{ViT}) = B \times H \times W \times H \times W \times C. \tag{4}$$

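In contrast, the channel self-attention of TM-DIP is computed as the matrix product between the query of shape (B, C, H × W) and the key of shape (B, H × W, C), giving an attention map of shape (B, C, C); the horizontal and vertical branches likewise give attention maps of shape (B, H, H) and (B, W, W). The corresponding total cost is: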

$$\mathrm{OP}(\mathrm{TMTA}) = B \times (C^2 HW + H^2 WC + W^2 HC). \tag{5}$$

Dividing Equation (4) by Equation (5) gives a cost ratio of HW/(C + H + W), so the difference between the two lies in how H × W compares with C, H, and W. For an image of common resolution such as 256 × 256, the channel count C hardly exceeds 512, which is much smaller than H × W = 65,536. Therefore, in terms of computation and memory, TM-DIP is far more efficient than all-pixel ViT self-attention. In terms of performance, it loses little by dropping the redundant pairwise computation, mainly because an extremely high percentage of image pixels are homogeneous (background).
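As a quick numerical check of Equations (4) and (5), the operation counts can be evaluated directly. The setting below (B = 1, C = 512, H = W = 256) is just the example resolution mentioned in the text with an assumed upper-end channel count; the function names are hypothetical.

```python
def ops_full_attention(b, c, h, w):
    """Eq. (4): B * (H*W) * (H*W) * C."""
    return b * h * w * h * w * c

def ops_tmta(b, c, h, w):
    """Eq. (5): B * (C^2*H*W + H^2*W*C + W^2*H*C)."""
    return b * (c * c * h * w + h * h * w * c + w * w * h * c)

b, c, h, w = 1, 512, 256, 256
ratio = ops_full_attention(b, c, h, w) / ops_tmta(b, c, h, w)
print(ratio)  # 64.0, which equals H*W / (C + H + W)
```

Even with a channel count at the upper end of what the text considers typical, the count for full-pixel attention is 64 times larger in this setting; smaller C or larger images widen the gap.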

3. Experiments

3.1. Experimental Setup

Implementation Details. To ensure a fair comparison, our method and the classical DIP methods are evaluated on the same classic datasets, covering denoising, super-resolution, flash–no flash reconstruction, and inpainting [32]. Consistent with the original DIP, we use an encoder–decoder (“hourglass”) architecture (possibly with skip connections), as shown in Figure 1, for the generative network in all experiments unless otherwise noted, varying the hyperparameters by a small amount.

Evaluation Metrics. We report the peak signal-to-noise ratio (PSNR) and the running time in seconds. PSNR is widely used in the denoising literature [21,22], but it has recently been argued that it is not an ideal metric because it favors oversmoothed results [33]. We use the publicly available pre-trained weights based on AlexNet by the authors. We additionally report the peak PSNR reached during the optimization of our method as a reference (denoted as “Ours*”).
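For reference, PSNR follows the standard definition 10 · log10(peak² / MSE). The sketch below works on flat lists of pixel values and assumes intensities in [0, 1] (so `peak=1.0`); the evaluation code behind the reported numbers is not given in the article, so this is only an illustration of the metric.

```python
import math

def psnr(reference, estimate, peak=1.0):
    """PSNR in dB between two equally sized flat lists of pixel values."""
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

clean = [0.0, 0.25, 0.5, 0.75]
shifted = [v + 0.1 for v in clean]     # uniform error of 0.1, so MSE is about 0.01
print(round(psnr(clean, shifted), 2))  # 20.0
```

Because PSNR depends only on the mean squared error, it does not distinguish a sharp result from a blurred one with the same MSE, which is the caveat raised above.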

3.2. Comparison with DIP on Denoising and Generic Reconstruction

Our approach is consistent with the original DIP in that it does not need a model of the image degradation process that it inverts. It can therefore be applied in a “plug-and-play” manner to DIP-based image restoration tasks in which the degradation process is complex or unknown and real data for supervised training is difficult to obtain. The qualitative example in Figure 5 shows that TM-DIP is effective and outperforms DIP in detailed areas: TM-DIP quickly focuses on learning detailed regions at every stage of the iteration. Comparing the iterative generation stages of DIP and TM-DIP, the convolutional neural network concentrates on learning globally homogeneous content, while the introduction of self-attention enhances detail and clarity. In other words, self-attention focuses on approximating edge details, whereas the convolutional neural network focuses on smoothing for overall accuracy.

Figure 6 similarly demonstrates how efficiently TM-DIP learns detail information. TM-DIP pays greater attention to detail regions, whereas convolution-based DIP learns by global averaging, so its final result is blurred in edge and texture regions.

3.3. Comparison with DIP on Super-Resolution

As before, we use the center crop of the generated image to compute the PSNR (Table 1, Table 2, Table 3 and Table 4). Our method outperforms DIP in accuracy but remains below the learning-based methods. However, learning-based methods require a large amount of training time, which is a common drawback of supervised learning. As can be seen from the tables, TM-DIP clearly improves on DIP.

3.4. Comparison with DIP on Inpainting

We also compare the results on inpainting. For text inpainting (Figure 7), the two methods give visually similar results. In Figure 8, we compare the deep priors corresponding to different architecture levels and the resulting visual differences between TM-DIP and DIP. The results recovered by TM-DIP still focus on non-smooth regions, while the results recovered by DIP focus more on smooth regions.

3.5. Comparison with DIP on Flash–No Flash Reconstruction

Figure 9 compares the methods on flash–no flash reconstruction. The results of TM-DIP are closer to the original image than those of DIP in both color tone and detail. The DIP results are diffuse and lack proper contrast, with contrast that is nearly uniform across the image. In contrast, TM-DIP enhances specific areas while adapting the overall contrast. In addition, the global averaging behavior of DIP reproduces the overall color tone of the image less faithfully than TM-DIP.

3.6. Comparison with DIP on Time Cost

As shown in Figure 1, TM-DIP implements global self-attention learning by replacing the convolutional structure in DIP. As previously analyzed, the convolutional structure of DIP consists of multiple base operations rather than a single convolution, while the structure of our proposed TMTA is simple, with a computational complexity that approximates that of a single convolution. Notably, TMTA has a significant advantage over the conventional transformer when processing high-resolution images: the cost of TMTA is proportional to the image resolution, whereas the cost of the conventional transformer is proportional to its square, so the gap widens steadily as the resolution increases. As shown in Figure 1, the computational efficiency of TM-DIP is even better than that of DIP, which validates the computational efficiency of our proposed method.
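How the gap widens with resolution can be read off Equations (4) and (5) directly. The sketch below fixes B = 1 and a hypothetical channel count C = 64 and evaluates square inputs of increasing side length; these values are illustrative assumptions, not settings reported in the article. Under Equation (5), doubling the side length multiplies the TMTA count by a factor between 4 and 8 (depending on how C compares with H and W), against 16 for Equation (4).

```python
def ops_full_attention(c, h, w):
    """Eq. (4) with B = 1."""
    return h * w * h * w * c

def ops_tmta(c, h, w):
    """Eq. (5) with B = 1."""
    return c * c * h * w + h * h * w * c + w * w * h * c

C = 64  # hypothetical channel count
sides = [128, 256, 512, 1024]
advantage = [ops_full_attention(C, s, s) / ops_tmta(C, s, s) for s in sides]
growth_full = ops_full_attention(C, 512, 512) / ops_full_attention(C, 256, 256)
growth_tmta = ops_tmta(C, 512, 512) / ops_tmta(C, 256, 256)

for s, a in zip(sides, advantage):
    print(f"{s}x{s}: full attention needs {a:.1f}x the operations of TMTA")
print(f"doubling 256 -> 512: full x{growth_full:.0f}, TMTA x{growth_tmta:.2f}")
```

These are operation counts only; measured iteration times also depend on memory traffic and on the rest of the network, so they need not follow the same ratios.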

4. Conclusions

In this paper, we propose an efficient image denoising baseline called TM-DIP, which introduces the Triple Multi-Head Transposed Attention (TMTA) mechanism into DIP. TMTA decomposes the traditional all-pixel self-attention map into horizontal, vertical, and channel self-attention. The horizontal and vertical branches capture the spatial information of the features, so TMTA learns feature information across the full spatial-channel dimension. This multi-directional decomposition also makes the computation fast: TM-DIP improves on the time efficiency of DIP while markedly improving the visual quality of detail edges. Moreover, TMTA can be applied to other architectures as a standalone module while keeping computation efficient, and it can strengthen the results of subsequent DIP methods. The principles of global learning and efficient structured attention presented in this work can also facilitate the reconstruction and analysis of complex structured data from corrupted inputs.

Author Contributions

Conceptualization, W.L.; methodology, W.L. and Z.Z.; software, J.L.; validation, Z.Z. and Y.Y.; formal analysis, Y.Y.; investigation, Y.Y.; data curation, J.L.; writing—original draft, W.L. and Z.Z.; visualization, J.L.; funding acquisition, W.L., Z.Z. and Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Natural Science Foundation of Fujian Province, China (No. 2023J011119); the Natural Science Foundation of Fujian Province, China (No. 2023J011118); and the 2025 Yunnan Provincial Department of Education Scientific Research Fund Project, China (2025J1100).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Burger, H.C.; Schuler, C.J.; Harmeling, S. Image denoising: Can plain neural networks compete with BM3D. In 2012 IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2012; pp. 2392–2399.
2. Zhang, K.; Zuo, W.; Chen, Y.; Meng, D.; Zhang, L. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Trans. Image Process. 2017, 26, 3142–3155.
3. Chen, Y.; Pock, T. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1256–1272.
4. Cheng, S.; Wang, Y.; Huang, H.; Liu, D.; Fan, H.; Liu, S. Nbnet: Noise basis learning for image denoising with subspace projection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 4896–4906.
5. Mao, X.; Shen, C.; Yang, Y.-B. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In Advances in Neural Information Processing Systems 29; NeurIPS: San Diego, CA, USA, 2016.
6. Tai, Y.; Yang, J.; Liu, X.; Xu, C. Memnet: A persistent memory network for image restoration. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2017; pp. 4539–4547.
7. Ulyanov, D.; Vedaldi, A.; Lempitsky, V. Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 9446–9454.
8. Zhou, Y.; Jiao, J.; Huang, H.; Wang, Y.; Wang, J.; Shi, H.; Huang, T. When AWGN-based denoiser meets real noises. Proc. AAAI Conf. Artif. Intell. 2020, 34, 13074–13081.
9. Anwar, S.; Khan, S.; Barnes, N. A deep journey into super-resolution: A survey. ACM Comput. Surv. (CSUR) 2020, 53, 60.
10. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 295–307.
11. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25; NeurIPS: San Diego, CA, USA, 2012.
12. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2017; pp. 4681–4690.
13. Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2017; pp. 136–144.
14. Xia, B.; Hang, Y.; Tian, Y.; Yang, W.; Liao, Q.; Zhou, J. Efficient Non-Local Contrastive Attention for Image Super-Resolution. Proc. AAAI Conf. Artif. Intell. 2022, 36, 2759–2767.
15. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV); IEEE: Piscataway, NJ, USA, 2018; pp. 286–301.
16. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2018; pp. 2472–2481.
17. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual dense network for image restoration. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 2480–2495.
18. Asperti, A.; Tonelli, V. Comparing the latent space of generative models. Neural Comput. Appl. 2023, 35, 3155–3172.
19. Dosovitskiy, A.; Brox, T. Inverting convolutional networks with convolutional networks. arXiv 2015, arXiv:1506.02753.
20. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144.
21. Khawar, F.; Poon, L.; Zhang, N.L. Learning the structure of auto-encoding recommenders. In Proceedings of The Web Conference 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 519–529.
22. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30; NeurIPS: San Diego, CA, USA, 2017.
23. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.-H. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 5728–5739.
24. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.-H.; Shao, L. Learning enriched features for real image restoration and enhancement. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part XXV 16; Springer: Cham, Switzerland, 2020; pp. 492–511.
25. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.-H.; Shao, L. Multi-stage progressive image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 14821–14831.
26. Ding, X.; Fan, H.; Gong, J. Towards generating network of bikeways from Mapillary data. Comput. Environ. Urban Syst. 2021, 88, 101632.
27. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 13733–13742.
28. Guo, S.; Yan, Z.; Zhang, K.; Zuo, W.; Zhang, L. Toward convolutional blind denoising of real photographs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2019; pp. 1712–1722.
29. Wen, S.; Liu, W.; Yang, Y.; Huang, T.; Zeng, Z. Generating realistic videos from keyframes with concatenated GANs. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 2337–2348.
30. Wu, B.; Dai, X.; Zhang, P.; Wang, Y.; Sun, F.; Wu, Y.; Tian, Y.; Vajda, P.; Jia, Y.; Keutzer, K. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2019; pp. 10734–10742.
31. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
32. Abdelhamed, A.; Lin, S.; Brown, M.S. A high-quality denoising dataset for smartphone cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 1692–1700.
33. Liu, Y.; Qin, Z.; Anwar, S.; Ji, P.; Kim, D.; Caldwell, S.; Gedeon, T. Invertible denoising network: A light solution for real noise removal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 13365–13374.
34. Mahendran, A.; Vedaldi, A. Understanding deep image representations by inverting them. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2015; pp. 5188–5196.
35. Glasner, D.; Bagon, S.; Irani, M. Super-resolution from a single image. In 2009 IEEE 12th International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2009; pp. 349–356.
36. Lai, W.-S.; Huang, J.-B.; Ahuja, N.; Yang, M.-H. Deep laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2017; pp. 624–632.
37. Shocher, A.; Cohen, N.; Irani, M. “zero-shot” super-resolution using deep internal learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 3118–3126.
38. Petschnigg, G.; Szeliski, R.; Agrawala, M.; Cohen, M.; Hoppe, H.; Toyama, K. Digital photography with flash and no-flash image pairs. ACM Trans. Graph. (TOG) 2004, 23, 664–672.

Figure 1. Illustration of our proposed TM-DIP. We use the same “hourglass” (also known as “decoder-encoder”) architecture as the classical DIP and sometimes add skip connections (yellow arrows). Each functional unit is denoted by the first letter of its module: u for upsampling, d for downsampling, s for skip connections, and t for the non-local self-attention module. u_i, d_i, s_i, and t_i correspond to the number of filters at depth i for the upsampling, downsampling, skip connection, and non-local self-attention modules, respectively. n_u[i], n_d[i], and n_s[i] correspond to the number of filters at depth i for the upsampling, downsampling, and skip connections, respectively. The values k_u[i], k_d[i], and k_s[i] correspond to the respective kernel sizes.

Figure 2. Time cost and visual comparisons of our method with DIP on 4× image super-resolution. The x-axis shows the computational cost in iterations, i.e., the number of optimization steps the network performs to reconstruct a clean image from the degraded input.

Figure 3. Illustration of the Triple Multi-Head Transposed Attention module (TMTA). Attention over feature pixels is decomposed into self-attention along three directions that are computed cooperatively: horizontal self-attention, vertical self-attention, and channel self-attention. The “*” in the figure marks a repeated block structure; the duplicated components are drawn once to simplify the illustration.

Figure 4. Illustration of the attention module of Triple Multi-Head Transposed Attention (TMTA). The attention module consists of a LayerNorm module, three depthwise separable convolutions (PDConv), six matrix multiplications, and three point-wise convolutions.
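
To make the decomposition in Figures 3 and 4 concrete, the following NumPy sketch applies self-attention separately along the channel, vertical, and horizontal axes of a feature map and averages the three results. It is a minimal illustration under stated assumptions, not the TMTA implementation: the LayerNorm, PDConv, and point-wise convolution layers and the multi-head split are omitted, queries, keys, and values are all taken to be the input itself, and the averaging step and the function names (`axis_attention`, `tri_directional_attention`, `attention_map_entries`) are hypothetical choices made only for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def axis_attention(x, axis):
    """Self-attention whose tokens are the slices of `x` along `axis`.

    Assumption: Q = K = V = x (no learned projections), single head.
    """
    n = x.shape[axis]
    # Bring the attended axis to the front and flatten the rest: (n, d).
    tokens = np.moveaxis(x, axis, 0).reshape(n, -1)
    d = tokens.shape[1]
    attn = softmax(tokens @ tokens.T / np.sqrt(d))  # (n, n) attention map
    out = attn @ tokens                             # (n, d) weighted sum
    # Restore the original axis order.
    rest = tuple(s for i, s in enumerate(x.shape) if i != axis)
    return np.moveaxis(out.reshape((n,) + rest), 0, axis)


def tri_directional_attention(x):
    """Average of channel, vertical, and horizontal attention for x of shape (C, H, W).

    Averaging is an assumption of this sketch, not a detail taken from the paper.
    """
    channel = axis_attention(x, 0)     # C x C attention map
    vertical = axis_attention(x, 1)    # H x H attention map
    horizontal = axis_attention(x, 2)  # W x W attention map
    return (channel + vertical + horizontal) / 3.0


def attention_map_entries(c, h, w):
    """Attention-map sizes: (three directional maps, one pixel-level map)."""
    return c * c + h * h + w * w, (h * w) ** 2


features = np.random.default_rng(0).standard_normal((8, 16, 16))
output = tri_directional_attention(features)
print(output.shape, attention_map_entries(8, 16, 16))
```

Each direction in this sketch uses two matrix products (the attention scores and the weighted sum), six in total. For an 8 × 16 × 16 feature map, the three directional attention maps hold 576 entries altogether, against 65,536 for a single pixel-level map over all 256 positions.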

Figure 5. Blind image denoising. The deep image prior is successful at recovering both man-made and natural patterns. TM-DIP recovers progressively more detail as the number of iterations increases, whereas DIP only approximates the result through slow global averaging, which leaves important detail regions deficient.

Figure 6. Blind restoration of a JPEG-compressed image (electronic zoom-in recommended). Our approach can restore an image with complex degradation (JPEG compression in this case).

Figure 7. Comparison of text inpainting.

Figure 8. Inpainting using different depths.

Figure 9. Reconstruction based on a flash and no-flash image pair. The deep image prior allows us to obtain a low-noise reconstruction with lighting very close to that of the no-flash image. It is more successful at avoiding “leaks” of the lighting patterns from the flash pair than joint bilateral filtering [38] (cf. blue inset).

Table 1. Detailed super-resolution PSNR (dB) comparison on Set14 (4×).

Method	Baboon	Barbara	Bridge	Coastguard	Comic	Face	Flowers	Foreman	Lenna	Man	Monarch	Pepper	Ppt3	Zebra	Avg
No prior	22.24	24.89	23.94	24.62	21.06	29.99	23.75	29.01	28.23	24.84	25.76	28.71	20.26	21.69	24.93
Bicubic	22.44	24.15	24.47	25.53	21.59	31.34	25.33	29.45	29.84	25.70	27.45	30.63	21.78	24.01	26.05
TV prior [34]	22.34	24.78	24.46	25.78	21.95	31.34	25.91	30.63	29.76	25.94	28.46	31.32	22.75	24.52	26.42
Glasner et al. [35]	22.44	25.38	24.73	25.38	21.98	31.09	25.54	30.40	30.48	26.33	28.22	32.02	22.16	24.34	26.46
DIP	22.29	25.53	24.38	25.81	22.18	31.02	26.14	31.66	30.83	26.09	29.98	32.08	24.38	25.71	27.00
Ours	22.31	25.63	24.45	25.94	22.29	31.17	26.28	31.73	30.99	26.14	30.12	32.21	24.43	25.85	27.11
SRResNet-MSE [12]	23.00	26.08	25.52	26.31	23.44	32.71	28.13	33.80	32.42	27.43	32.82	34.28	26.56	26.95	28.53
LapSRN [36]	22.83	25.69	25.36	26.21	22.90	32.62	27.54	33.59	31.98	27.27	31.62	33.88	25.36	26.98	28.13

Table 2. Detailed super-resolution PSNR (dB) comparison on Set14 (8×).

Method	Baboon	Barbara	Bridge	Coastguard	Comic	Face	Flowers	Foreman	Lenna	Man	Monarch	Pepper	Ppt3	Zebra	Avg
No prior	21.09	23.04	21.78	23.63	18.65	27.84	21.05	25.62	25.42	22.54	22.91	25.34	18.15	18.85	22.56
Bicubic	21.28	23.44	22.24	23.65	19.25	28.79	22.06	25.37	26.27	23.06	23.18	26.55	18.62	19.59	23.09
TV prior [34]	21.30	23.72	22.30	23.82	19.50	28.84	22.50	26.07	26.74	23.53	23.71	27.56	19.34	19.89	23.48
SelfExSR [37]	21.37	23.90	22.28	24.17	19.79	29.48	22.93	27.01	27.72	23.83	24.02	28.63	20.09	20.25	23.96
DIP	21.38	23.94	22.20	24.21	19.86	29.52	22.86	27.87	27.93	23.57	24.86	29.18	20.12	20.62	24.15
Ours	21.49	24.07	22.31	24.42	19.97	29.71	22.95	27.94	28.06	23.75	24.98	29.31	20.15	20.71	24.37
LapSRN [36]	21.51	24.21	22.77	24.10	20.06	29.85	23.31	28.13	28.22	24.20	24.97	29.22	20.13	20.28	24.35

Table 3. Detailed super-resolution PSNR (dB) comparison on Set5 (4×).

Method	Baby	Bird	Butterfly	Head	Woman	Avg
No prior	30.16	27.67	19.82	29.98	25.18	26.56
Bicubic	31.78	30.20	22.13	31.34	26.75	28.44
TV prior [34]	31.21	30.43	24.38	31.34	26.93	28.85
SelfExSR [37]	32.24	31.10	22.36	31.69	26.85	28.84
DIP	31.49	31.80	26.23	31.04	28.93	29.89
Ours	32.25	31.95	26.45	31.17	29.21	30.21
LapSRN [36]	33.55	33.76	27.28	32.62	30.72	31.58
SRResNet-MSE [12]	33.66	35.10	28.41	32.73	30.60	32.10

Table 4. Detailed super-resolution PSNR (dB) comparison on Set5 (8×).

Method	Baby	Bird	Butterfly	Head	Woman	Avg
No prior	26.28	24.03	17.64	27.94	21.37	23.45
Bicubic	27.28	25.28	17.74	28.82	22.74	24.37
TV prior [34]	27.93	25.82	18.40	28.87	23.36	24.87
SelfExSR [37]	28.45	26.48	18.80	29.36	24.05	25.42
DIP	28.28	27.09	20.02	29.55	24.50	25.88
Ours	28.41	27.22	20.13	29.71	24.67	26.05
LapSRN [36]	28.88	27.10	19.97	29.76	24.79	26.10
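
For reference, PSNR values such as those in Tables 1–4 are conventionally computed from the mean squared error between the ground-truth and reconstructed images. The sketch below assumes images scaled to [0, 1] (peak value 1.0). The colour space, border cropping, and peak value used for the tables are not stated here, so the function `psnr` and its `peak` parameter are illustrative assumptions rather than the exact evaluation protocol.

```python
import numpy as np


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape.

    Assumption: pixel values lie in [0, peak]; no cropping or colour conversion.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)


ground_truth = np.zeros((32, 32))
reconstruction = ground_truth + 0.1  # uniform error of 0.1
print(round(psnr(ground_truth, reconstruction), 2))
```

Under this definition, a uniform error of 0.1 on a [0, 1] image gives 20 dB, and every tenfold reduction in mean squared error adds 10 dB.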

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
