Wave-DIP: Unsupervised Image Decomposition Fusing Wavelet Multi-Scale Representation and Deep Image Prior

Mao, Zirui; Feng, Liwen; Xu, Quanyou; Liu, Yihang

doi:10.3390/sym18050854


School of Mathematics and Statistics, Henan University of Science and Technology, Luoyang 471023, China

* Author to whom correspondence should be addressed.

Symmetry 2026, 18(5), 854; https://doi.org/10.3390/sym18050854 (registering DOI)

Submission received: 23 April 2026 / Revised: 10 May 2026 / Accepted: 15 May 2026 / Published: 18 May 2026

(This article belongs to the Section Computer)


Abstract

To address the texture residuals and loss of structural detail often encountered in traditional image decomposition methods, this paper proposes an unsupervised decomposition model that integrates the wavelet transform with the Deep Image Prior (DIP). Leveraging the multi-scale and multi-directional characteristics of the wavelet transform, the model captures the structural information of the cartoon component through a sparsity constraint on its wavelet coefficients. Meanwhile, it exploits the unsupervised nature of DIP together with a low-rank constraint to extract texture details. The model is solved via the Alternating Direction Method of Multipliers (ADMM). Experimental results on multiple test images show that, compared with existing methods, the proposed model separates image structure and texture more thoroughly and yields visually cleaner decompositions.

Keywords: image decomposition; deep image prior; wavelet transform; low-rank constraint; alternating direction method of multipliers

1. Introduction

Image decomposition is a fundamental research topic in image processing and computer vision. Its goal is to decompose an observed image $f$ into two components with distinct morphological characteristics: a cartoon component $u$ (also known as the structural component) and a texture component $v$. The cartoon component describes the large-scale structural information in the image, such as smooth regions and sharp edges, whereas the texture component contains small-scale oscillatory details, such as fabric textures and periodic patterns. This decomposition is widely used in areas such as object recognition, image segmentation, and image inpainting.

Research on image decomposition can be traced back to the development of variational methods. The ROF model proposed by Rudin, Osher, and Fatemi [1] pioneered the paradigm of image processing based on Total Variation (TV). This model decomposes an image into a cartoon component in the space of functions of bounded variation and an oscillatory component in the $L_{2}$ space. The TV regularization term imposes a constraint on the $L_{1}$ norm of the image gradient, enabling smoothing while preserving edges. However, it tends to produce piecewise constant results, easily losing fine texture details. To overcome this limitation, Meyer [2] introduced the $G$ norm, which is more suitable for describing oscillatory patterns, to replace the $L_{2}$ norm, establishing a more accurate texture modeling framework. However, because the $G$ norm involves the $L_{\infty}$ norm, its numerical solution is challenging. Vese and Osher [3] subsequently used the $L_{p}$ norm to approximate the Meyer model, proposing the VO model, which is straightforward to solve numerically. Osher, Sole, and Vese [4] further considered the special case of $p = 2$, proposing the OSV model based on TV. Aujol et al. [5,6] systematically studied energy function spaces suitable for different types of textures, providing theoretical guidance for parameter selection in variational decomposition models.

With the advancement of research, scholars have attempted to introduce more mathematical tools to improve decomposition performance. Schaeffer and Osher [7] first introduced low-rank priors into texture description, proposing a cartoon-texture separation model based on sparse low-rank decomposition by imposing nuclear norm constraints on image patches. Ono et al. [8] adopted block-wise nuclear norms to characterize texture components, constructing a decomposition model capable of handling various degradation types. Zhang et al. [9] combined the TV norm and global nuclear norm to represent cartoon and texture components, respectively, proposing a decomposition model based on low-rank texture priors that performs excellently when images possess well-defined global structures. In addition, methodologies such as multi-scale decomposition [10,11,12,13], dictionary learning [14] and wavelet transforms [15] have continuously enriched the theoretical framework of image decomposition. As a classic multi-resolution analysis tool, the wavelet transform can decompose an image into sub-bands of different scales and directions. Its coefficients possess natural sparsity. The low-frequency sub-bands capture the general structure of the image, while the high-frequency sub-bands contain detailed information in various directions. This multi-scale and multi-directional decomposition approach is particularly suitable for describing texture components with oscillatory characteristics. However, although combining low-rank priors with traditional models solves some texture separation problems, it remains limited by the expressive power of hand-crafted regularizers.

To overcome the expressive limitations of traditional hand-crafted regularizers, researchers have begun to shift towards data-driven deep learning methods. However, supervised learning approaches typically require a large amount of paired training data, which is particularly difficult to obtain for image decomposition tasks. This is because clean ground truth for cartoon-texture separation is often unavailable in real-world scenarios. In recent advances in image decomposition, unsupervised and self-supervised learning methods have continued to achieve significant breakthroughs. Liang et al. [16] proposed Fusion from Decomposition, which achieves multi-modal image fusion through self-supervised decomposition. Its core unsupervised component separation strategy provides a new perspective for cartoon-texture decomposition. Liu et al. [17] proposed CoCoNet, which achieves finer feature characterization in image component separation by coupling contrastive learning with multi-level feature integration. These studies collectively indicate that unsupervised deep learning is becoming an important development direction in this field.

The Deep Image Prior (DIP) [18] has emerged as a powerful alternative to data-driven methods, demonstrating that the structure of a convolutional neural network itself acts as a sufficient prior for image restoration tasks like denoising and inpainting. Unlike traditional deep learning, DIP requires no external training data, relying instead on the network’s implicit bias towards natural image statistics. For image decomposition, this characteristic is particularly valuable: the network’s inherent tendency to fit structured patterns allows it to effectively separate components with high self-similarity (such as textures) from complex mixtures. The “Double-DIP” framework proposed by Gandelsman et al. [19] systematically applied this idea to image decomposition for the first time. By coupling multiple DIP networks to generate different image layers, and imposing reconstruction constraints and separation losses, this method can perform various decomposition tasks such as image dehazing, foreground/background segmentation, and watermark removal under unsupervised conditions. This work verified the universality and effectiveness of DIP in image decomposition.

Kim et al. [20] further integrated deep variational priors into the traditional TV-$L_{1}$ model. Through a Plug-and-Play approach, they used convolutional neural networks to learn the prior distribution of structural images, successfully distinguishing high-amplitude details from structural edges. Zhou et al. [21] proposed a structure and texture-aware decomposition method. By using deep neural networks to uniformly optimize the decomposition objective function, they constructed a self-supervised learning mechanism, achieving model training and optimization without the need for paired ground truth labels. Cascarano et al. [22] introduced spatially adaptive weighted TV regularization into the DIP framework, solved it via ADMM, and adaptively updated local regularization parameters using image gradient information, achieving better results than standard DIP in image restoration tasks. Nevertheless, these methods often require pre-trained networks or rely on specific forms of variational models. Addressing these issues, Guennec et al. [23] proposed a joint structure-texture modeling framework, which performs joint regularization on structural and texture components as a whole. By embedding deep neural networks through a Plug-and-Play framework, it can still generalize effectively to natural images after training on synthetic data, providing new insights for overcoming the limitations of traditional decomposition models.

Building on the advantages of DIP, Xu et al. [24] proposed a decomposition model that combines a low-rank prior with DIP. This model uses an adaptive weighting mechanism to better preserve edge information and achieve good decomposition results. However, the TV regularizer is essentially a first-order sparse constraint based on spatial gradients. It tends to generate piecewise constant cartoon components, which can easily cause staircasing effects in smooth regions. It also has a limited ability to represent complex textures with multi-scale and directional features. When image textures show periodic or fine structures, the TV term might mistakenly treat some texture as structure, leaving it in the cartoon part, or over-smooth the image, leading to the loss of texture details.

Meanwhile, the combination of wavelet transforms and deep learning has attracted widespread attention. For instance, Nguyen et al. [25] proposed combining sparse low-rank priors with DIP, utilizing the 2D discrete wavelet transform to obtain sparse representations, and achieved excellent results in hyperspectral image denoising. In the field of image restoration, the multi-wavelet guided deep prior method proposed by Zhang et al. [26] obtained stronger prior information by integrating the structural representation capability of wavelet transforms with the learning ability of deep networks. Recently, the WTConv wavelet convolutional layer proposed by Fogel et al. [27] achieved a very large receptive field by stacking wavelet decompositions, while also enhancing the response to low-frequency information, further verifying the unique advantage of wavelet transforms in representing low-frequency structures. Ramamonjisoa et al. [28] applied wavelet decomposition to single-image depth prediction, demonstrating that wavelet coefficients can be learned without direct supervision and can significantly reduce the computational cost of the decoder. Recently, the interpretable deep image decomposition framework proposed by Gao et al. [29] improved the interpretability of decomposition results while ensuring model generalization by combining hierarchical Bayesian modeling with deep learning. In addition, some studies have begun to explore combining multi-scale decomposition with unsupervised learning, such as the multi-branch autoencoder structure proposed by Günaydın and Sen [30], which decomposes images into different components like smooth, detail, and residual, further verifying the potential of combining multi-scale decomposition with unsupervised learning.

Inspired by research on wavelet-domain ADMM deep networks and multi-wavelet guided deep priors, this paper proposes introducing wavelet transforms into the DIP framework. We replace the original TV regularizer with the magnitude of wavelet coefficients to build a new image decomposition model. This model aims to use the ability of wavelets to distinguish between structure and texture to better describe the sparse structure of the cartoon component. This allows for a more thorough separation of texture from the image, effectively alleviating the staircasing effects and texture residuals caused by traditional TV regularizers.

It should be noted that the target tasks of the aforementioned wavelet-based deep prior methods are fundamentally different from the cartoon-texture decomposition addressed in this paper. Moreover, in those existing frameworks, wavelets typically serve as a pre-processing transform or as part of the network architecture, rather than as an explicit sparse constraint to replace TV regularization. Furthermore, PnP frameworks rely on external pre-trained models, whereas this paper employs a single-image-specific, untrained DIP. To the best of our knowledge, no prior work has incorporated the sparse regularization of wavelet coefficients as a direct replacement for TV within a DIP + low-rank decomposition framework to solve the cartoon-texture disentanglement problem. This paper fills this gap and, on this basis, designs a global adaptive weighting strategy in the wavelet domain.

The remainder of this paper is organized as follows: Section 2 introduces the related works that are closely associated with the model proposed in this paper. Section 3 presents our proposed model and provides the algorithm for solving this new model. Section 4 describes the numerical experimental setup and comparative experimental results. Section 5 presents the conclusions.

2. Related Work

The variational methods for image decomposition can be traced back to the ROF model proposed by Rudin, Osher, and Fatemi [1], which formulates the image decomposition problem as:

\min_{u} \int_{Ω} |\nabla u| d x + λ \int_{Ω} {|f - u|}^{2} d x,

(1)

where the first term is the Total Variation (TV) regularization term, which achieves edge-preserving smoothing by penalizing image gradients, and the second term is the fidelity term, which keeps the cartoon component $u$ close to the original image $f$. The ROF model constrains the cartoon component to the space of functions of bounded variation, while the texture component $v = f - u$ lies in the $L_{2}$ space. Although this model can effectively preserve edges, it tends to over-smooth fine textures by treating them as noise.
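As a rough illustration of Equation (1), the sketch below evaluates a discrete version of the ROF energy with NumPy. The forward-difference discretization, the isotropic gradient magnitude, and the helper names `total_variation` and `rof_energy` are assumptions made for this example; they are not taken from [1]. The toy image is a step edge plus a checkerboard, chosen only to show that fine oscillations inflate the TV term far more than a single sharp edge does.

```python
import numpy as np

def total_variation(u):
    """Isotropic discrete TV: sum of gradient magnitudes over all pixels."""
    # Forward differences; the border is replicated, so the last difference is zero.
    dx = np.diff(u, axis=1, append=u[:, -1:])
    dy = np.diff(u, axis=0, append=u[-1:, :])
    return float(np.sum(np.sqrt(dx ** 2 + dy ** 2)))

def rof_energy(u, f, lam):
    """Discrete counterpart of Equation (1): TV(u) + lam * ||f - u||_2^2."""
    return total_variation(u) + lam * float(np.sum((f - u) ** 2))

# Toy image: a vertical step edge (cartoon) plus a fine checkerboard (texture).
n = 16
cartoon = np.zeros((n, n))
cartoon[:, n // 2:] = 1.0
yy, xx = np.indices((n, n))
texture = 0.2 * (((xx + yy) % 2) * 2 - 1)
f = cartoon + texture

tv_cartoon = total_variation(cartoon)
tv_f = total_variation(f)
energy_cartoon = rof_energy(cartoon, f, lam=1.0)
energy_identity = rof_energy(f, f, lam=1.0)
```

In this toy setting the step-only image has a lower energy than the unmodified input, which matches the statement that the model pushes oscillatory content out of the cartoon component.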

To better describe oscillating textures, Meyer [2] introduced the $G$ norm, which is more suitable for handling textures, to replace the $L_{2}$ norm, and proposed an improved model:

\min_{u} \int_{Ω} |\nabla u| d x + λ {‖v‖}_{G},

(2)

where $v = f - u$ and ${‖v‖}_{G}$ is the infimum of the $L_{\infty}$ norm of $|g|$ over all vector fields $g$ satisfying $v = \nabla \cdot g$. However, the numerical solution of the $G$ norm is challenging. While subsequent studies have attempted to approximate or improve this, a significant advancement was made by Zhang et al. [9], who combined the TV norm and the global nuclear norm to represent the cartoon and texture components, respectively, proposing a decomposition model based on low-rank texture priors:

\min_{u, v} τ {‖D u‖}_{1} + μ {‖v‖}_{*} + \frac{1}{2} {‖u + v - f‖}_{2}^{2},

(3)

where $D$ is the discrete gradient operator and ${‖v‖}_{*}$ is the nuclear norm of $v$. This model obtains cleaner texture extraction results than traditional methods when the image has a good global structure. However, the low-rank model makes strong assumptions about the regularity of textures and thus still has limitations when dealing with complex, aperiodic textures.

Distinct from traditional deep learning methods for images, Ulyanov et al. [18] proposed the Deep Image Prior (DIP) method. This method leverages the inductive bias inherent in the structure of an untrained convolutional neural network as image prior information to solve various image inverse problems, such as denoising, super-resolution, and image inpainting. The corresponding model is formulated as follows:

\underset{θ}{\arg \min} \frac{1}{2} {‖H f_{θ} (z) - Y‖}_{2}^{2},

(4)

In the above formula, $f_{θ} (\cdot)$ represents a Convolutional Neural Network (CNN) generator with parameters $θ$, $H$ is a linear degradation operator, $Y$ denotes the observed degraded image, and $z$ is a randomly initialized input vector. The core of this method lies in modeling the image to be reconstructed as a differentiable function of the network parameters, which start from a random initialization and are fitted to a single degraded image only. This strategy allows the model to operate without large-scale training datasets while still achieving good reconstruction results.

Leveraging the advantages of DIP, Xu et al. [24] proposed a decomposition model combining low-rank priors with DIP. This model uses DIP to generate the cartoon component $u = T_{θ} (z)$, employs the nuclear norm ${‖v‖}_{*}$ to constrain the texture part, and introduces a weighted TV regularization term to impose gradient sparsity constraints on the cartoon component:

\min_{θ, v} \sum_{i = 1}^{N} τ_{i} ‖ {(D T_{θ} (z))}_{i} ‖_{1} + μ {‖v‖}_{*} + \frac{1}{2} {‖T_{θ} (z) + v - f‖}_{2}^{2},

(5)

where $N$ is the number of pixels in the image and $τ_{i}$ is the weight at pixel $i$. This model effectively preserves edge information through an adaptive weighting mechanism, achieving good decomposition results.

It is worth noting that the proposed “wavelet-domain global adaptive weighting” strategy shares conceptual similarities with adaptive representation learning methods that have emerged in other fields. Rezaei et al. [31] applied deep reinforcement learning to image hashing, where an adaptive bit selection mechanism dynamically retains the most informative hash bits and directly optimizes retrieval metrics—an idea similar to our approach of dynamically adjusting the regularization strength based on wavelet coefficient sparsity. Yang et al. [32] addressed the problem of data scarcity in industrial soft sensing by designing a self-modified dynamic domain adaptation framework that adaptively adjusts its strategy to improve cross-condition prediction robustness. Jiang et al. [33] employed pre-trained large language models and multi-modal generative models to synthesize defect images, using large-scale external generative priors to compensate for the lack of data in few-shot scenarios. Although the target tasks of these works differ from our cartoon-texture decomposition, they share a common core idea: relying on adaptive or generative priors to compensate for insufficient learning under limited information. Our DIP-based framework, which requires no pre-training or external data and achieves unsupervised decomposition solely through the network’s inherent structural prior and wavelet-domain adaptive weighting, can be regarded as a concrete instance within this broader paradigm.

3. New Model and Algorithm

The total variation regularization term in the spatial domain of Equation (5) is essentially a first-order sparse constraint on gradients and cannot distinguish scale or direction. It is therefore prone to misidentifying fine textures as structures (leaving residuals) or producing staircasing effects in smooth regions. To address this limitation, this paper leverages two major advantages of the wavelet transform: multi-scale frequency separation and directional sparse representation. Furthermore, we propose a global adaptive weighting strategy in the wavelet domain: the weight factor is recalculated from the global sparsity of the wavelet coefficients of the current structural component (it is inversely proportional to their $L_{1}$ norm), so that the regularization strength adapts to the image content. Specifically, we use the $L_{1}$ norm of the wavelet coefficients as the new structural regularization term in place of the traditional TV constraint. The wavelet decomposition separates the image into a low-frequency approximation and multiple high-frequency detail subbands, which naturally decouples the cartoon and texture components in the transform domain. The sparse distribution of coefficients enhances the difference between structure and texture, while the adaptive weighting suppresses texture residuals and protects structural edges. This achieves a more thorough and flexible decomposition than traditional TV models without the need for pre-trained networks. The new model is as follows:

\min_{θ, v} \int_{Ω} α (x) |W T_{θ} (z)| d x + μ {‖v‖}_{*} + \frac{λ}{2} {‖T_{θ} (z) + v - f‖}_{2}^{2},

(6)

where $f$ is the observed degraded image, $W (\cdot)$ denotes the wavelet transform, $θ$ represents the weight parameters of the deep neural network, and $α (x)$ is the adaptive regularization parameter that is adjusted dynamically according to the image content. $T_{θ} (z)$ is the output of the deep network with a fixed random vector $z$ as input, representing the structural component (cartoon part) of the image. $v$ is the texture component to be restored, and $μ, λ$ are regularization parameters.

Compared with existing frameworks based on traditional TV models and DIP-based weighted TV, the main improvements of the proposed model lie in the following aspects. First, wavelet-domain sparse regularization replaces spatial-domain TV regularization. Traditional TV relies solely on the first-order sparsity of gradients and lacks scale and directional discrimination, which easily leads to texture residuals or staircasing effects. In contrast, the wavelet transform decomposes the image into multi-scale and multi-directional subbands, and its $L_{1}$ norm constraint naturally fits the oscillatory characteristics of textures, thereby achieving a more thorough separation of structure and texture. Second, a wavelet-domain global adaptive weighting strategy is designed. Unlike the fixed regularization parameters in existing methods, this paper dynamically calculates the weight factor based on the global sparsity of the wavelet coefficients of the structural component (inversely proportional to their $L_{1}$ norm), enabling the regularization strength to adapt to the image content. This suppresses texture residuals while protecting structural edges. Third, the model continues the unsupervised learning paradigm of DIP, requiring no pre-trained networks or paired data. By embedding the wavelet transform as an explicit sparse prior, it compensates for the shortcomings of ordinary DIP networks, whose implicit priors are insufficient for representing complex textures. This enhances the model’s expressive power for multi-scale and directional features, thereby achieving more favorable generalization performance on natural images compared to existing methods.

Next, we present the optimization algorithm for the proposed model. We first introduce an auxiliary variable $u$ with the constraint $u = T_{θ} (z)$ and a scaled Lagrange multiplier $b$, which gives the augmented Lagrangian function:

L = \frac{λ}{2} {‖T_{θ} (z) - (f - v)‖}_{2}^{2} + \int_{Ω} α (x) |W u| d x + μ {‖v‖}_{*} + \frac{β}{2} {‖T_{θ} (z) - u + b‖}_{2}^{2} .

(7)

To facilitate the solution of Equation (7), we adopt the ADMM, formulated as follows:

θ^{k + 1} = \arg \min_{θ} \frac{λ}{2} {‖T_{θ} (z) - (f - v^{k})‖}_{2}^{2} + \frac{β}{2} {‖T_{θ} (z) - u^{k} + b^{k}‖}_{2}^{2},

(8)

u^{k + 1} = \arg \min_{u} \int_{Ω} α (x) |W u| d x + \frac{β}{2} {‖u - T_{θ^{k + 1}} (z) - b^{k}‖}_{2}^{2},

(9)

v^{k + 1} = \arg \min_{v} μ {‖v‖}_{*} + \frac{λ}{2} {‖T_{θ^{k + 1}} (z) - f + v‖}_{2}^{2},

(10)

b^{k + 1} = b^{k} + T_{θ^{k + 1}} (z) - u^{k + 1} .

(11)

For Equation (8), we use the Adam optimizer [34] for iteration.
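One way to read subproblem (8) is as a single least-squares fit of the network output. Writing $T = T_{θ} (z)$ and introducing the shorthand $a$ and $c$ (symbols chosen here for illustration, not used elsewhere in the paper), completing the square gives:

```latex
\frac{\lambda}{2}\,\|T - a\|_2^2 + \frac{\beta}{2}\,\|T - c\|_2^2
  = \frac{\lambda + \beta}{2}\,\Bigl\|T - \frac{\lambda a + \beta c}{\lambda + \beta}\Bigr\|_2^2
  + \frac{\lambda \beta}{2(\lambda + \beta)}\,\|a - c\|_2^2,
\qquad a = f - v^{k}, \quad c = u^{k} - b^{k}.
```

The last term does not depend on $θ$, so under this reading the Adam iterations simply fit the network output to the weighted average of $f - v^{k}$ and $u^{k} - b^{k}$.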

For the solution of Equation (9), we first handle the adaptive parameter $α (x)$. Utilizing the UPEN principle [35], we dynamically calculate the weight factor in each ADMM iteration based on the global sparsity of the wavelet coefficients of the current structural component, enabling the regularization strength to adaptively adjust according to the image content. Specifically, in the $(k + 1)$-th iteration, the weight factor is calculated as follows:

α^{k + 1} = \frac{λ}{2} \frac{{‖T_{θ^{k + 1}} (z) + v^{k} - f‖}_{2}^{2}}{{‖W (T_{θ^{k + 1}} (z))‖}_{1}},

(12)

This formula adjusts the regularization weight from the global $L_{1}$ norm of the wavelet coefficients of the current cartoon estimate, so $α^{k + 1}$ is a single scalar shared by all coefficients. When the estimate contains rich structures or textures (i.e., the $L_{1}$ norm of the wavelet coefficients is large), the denominator increases and the weight factor $α (x)$ decreases, which reduces the regularization strength to protect details. Conversely, when the estimate tends to be smooth (i.e., the $L_{1}$ norm of the wavelet coefficients is small), the weight factor $α (x)$ increases, enhancing noise suppression and thereby achieving content-adaptive sparse constraints.
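The weight update of Equation (12) reduces to a ratio of two scalars, which the following minimal sketch makes explicit. The function name `adaptive_weight`, the argument names, and the small constant `eps` guarding against a zero denominator are assumptions added for this example; the wavelet coefficients are passed in as a plain array so that any transform can be used.

```python
import numpy as np

def adaptive_weight(t_out, v, f, coeffs, lam, eps=1e-12):
    """Global weight of Equation (12).

    t_out  : current network output T_theta(z)
    v      : current texture estimate
    f      : observed image
    coeffs : wavelet coefficients W(T_theta(z)), any shape
    eps    : added safeguard against division by zero (not in the paper)
    """
    fidelity = 0.5 * lam * float(np.sum((t_out + v - f) ** 2))
    return fidelity / (float(np.sum(np.abs(coeffs))) + eps)

# Same residual, two different coefficient L1 norms.
f = np.ones((4, 4))
t_out = 0.5 * np.ones((4, 4))
v = np.zeros((4, 4))
coeffs_small_l1 = np.ones(16)        # L1 norm 16
coeffs_large_l1 = 2.0 * np.ones(16)  # L1 norm 32

alpha_small_l1 = adaptive_weight(t_out, v, f, coeffs_small_l1, lam=1.0)
alpha_large_l1 = adaptive_weight(t_out, v, f, coeffs_large_l1, lam=1.0)
```

Doubling the $L_{1}$ norm of the coefficients halves the weight, which is the behavior described in the text: richer content leads to weaker regularization.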

Because the wavelet transform $W$ is orthogonal, substituting the adaptive weight into Equation (9) and applying coefficient-wise soft-thresholding to the wavelet coefficients of $T_{θ^{k + 1}} (z) + b^{k}$ yields the closed-form solution of Equation (9):

u^{k + 1} = W^{- 1} (S_{α^{k + 1} / β} (W (T_{θ^{k + 1}} (z) + b^{k}))),

(13)

where $S_{τ} (x) = \max (|x| - τ, 0) \cdot \mathrm{sign} (x)$ is the soft-thresholding operator. In this way, we apply the global adaptive weight to the wavelet coefficients, thereby extracting the structural components of the image more flexibly.
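The update in Equation (13) can be sketched in a few lines once an orthogonal transform is available. To stay self-contained, the example below swaps in a one-level orthonormal Haar transform for the two-level db2 transform used in the paper; the function names `haar2d`, `ihaar2d`, `soft_threshold`, and `u_update` are hypothetical. As in Equation (13) as written, the threshold is applied to every coefficient, including the approximation band.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def haar2d(x):
    """One-level orthonormal 2D Haar transform of an even-sized float image."""
    rows = np.hstack([x[:, 0::2] + x[:, 1::2], x[:, 0::2] - x[:, 1::2]]) / SQRT2
    return np.vstack([rows[0::2] + rows[1::2], rows[0::2] - rows[1::2]]) / SQRT2

def ihaar2d(c):
    """Inverse of haar2d (its transpose, since the transform is orthonormal)."""
    h, w = c.shape[0] // 2, c.shape[1] // 2
    rows = np.empty_like(c)
    rows[0::2] = (c[:h] + c[h:]) / SQRT2
    rows[1::2] = (c[:h] - c[h:]) / SQRT2
    x = np.empty_like(c)
    x[:, 0::2] = (rows[:, :w] + rows[:, w:]) / SQRT2
    x[:, 1::2] = (rows[:, :w] - rows[:, w:]) / SQRT2
    return x

def soft_threshold(c, tau):
    """S_tau(c) = max(|c| - tau, 0) * sign(c), applied element-wise."""
    return np.sign(c) * np.maximum(np.abs(c) - tau, 0.0)

def u_update(t_out, b, alpha, beta):
    """Equation (13) with W replaced by the Haar stand-in above."""
    return ihaar2d(soft_threshold(haar2d(t_out + b), alpha / beta))

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
roundtrip_error = float(np.max(np.abs(ihaar2d(haar2d(x)) - x)))
u_new = u_update(x, np.zeros_like(x), alpha=0.1, beta=0.05)
```

The closed form relies on the transform preserving the Euclidean norm; for a non-orthogonal transform the $u$-subproblem would generally need an iterative solver instead.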

For Equation (10), using the singular value thresholding shrinkage operator [36], the solution is obtained as:

v^{k + 1} = S_{μ / λ} (f - T_{θ^{k + 1}} (z)) = U shrink (Σ, μ / λ) V^{T},

(14)

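where $S_{μ / λ}$ here denotes the singular value thresholding operator, $\mathrm{shrink} (Σ, τ) = \mathrm{sign} (Σ) \cdot \max \{|Σ| - τ, 0\}$ is applied to the diagonal entries of $Σ$, and $f - T_{θ^{k + 1}} (z) = U Σ V^{T}$ is the singular value decomposition of the residual.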
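A minimal NumPy sketch of the singular value thresholding step in Equation (14) follows. It assumes the residual is a single 2D matrix (how color channels are arranged is not specified here), and the function name `singular_value_threshold` is chosen for illustration. The toy input is a rank-one stripe pattern with a small random perturbation.

```python
import numpy as np

def singular_value_threshold(m, tau):
    """Shrink the singular values of a 2D matrix by tau and rebuild the matrix."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    # Singular values are non-negative, so shrink reduces to max(s - tau, 0).
    return (u * np.maximum(s - tau, 0.0)) @ vt

rng = np.random.default_rng(0)
stripes = np.outer(np.ones(32), np.sin(np.arange(32)))    # rank-one texture
residual = stripes + 0.01 * rng.standard_normal((32, 32))  # plays the role of f - T

tau = 1.0  # stands in for mu / lambda
v_new = singular_value_threshold(residual, tau)

s_before = np.linalg.svd(residual, compute_uv=False)
s_after = np.linalg.svd(v_new, compute_uv=False)
rank_after = int(np.linalg.matrix_rank(v_new))
```

The small singular values contributed by the perturbation fall below the threshold and are removed, while the dominant stripe pattern survives with its singular value reduced by `tau`.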

Based on the above discussion, the solution procedure for the new model is summarized in Algorithm 1:

Algorithm 1. The Algorithm Proposed in This Paper

Input: Initial values of $f, W, u, v, θ$ and selected parameters $τ, α, β, μ, λ$
for $k = 0 \to K$ do:
    Calculate $θ^{k + 1}$ via (8)
    Calculate $α^{k + 1}$ via (12) and $u^{k + 1}$ via (13)
    Calculate $v^{k + 1}$ via (14)
    Update Lagrange multiplier $b^{k + 1}$ via (11)
Output: Decomposed components $u$ and $v$
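The order of the updates in Algorithm 1 can be sketched as a short NumPy loop. This is only a skeleton under strong assumptions: the θ-step is replaced by a caller-supplied function `fit_theta` that receives the weighted-average target implied by the two quadratic terms of (8), and the demo passes a 3x3 mean filter (`box_blur`) as a crude placeholder for the Adam-trained DIP network. A one-level Haar transform again stands in for the two-level db2 transform, and all function names and default parameter values are hypothetical.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def haar2d(x):
    """One-level orthonormal 2D Haar transform (stand-in for W)."""
    r = np.hstack([x[:, 0::2] + x[:, 1::2], x[:, 0::2] - x[:, 1::2]]) / SQRT2
    return np.vstack([r[0::2] + r[1::2], r[0::2] - r[1::2]]) / SQRT2

def ihaar2d(c):
    """Inverse of haar2d."""
    h, w = c.shape[0] // 2, c.shape[1] // 2
    r = np.empty_like(c)
    r[0::2], r[1::2] = (c[:h] + c[h:]) / SQRT2, (c[:h] - c[h:]) / SQRT2
    x = np.empty_like(c)
    x[:, 0::2], x[:, 1::2] = (r[:, :w] + r[:, w:]) / SQRT2, (r[:, :w] - r[:, w:]) / SQRT2
    return x

def soft(c, tau):
    return np.sign(c) * np.maximum(np.abs(c) - tau, 0.0)

def svt(m, tau):
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    return (u * np.maximum(s - tau, 0.0)) @ vt

def box_blur(x):
    """Placeholder for the network fit: a 3x3 mean filter, NOT a DIP network."""
    p = np.pad(x, 1, mode="edge")
    h, w = x.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def decompose(f, fit_theta=box_blur, mu=1.0, lam=1.0, beta=0.05, iters=20):
    u, v, b = f.copy(), np.zeros_like(f), np.zeros_like(f)
    for _ in range(iters):
        # Step (8): both quadratic terms are minimized towards this target.
        target = (lam * (f - v) + beta * (u - b)) / (lam + beta)
        t = fit_theta(target)
        # Equation (12): global adaptive weight (1e-12 avoids division by zero).
        alpha = 0.5 * lam * np.sum((t + v - f) ** 2) / (np.sum(np.abs(haar2d(t))) + 1e-12)
        # Equation (13): soft-thresholding in the wavelet domain.
        u = ihaar2d(soft(haar2d(t + b), alpha / beta))
        # Equation (14): singular value thresholding of the residual.
        v = svt(f - t, mu / lam)
        # Equation (11): multiplier update.
        b = b + t - u
    return u, v, t

yy, xx = np.indices((16, 16))
f = (xx >= 8).astype(float) + 0.2 * np.sin(2.0 * xx)
u, v, t = decompose(f)
```

Because the placeholder filter has none of the inductive bias of an untrained CNN, the outputs of this skeleton should not be read as indicative of the quality reported in the paper; it only shows how the four updates feed into each other.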

4. Experiments

In this section, we will discuss the experimental setup and use different test images to verify the effectiveness of the new model. The test images are shown in Figure 1. Among them, Figure 1a is the Face image, Figure 1b is the Leg image, Figure 1c is the Word image, Figure 1d is the Table image, Figure 1e is the Monster image, Figure 1f is the TomJerry synthetic image, Figure 1g is the Barbara RGB natural image, and Figure 1h is the Bishapur RGB image.

4.1. Wavelet Selection

In this experiment, the wavelet transform module uses the db2 wavelet with a two-level decomposition. The db2 wavelet belongs to the Daubechies orthogonal wavelet family and has compact support, orthogonality, and two vanishing moments. The vanishing moments make the detail coefficients vanish on locally constant or linear regions, so the transform filters out smooth backgrounds and emphasizes oscillatory textures, which closely matches the design goal of this paper of replacing TV constraints with wavelet-domain sparse regularization; orthogonality and compact support ensure an accurate and computationally efficient decomposition. Limiting the decomposition to two levels balances multi-scale feature extraction against computational complexity and avoids the computational redundancy and edge artifacts caused by overly deep decomposition.
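The properties of the db2 filter mentioned above can be checked numerically. The sketch below assumes the common dbN naming convention, in which db2 is the four-tap Daubechies filter, and builds the high-pass filter from the low-pass one by the usual alternating-sign flip (sign conventions vary between libraries, which does not affect the moments being zero). The variable names are chosen for this example.

```python
import numpy as np

s3 = np.sqrt(3.0)
# Four-tap Daubechies low-pass (scaling) filter, normalized to unit energy.
low_pass = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4.0 * np.sqrt(2.0))
# Quadrature mirror high-pass (wavelet) filter.
high_pass = np.array([low_pass[3], -low_pass[2], low_pass[1], -low_pass[0]])

k = np.arange(4)
# Discrete moments sum_k k^p * g_k for p = 0, 1, 2.
moments = [float(np.sum(k ** p * high_pass)) for p in range(3)]

unit_norm = float(np.sum(low_pass ** 2))
shift_orthogonality = float(low_pass[0] * low_pass[2] + low_pass[1] * low_pass[3])
```

The moments of order 0 and 1 vanish while the moment of order 2 does not, so the high-pass filter annihilates constant and linear signals; the last two quantities confirm the unit norm and the orthogonality of the filter to its shift by two samples.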

4.2. Parameter Selection

In our algorithm, the maximum number of outer (ADMM) iterations is set to 300, and 300 Adam iterations are used in each outer iteration to approximately solve for the network parameters $θ$ in subproblem (8). The global weight $α (x)$ in the wavelet domain is updated adaptively, whereas the hyperparameters $μ$ (nuclear norm regularization parameter), $β$ (ADMM penalty parameter), and $λ$ (data fidelity term weight) need to be specified in advance. Based on experimental experience, good reconstruction results can be obtained by manually adjusting the parameters for each test image within the following ranges: $0.01 < μ < 10$, $0 < β < 0.1$, and $λ$ on the order of one or ten. In practical applications, fine-tuning can be performed within these ranges according to the image noise level and texture complexity.

4.3. Experimental Results

To verify the effectiveness of the new model, we compared the proposed method with several representative image decomposition methods: the variational model of [9], the method proposed in [37], the rolling guidance filter-based method [38], the parameter-free unrolling network of [39], the neural-network-based method of [21], and the model of [24] based on DIP and low-rank decomposition. Among them, Ref. [9] uses a combination of total variation and nuclear norm to characterize structure and texture; Ref. [37] constructs a decomposition model based on Sobolev space; Ref. [38] achieves scale-aware decomposition via a rolling guidance filter; Ref. [39] builds upon the Low Patch Rank model and unrolls an iterative optimization scheme to learn decomposition parameters from data, without requiring manual parameter tuning, while achieving fast inference and good generalization to natural images; Ref. [21] learns to separate structure and texture by training a neural network; Ref. [24] introduces weighted TV and low-rank constraints within the DIP framework to achieve the separation of structure and texture. In contrast, this paper introduces the wavelet transform into the DIP framework, replacing the traditional total variation constraint with a sparse regularization term on the wavelet coefficients. Without the need for pre-trained networks, it uses the multi-scale characteristics of wavelets to extract the structural information of the cartoon component more finely.

In the experiments, all comparison methods were manually tuned to obtain their best decomposition results. The visual comparisons below show that the proposed method outperforms the existing methods in terms of the thoroughness of structure and texture separation, edge detail preservation, and overall visual effect.

Figure 2 presents the decomposition results of the Leg image. It can be observed that the cartoon component in Figure 2a suffers from over-smoothing, with relatively blurred edge details. Although the edge sharpness in Figure 2b is somewhat improved, obvious texture residues still remain in the structural layer. Figure 2c achieves a relatively better overall processing effect for structure and texture; however, striped texture interference is still visible in the leg region. In Figure 2d, although the structure is free of texture contamination, it is overly smooth, losing some fine structural details. In contrast, Figure 2e, obtained by the method proposed in this paper, is capable of achieving more precise separation of structure and texture, yielding a better overall decomposition effect.

Figure 3 presents the decomposition results of the Table image. It can be observed that the cartoon components in Figure 3a,b still contain residual texture lines and details, failing to achieve complete separation. Although the decomposition in Figure 3c is relatively thorough in most areas, texture is still mixed into the cartoon component in local regions such as the table corners. Figure 3d shows the decomposition result of the proposed method, where the separation of structure and texture is cleaner, structural edges are preserved intact, texture extraction is more precise, and the overall decomposition effect is more ideal.

Figure 4 presents the decomposition results of the Monster image. The cartoon components in Figure 4a–c all exhibit texture residue in the eye region; in Figure 4b in particular, the cartoon component retains considerable texture. Furthermore, the texture component in Figure 4c is mixed with color information that should belong to the cartoon component. In contrast, the proposed model in Figure 4d achieves better decomposition performance in both the cartoon and texture components.

Figure 5 presents the decomposition results of the TomJerry synthetic image. The cartoon component in Figure 5a tends to be over-smoothed and fails to preserve detailed contours. The texture extraction results in Figure 5b,c are suboptimal: redundant interference remains, so the extracted texture is not pure. In contrast, the structural layer in Figure 5d performs better; the edge contours are clear and distinguishable, and the structural regions are clean, without obvious texture mixture, making the decomposition result more consistent with expectations.

Figure 6 presents the decomposition results of the Barbara RGB natural image. In the cartoon component of Figure 6a, texture lines and details remain in the edge areas of the table, indicating that structure and texture have not been completely separated. The cartoon component of Figure 6b also suffers from texture residue; for instance, significant texture remains in the character’s scarf area and at the table corner. The cartoon component in Figure 6c is similarly suboptimal: the contours of the character’s eyes and scarf area are blurred, and obvious texture residue remains at the table corner. Figure 6d shows the result of the proposed method. Compared with the above methods, it achieves an effective separation of structure and texture: it preserves the detailed features of structural edges and accurately extracts texture information, demonstrating better overall decomposition performance.

Figure 7 presents the decomposition results of the Bishapur RGB image. The leg edges in Figure 7a–c all appear blurred and fail to reveal structural details clearly. Although Figure 7b,c show good results in the facial region, structure is lost in the curtain area at the top right corner. In contrast, the edges of the structural map in Figure 7d are considerably more accurate; the various types of structural information are effectively assigned to the structural layer, and the overall decomposition meets expectations.

4.4. Parameter Sensitivity Analysis

To evaluate the influence of the key hyperparameter $\lambda$ (the data fidelity weight) on the decomposition results, we adopt a visual comparison, since no ground truth is available for computing quantitative metrics. The Word image is selected as the test sample. All other hyperparameters are fixed, and only $\lambda$ is varied (taking the values 0.1, 5, 10, and 50). Figure 8 shows the resulting cartoon and texture components under the different $\lambda$ values.

From Figure 8, one can observe that when $\lambda = 0.1$, the cartoon component retains significant texture information, and the texture component is overly sparse. When $\lambda = 50$, the texture component is contaminated by some structural edges, and the cartoon component exhibits slight over-smoothing. When $\lambda \in [5, 10]$, the visual quality of the decomposition results is similar; all effectively separate structure from texture while preserving edges well. This observation indicates that the model is insensitive to $\lambda$ within the range $\lambda \in [5, 10]$, and the decomposition results are stable. Therefore, manually selecting $\lambda$ within this range is reasonable, and satisfactory decomposition can be achieved without rigorous numerical scanning. Similar conclusions are drawn for other hyperparameters (e.g., the nuclear norm parameter $\mu$ and the ADMM penalty parameter $\beta$): within the reported parameter ranges, the visual quality remains stable. Due to space limitations, they are not presented here.
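To make the roles of the nuclear norm parameter and the ADMM penalty parameter more concrete, the following minimal sketch shows singular value thresholding [36], the standard proximal step for a nuclear-norm term. It assumes, as is typical in ADMM splittings, that the threshold is the ratio of the two parameters; the function name, the toy matrix, and the numerical values of `mu` and `beta` are illustrative assumptions and are not taken from the implementation used in this paper.

```python
import numpy as np

def singular_value_threshold(matrix, tau):
    """Proximal operator of tau * nuclear norm: shrink each singular value by tau."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (u * s_shrunk) @ vt

# Toy texture-like matrix: a rank-1 stripe pattern plus small random noise.
rng = np.random.default_rng(0)
stripes = np.outer(np.ones(32), np.sin(np.arange(32)))
noisy = stripes + 0.01 * rng.standard_normal((32, 32))

# Hypothetical values, NOT the settings reported in the paper.
mu, beta = 1.0, 2.0
low_rank = singular_value_threshold(noisy, mu / beta)

# Small singular values (the noise) fall below the threshold and are removed.
rank_after = np.linalg.matrix_rank(low_rank, tol=1e-8)
```

In this sketch a larger `mu` (or a smaller `beta`) raises the threshold and yields a lower-rank texture estimate, which is one way to read the stable parameter range described above.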

4.5. Ablation Study

To separately verify the contributions of wavelet regularization and adaptive weighting to the decomposition performance, we designed and compared the following three model variants. Baseline model: the model of Xu et al. [24], which adopts TV regularization and low-rank constraints, without the wavelet transform or adaptive weighting. The baseline has already been compared with the full model in Figures 2–7, so the significant texture residue in its cartoon component is not shown again here. Wavelet-only model (Variant A): based on the baseline model, this variant replaces the TV regularization with the $L_{1}$ norm of the wavelet coefficients but does not employ adaptive weighting (i.e., it uses a fixed weight). It is used to evaluate the contribution of replacing TV with the wavelet regularizer. Full model: the method proposed in this paper, i.e., the wavelet $L_{1}$ norm with global adaptive weighting.

Figure 9 shows the decomposition results of the wavelet-only variant and the full model on the Face image. From the cartoon components, it can be seen that the wavelet-only variant effectively alleviates the staircase effect, but some texture information still remains in the cartoon component, for example, on the character’s scarf and in the upper-right corner of the image, where texture residues of various shapes appear. The full model further removes these texture residues while preserving the sharpness of structural edges. A similar trend can be observed in the texture components: the wavelet-only variant extracts finer textures, while the full model further excludes structural information that has been mixed in.

The above results indicate that wavelet regularization is superior to TV regularization in capturing multi-scale texture features, and that the global adaptive weighting further suppresses texture residue and improves decomposition quality. Together, they achieve the best decomposition performance among the compared variants.

4.6. Robustness Analysis

To evaluate the stability of the proposed model with respect to random initialization and stochasticity in the optimization process, we fixed all hyperparameters and ran the model five times on the representative Barbara image, changing only the random seed. Since cartoon-texture decomposition lacks ground truth, we followed the common practice in this field and used visual comparison as the primary evaluation criterion. The experimental results show that, across the five runs, the obtained cartoon and texture components exhibit no noticeable visual differences (Figure 10 shows only one representative result). This observation is consistent with the finding reported in the original DIP work [18]: despite random initialization and stochastic perturbations during optimization, the visual quality of the network output remains relatively stable. Therefore, the proposed model is not sensitive to random initialization and possesses good robustness.

4.7. Runtime Comparison

To evaluate the computational efficiency of the proposed method, we ran the baseline model (Xu et al. [24]) and our full model on images of size 256 × 256 under the same hardware environment. Due to space limitations, only the average runtime from start to convergence (300 outer iterations with 300 Adam updates per inner iteration) is reported. The results are shown in Table 1.

As can be seen from Table 1, the runtime of the full model is approximately 2.29 times that of the baseline (427.14 s versus 186.42 s). The additional time overhead mainly comes from the forward and inverse wavelet transforms and the calculation of the adaptive weighting in each iteration. In terms of algorithmic complexity, both the wavelet transform and the soft-thresholding operation have a complexity of O(N) (where N is the number of pixels), which is much lower than the computational cost of the DIP network itself. Therefore, given the improvement in decomposition quality achieved by the proposed method, this computational cost is acceptable in practical applications (e.g., offline image processing and medical image analysis).
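As an illustration of why the wavelet step is linear in the number of pixels, the following minimal sketch implements a one-level orthonormal 2D Haar transform, soft-thresholding of the detail coefficients, and the inverse transform. The Haar wavelet, the single decomposition level, the threshold value, and all function names are assumptions made for this example; they are not the wavelet or the settings selected in this paper.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def haar2d(x):
    """One-level orthonormal 2D Haar transform of an array with even height and width."""
    a = (x[0::2, :] + x[1::2, :]) / SQRT2  # low-pass across row pairs
    d = (x[0::2, :] - x[1::2, :]) / SQRT2  # high-pass across row pairs
    ll = (a[:, 0::2] + a[:, 1::2]) / SQRT2
    lh = (a[:, 0::2] - a[:, 1::2]) / SQRT2
    hl = (d[:, 0::2] + d[:, 1::2]) / SQRT2
    hh = (d[:, 0::2] - d[:, 1::2]) / SQRT2
    return ll, (lh, hl, hh)

def ihaar2d(ll, details):
    """Inverse of haar2d."""
    lh, hl, hh = details
    h, w = ll.shape
    a = np.empty((h, 2 * w))
    d = np.empty((h, 2 * w))
    a[:, 0::2], a[:, 1::2] = (ll + lh) / SQRT2, (ll - lh) / SQRT2
    d[:, 0::2], d[:, 1::2] = (hl + hh) / SQRT2, (hl - hh) / SQRT2
    x = np.empty((2 * h, 2 * w))
    x[0::2, :], x[1::2, :] = (a + d) / SQRT2, (a - d) / SQRT2
    return x

def soft_threshold(c, tau):
    """Elementwise proximal operator of tau * |c|."""
    return np.sign(c) * np.maximum(np.abs(c) - tau, 0.0)

def wavelet_shrink(x, tau):
    """Shrink only the detail sub-bands; each pixel is touched a constant number of times."""
    ll, details = haar2d(x)
    return ihaar2d(ll, tuple(soft_threshold(c, tau) for c in details))

# Toy image: a piecewise-constant "cartoon" plus a fine checkerboard "texture".
rows, cols = np.indices((8, 8))
cartoon = (cols >= 4).astype(float)
texture = 0.1 * (-1.0) ** (rows + cols)
image = cartoon + texture

smoothed = wavelet_shrink(image, tau=0.25)  # hypothetical threshold
```

In this toy case the checkerboard lives entirely in the diagonal detail sub-band, so thresholding removes it while the block edge, which is aligned with the 2 × 2 Haar blocks, passes through unchanged. Every operation is an elementwise pass over arrays whose total size is N, which matches the O(N) cost stated above.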

5. Conclusions

This paper proposes an unsupervised image decomposition model fusing wavelet transform and DIP. Under the framework of DIP combined with low-rank decomposition, this method introduces a sparse regularization term of wavelet coefficients to replace the traditional total variation constraint, utilizing the multi-scale characteristics of wavelets to more finely characterize the structural information of the cartoon component. Experimental results demonstrate that, compared with existing methods, this model can separate texture and structure more thoroughly. It effectively removes texture residuals while better preserving edge details, achieving high-quality visual decomposition results.

Author Contributions

Z.M.: conceptualization, data collection, and writing of the original draft; L.F.: model construction, data validation, and manuscript formatting; Q.X.: formal analysis and result interpretation; Y.L.: experimental implementation and literature review. All authors contributed to the review and editing of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Rudin, L.; Osher, S.; Fatemi, E. Nonlinear total variation based noise removal algorithms. Phys. D 1992, 60, 259–268.
2. Meyer, Y. Oscillating Patterns in Image Processing and in Some Nonlinear Evolution Equations: The Fifteenth Dean Jacqueline B. Lewis Memorial Lectures; American Mathematical Society: Boston, MA, USA, 2001; Volume 22, p. 122.
3. Vese, L.; Osher, S. Modeling textures with total variation minimization and oscillating patterns in image processing. J. Sci. Comput. 2003, 19, 553–572.
4. Osher, S.; Sole, A.; Vese, L. Image decomposition and restoration using total variation minimization and the H^{-1} norm. SIAM J. Multiscale Model. Simul. 2003, 1, 349–370.
5. Aujol, J.; Aubert, G.; Blanc-Féraud, L.; Chambolle, A. Image decomposition into a bounded variation component and an oscillating component. J. Math. Imaging Vis. 2005, 22, 71–88.
6. Aujol, J.; Gilboa, G.; Chan, T. Structure-texture image decomposition – Modeling, algorithms, and parameter selection. Int. J. Comput. Vis. 2006, 67, 111–136.
7. Schaeffer, H.; Osher, S. A low patch-rank interpretation of texture. SIAM J. Imaging Sci. 2013, 6, 226–262.
8. Ono, S.; Miyata, T.; Yamada, I. Cartoon-texture image decomposition using blockwise low-rank texture characterization. IEEE Trans. Image Process. 2014, 23, 1128–1142.
9. Zhang, Z.; He, H. A customized low-rank prior model for structured cartoon-texture image decomposition. Signal Process. Image Commun. 2021, 96, 116308.
10. Chen, Y.; Wong, A.; Fang, Y.; Wu, Y.; Xu, L. Deep residual transform for multi-scale image decomposition. J. Comput. Vis. Imaging Syst. 2021, 6, 1–5.
11. Ennouni, A.; Sihamman, N.; Sabri, M.; Aarab, A. Early detection and classification approach for plant diseases based on multi-scale image decomposition. J. Comput. Sci. 2021, 17, 284–295.
12. Tadmor, E.; Nezzar, S.; Vese, L. A multiscale image representation using hierarchical (BV, L²) decompositions. Multiscale Model. Simul. 2004, 2, 554–579.
13. Tadmor, E.; Nezzar, S.; Vese, L. Multiscale hierarchical decomposition of images with applications to deblurring, denoising and segmentation. Commun. Math. Sci. 2008, 6, 281–307.
14. Starck, J.; Elad, M.; Donoho, D. Image decomposition via the combination of sparse representations and a variational approach. IEEE Trans. Image Process. 2005, 14, 1570–1582.
15. Oliveira, H.; Vermehren, V.; Cintra, R. Multidimensional wavelets for scalable image decomposition: Orbital wavelets. Int. J. Wavelets Multiresolution Inf. Process. 2020, 18, 2050038.
16. Liang, P.; Jiang, J.; Liu, X.; Ma, J. Fusion from decomposition: A self-supervised decomposition approach for image fusion. In Computer Vision – ECCV 2022; Springer: Cham, Switzerland, 2022; Volume 13678, pp. 719–735.
17. Liu, J.; Lin, R.; Wu, G.; Liu, R.; Luo, Z.; Fan, X. CoCoNet: Coupled contrastive learning network with multi-level feature ensemble for multi-modality image fusion. Int. J. Comput. Vis. 2024, 132, 1748–1775.
18. Ulyanov, D.; Vedaldi, A.; Lempitsky, V. Deep image prior. Int. J. Comput. Vis. 2020, 128, 1867–1888.
19. Gandelsman, Y.; Shocher, A.; Irani, M. “Double-DIP”: Unsupervised image decomposition via coupled deep-image-priors. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2019; pp. 11018–11027.
20. Kim, Y.; Ham, B.; Sohn, K. Structure-texture image decomposition using deep variational priors. IEEE Trans. Image Process. 2018, 28, 2692–2704.
21. Zhou, F.; Chen, Q.; Liu, B.; Qiu, G. Structure and texture-aware image decomposition via training a neural network. IEEE Trans. Image Process. 2020, 29, 3458–3473.
22. Cascarano, P.; Sebastiani, A.; Comes, M.C.; Franchini, G.; Porta, F. Combining weighted total variation and deep image prior for natural and medical image restoration via ADMM. In 21st International Conference on Computational Science and Its Applications (ICCSA 2021); Springer: Cham, Switzerland, 2021; pp. 39–46.
23. Guennec, A.; Aujol, J.-F.; Traonmilin, Y. Joint structure-texture low-dimensional modeling for image decomposition with a plug-and-play framework. SIAM J. Imaging Sci. 2025, 18, 1344–1371.
24. Xu, J.; Guo, Y.; Shang, W.; You, S. Image decomposition combining low-rank and deep image prior. Multimed. Tools Appl. 2024, 83, 13887–13903.
25. Nguyen, H.V.; Ulfarsson, M.O.; Sigurdsson, J.; Sveinsson, J.R. Deep sparse and low-rank prior for hyperspectral image denoising. In IGARSS 2022 – 2022 IEEE International Geoscience and Remote Sensing Symposium; IEEE: Piscataway, NJ, USA, 2022; pp. 1217–1220.
26. Zhang, M.; Yang, C.; Yuan, Y.; Guan, Y.; Wang, S.; Liu, Q. Multi-wavelet guided deep mean-shift prior for image restoration. Signal Process. Image Commun. 2021, 99, 116449.
27. Finder, S.E.; Amoyal, R.; Treister, E.; Freifeld, O. Wavelet convolutions for large receptive fields. In Computer Vision – ECCV 2024; Springer: Cham, Switzerland, 2025; Volume 15112.
28. Ramamonjisoa, M.; Firman, M.; Watson, J.; Lepetit, V.; Turmukhambetov, D. Single Image Depth Prediction with Wavelet Decomposition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2021; pp. 11084–11093.
29. Wang, S.; Gao, S.; Wu, F.; Zhuang, X. InDeed: Interpretable image deep decomposition with guaranteed generalizability. arXiv 2025, arXiv:2501.01127.
30. Günaydın, Y.Ş.; Şen, B. A multi-scale unsupervised feature extraction network with structured layer-wise decomposition. Appl. Sci. 2025, 15, 7194.
31. Rezaei, M.; Alaoui Mhamdi, M.A.; Allili, M. Adaptive bit selection via deep reinforcement learning for large-scale image hashing. Electronics 2026, 15, 1735.
32. Yang, Z.; Gao, W.; Chen, G.; Yu, J.; He, B.; Ye, L. Self-modified dynamic domain adaptation for industrial soft sensing. IEEE Trans. Autom. Sci. Eng. 2026, 23, 4679–4692.
33. Jiang, X.; Lin, Z.; Kong, X.; Chen, J.; Song, Z.; Xie, M. Enhancing few-shot surface defect recognition via pre-trained large generative models. IEEE Trans. Autom. Sci. Eng. 2026, 23, 643–654.
34. Kingma, D.; Ba, J. Adam: A method for stochastic optimization. arXiv 2015, arXiv:1412.6980.
35. Bortolotti, V.; Brown, R.J.S.; Fantazzini, P.; Landi, G.; Zama, F. Uniform Penalty inversion of two-dimensional NMR relaxation data. Inverse Probl. 2016, 33, 015003.
36. Cai, J.F.; Candès, E.J.; Shen, Z. A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 2010, 20, 1956–1982.
37. Xu, J.; Shang, W.; Hao, Y. A new cartoon + texture image decomposition model based on the Sobolev space. Signal Image Video Process. 2022, 16, 1569–1576.
38. Zhang, Q.; Shen, X.; Xu, L.; Jia, J. Rolling guidance filter. In European Conference on Computer Vision (ECCV); Springer International Publishing: Cham, Switzerland, 2014; Volume 8691, pp. 815–830.
39. Girometti, L.; Aujol, J.F.; Guennec, A.; Traonmilin, Y. Parameter-Free Structure-Texture Image Decomposition by Unrolling. In Scale Space and Variational Methods in Computer Vision; SSVM 2025, LNCS; Bubba, T.A., Gaburro, R., Gazzola, S., Papafitsoros, K., Pereyra, M., Schönlieb, C.B., Eds.; Springer: Cham, Switzerland, 2025; Volume 15667.

Figure 1. The test images. (a) Leg image; (b) Table image; (c) Monster image; (d) TomJerry synthetic image; (e) Barbara RGB natural image; (f) Bishapur RGB image; (g) Word image; (h) Face image.

Figure 2. The Leg image decomposition experiment. (a) structure and texture of [9], (b) structure and texture of [37], (c) structure and texture of [24], (d) structure and texture of [39], (e) structure and texture of the proposed model.

Figure 3. The Table image decomposition experiment. (a) structure and texture of [9], (b) structure and texture of [37], (c) structure and texture of [24], (d) structure and texture of the proposed model.

Figure 4. The Monster image decomposition experiment. (a) structure and texture of [9], (b) structure and texture of [38], (c) structure and texture of [24], (d) structure and texture of the proposed model.

Figure 5. The TomJerry synthetic image decomposition experiment. (a) structure and texture of [9], (b) structure and texture of [37], (c) structure and texture of [24], (d) structure and texture of the proposed model.

Figure 6. The Barbara RGB natural image decomposition experiment. (a) structure and texture of [9], (b) structure and texture of [38], (c) structure and texture of [24], (d) structure and texture of the proposed model.

Figure 7. The Bishapur RGB image decomposition experiment. (a) structure and texture of [9], (b) structure and texture of [21], (c) structure and texture of [24], (d) structure and texture of the proposed model.

Figure 8. Decomposition results of the Word image under different $\lambda$ values. The decomposition performance is stable and clearly superior when $\lambda \in [5, 10]$ compared to $\lambda = 0.1$ and $\lambda = 50$.

Figure 9. Ablation study comparison (cartoon and texture components of the Face image). (a) wavelet-only model without adaptive weighting; (b) full model (wavelet + adaptive weighting).

Figure 10. Visual comparison of the Barbara image decomposition results under different random seeds (Seed = 1, 10, 100, 1000).

Table 1. Runtime comparison of different models on 256 × 256 images.

Model	Average Runtime (s)
Baseline (Xu et al. [24])	186.42
Full model (proposed)	427.14

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
