1. Introduction
Image decomposition is a fundamental research topic in image processing and computer vision. Its goal is to decompose an observed image into two components with distinct morphological characteristics: a cartoon component (also known as the structural component) and a texture component. The cartoon component describes the large-scale structural information in the image, such as smooth regions and sharp edges, whereas the texture component contains small-scale oscillatory details, such as fabric textures and periodic patterns. This decomposition is widely used in areas such as object recognition, image segmentation, and image inpainting.
Research on image decomposition can be traced back to the development of variational methods. The ROF model proposed by Rudin, Osher, and Fatemi [1] pioneered the paradigm of image processing based on Total Variation (TV). This model decomposes an image into a cartoon component in the space of functions of bounded variation and an oscillatory component in the $L^2$ space. The TV regularization term imposes a constraint on the $L^1$ norm of the image gradient, enabling smoothing while preserving edges. However, it tends to produce piecewise constant results, easily losing fine texture details. To overcome this limitation, Meyer [2] introduced the $G$ norm, which is more suitable for describing oscillatory patterns, to replace the $L^2$ norm, establishing a more accurate texture modeling framework. However, because the definition of the $G$ norm involves the $L^\infty$ norm, its numerical solution is challenging. Vese and Osher [3] subsequently used the $L^p$ norm to approximate the Meyer model, proposing the VO model, which is straightforward to solve numerically. Osher, Sole, and Vese [4] further considered the special case of $p = 2$, proposing the OSV model based on TV. Aujol et al. [5,6] systematically studied energy function spaces suitable for different types of textures, providing theoretical guidance for parameter selection in variational decomposition models.
With the advancement of research, scholars have attempted to introduce more mathematical tools to improve decomposition performance. Schaeffer and Osher [7] first introduced low-rank priors into texture description, proposing a cartoon-texture separation model based on sparse low-rank decomposition by imposing nuclear norm constraints on image patches. Ono et al. [8] adopted block-wise nuclear norms to characterize texture components, constructing a decomposition model capable of handling various degradation types. Zhang et al. [9] combined the TV norm and global nuclear norm to represent cartoon and texture components, respectively, proposing a decomposition model based on low-rank texture priors that performs excellently when images possess well-defined global structures. In addition, methodologies such as multi-scale decomposition [10,11,12,13], dictionary learning [14] and wavelet transforms [15] have continuously enriched the theoretical framework of image decomposition. As a classic multi-resolution analysis tool, the wavelet transform can decompose an image into sub-bands of different scales and directions. Its coefficients possess natural sparsity. The low-frequency sub-bands capture the general structure of the image, while the high-frequency sub-bands contain detailed information in various directions. This multi-scale and multi-directional decomposition approach is particularly suitable for describing texture components with oscillatory characteristics. However, although combining low-rank priors with traditional models solves some texture separation problems, it remains limited by the expressive power of hand-crafted regularizers.
To overcome the expressive limitations of traditional hand-crafted regularizers, researchers have begun to shift towards data-driven deep learning methods. However, supervised learning approaches typically require a large amount of paired training data, which is particularly difficult to obtain for image decomposition tasks, because clean ground truth for cartoon-texture separation is often unavailable in real-world scenarios. In recent advances in image decomposition, unsupervised and self-supervised learning methods have continued to achieve significant breakthroughs. Liang et al. [16] proposed Fusion from Decomposition, which achieves multi-modal image fusion through self-supervised decomposition. Its core unsupervised component separation strategy provides a new perspective for cartoon-texture decomposition. Liu et al. [17] proposed CoCoNet, which achieves finer feature characterization in image component separation by coupling contrastive learning with multi-level feature integration. These studies collectively indicate that unsupervised deep learning is becoming an important development direction in this field.
The Deep Image Prior (DIP) [18] has emerged as a powerful alternative to data-driven methods, demonstrating that the structure of a convolutional neural network itself acts as a sufficient prior for image restoration tasks like denoising and inpainting. Unlike traditional deep learning, DIP requires no external training data, relying instead on the network’s implicit bias towards natural image statistics. For image decomposition, this characteristic is particularly valuable: the network’s inherent tendency to fit structured patterns allows it to effectively separate components with high self-similarity (such as textures) from complex mixtures. The “Double-DIP” framework proposed by Gandelsman et al. [19] systematically applied this idea to image decomposition for the first time. By coupling multiple DIP networks to generate different image layers respectively, and imposing reconstruction constraints and separation losses, this method is capable of achieving various decomposition tasks such as image dehazing, foreground/background segmentation, and watermark removal under unsupervised conditions. This work verified the universality and effectiveness of DIP in image decomposition. Kim et al. [20] further integrated deep variational priors into the traditional TV-
model. Through a Plug-and-Play approach, they used convolutional neural networks to learn the prior distribution of structural images, successfully realizing the distinction between high-amplitude details and structural edges. Zhou et al. [21] proposed a structure and texture-aware decomposition method. By using deep neural networks to uniformly optimize the decomposition objective function, they constructed a self-supervised learning mechanism, achieving model training and optimization without the need for paired ground truth labels. Cascarano et al. [22] introduced spatially adaptive weighted TV regularization into the DIP framework, solved it via ADMM, and adaptively updated local regularization parameters using image gradient information, achieving better results than standard DIP in image restoration tasks. Nevertheless, these methods often require pre-trained networks or rely on specific forms of variational models. Addressing these issues, Guennec et al. [23] proposed a joint structure-texture modeling framework, which performs joint regularization on structural and texture components as a whole. By embedding deep neural networks through a Plug-and-Play framework, it can still effectively generalize to natural images after training on synthetic data, providing new insights for overcoming the limitations of traditional decomposition models.
Building on the advantages of DIP, Xu et al. [24] proposed a decomposition model that combines a low-rank prior with DIP. This model uses an adaptive weighting mechanism to better preserve edge information and achieve good decomposition results. However, the TV regularizer is essentially a first-order sparse constraint based on spatial gradients. It tends to generate piecewise constant cartoon components, which can easily cause staircasing effects in smooth regions. It also has a limited ability to represent complex textures with multi-scale and directional features. When image textures show periodic or fine structures, the TV term might mistakenly treat some texture as structure, leaving it in the cartoon part, or over-smooth the image, leading to the loss of texture details.
Meanwhile, the combination of wavelet transforms and deep learning has attracted widespread attention. For instance, Nguyen et al. [25] proposed combining sparse low-rank priors with DIP, utilizing the 2D discrete wavelet transform to obtain sparse representations, and achieved excellent results in hyperspectral image denoising. In the field of image restoration, the multi-wavelet guided deep prior method proposed by Zhang et al. [26] obtained stronger prior information by integrating the structural representation capability of wavelet transforms with the learning ability of deep networks. The WTConv wavelet convolutional layer proposed by Fogel et al. [27] achieved a very large receptive field by stacking wavelet decompositions, while also enhancing the response to low-frequency information, further verifying the unique advantage of wavelet transforms in representing low-frequency structures. Ramamonjisoa et al. [28] applied wavelet decomposition to single-image depth prediction, demonstrating that wavelet coefficients can be learned without direct supervision and can significantly reduce the computational cost of the decoder. Recently, the interpretable deep image decomposition framework proposed by Gao et al. [29] improved the interpretability of decomposition results while ensuring model generalization by combining hierarchical Bayesian modeling with deep learning. In addition, some studies have begun to explore combining multi-scale decomposition with unsupervised learning, such as the multi-branch autoencoder structure proposed by Günaydın and Sen [30], which decomposes images into smooth, detail, and residual components.
Inspired by research on wavelet-domain ADMM deep networks and multi-wavelet guided deep priors, this paper proposes introducing wavelet transforms into the DIP framework. We replace the original TV regularizer with the magnitude of wavelet coefficients to build a new image decomposition model. This model aims to use the ability of wavelets to distinguish between structure and texture to better describe the sparse structure of the cartoon component. This allows for a more thorough separation of texture from the image, effectively alleviating the staircasing effects and texture residuals caused by traditional TV regularizers.
It should be noted that the target tasks of the aforementioned wavelet-based deep prior methods are fundamentally different from the cartoon-texture decomposition addressed in this paper. Moreover, in those existing frameworks, wavelets typically serve as a pre-processing transform or as part of the network architecture, rather than as an explicit sparse constraint to replace TV regularization. Furthermore, PnP frameworks rely on external pre-trained models, whereas this paper employs a single-image-specific, untrained DIP. To the best of our knowledge, no prior work has incorporated the sparse regularization of wavelet coefficients as a direct replacement for TV within a DIP + low-rank decomposition framework to solve the cartoon-texture disentanglement problem. This paper fills this gap and, on this basis, designs a global adaptive weighting strategy in the wavelet domain.
The remainder of this paper is organized as follows:
Section 2 introduces the related works that are closely associated with the model proposed in this paper.
Section 3 presents our proposed model and provides the algorithm for solving this new model.
Section 4 describes the numerical experimental setup and comparative experimental results.
Section 5 presents the conclusions.
2. Related Work
The variational methods for image decomposition can be traced back to the ROF model proposed by Rudin, Osher, and Fatemi [1], which formulates the image decomposition problem as:
$$\min_{u} \int_{\Omega} |\nabla u| \, dx + \frac{\lambda}{2} \| f - u \|_{2}^{2}, \qquad (1)$$
where $f$ is the observed image, $u$ is the cartoon component, and $\lambda > 0$ is a balancing parameter. The first term is the Total Variation (TV) regularization term, which achieves edge-preserving smoothing by penalizing image gradients; the second term is the fidelity term, ensuring the similarity between the decomposition result and the original image. The ROF model constrains the cartoon component to the space of functions of bounded variation, while the texture component lies in the $L^2$ space. Although this model can effectively preserve edges, it tends to over-smooth fine textures by treating them as noise.
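To make the two terms concrete, the following is a minimal NumPy sketch of a discrete ROF-type energy: an isotropic TV term plus a squared-error fidelity term. The function names, the forward-difference discretization, the `lam / 2` weighting, and the two 8x8 test images are assumptions made for illustration, not details taken from [1].

```python
import numpy as np

def total_variation(u):
    """Isotropic discrete TV using forward differences (zero difference at the far border)."""
    dx = np.diff(u, axis=1, append=u[:, -1:])  # horizontal differences
    dy = np.diff(u, axis=0, append=u[-1:, :])  # vertical differences
    return float(np.sum(np.sqrt(dx ** 2 + dy ** 2)))

def rof_energy(u, f, lam):
    """TV(u) plus a squared-error fidelity term weighted by lam / 2 (assumed form)."""
    return total_variation(u) + 0.5 * lam * float(np.sum((f - u) ** 2))

# Hypothetical 8x8 test images: one vertical edge versus a checkerboard texture.
step = np.zeros((8, 8))
step[:, 4:] = 1.0
checker = (np.indices((8, 8)).sum(axis=0) % 2).astype(float)

tv_step = total_variation(step)        # a single edge contributes little TV
tv_checker = total_variation(checker)  # oscillation at every pixel contributes a lot
```

Because the oscillating pattern has a much larger TV than the single edge, a TV-penalized cartoon component keeps the edge and pushes fine texture into the residual, which is the behavior described above.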
To better describe oscillating textures, Meyer [2] introduced the $G$ norm, which is more suitable for handling textures, to replace the $L^2$ norm, and proposed an improved model:
$$\inf_{u} \int_{\Omega} |\nabla u| \, dx + \lambda \| f - u \|_{G}, \qquad (2)$$
where $\| \cdot \|_{G}$ is defined as the norm on the function space $G$ of functions $v$ satisfying $v = \partial_x g_1 + \partial_y g_2$ with $g_1, g_2 \in L^\infty(\Omega)$. However, the numerical solution of the $G$ norm is challenging. While subsequent studies have attempted to approximate or improve this, a significant advancement was made by Zhang et al. [9]. They combined the TV norm and the global nuclear norm to represent cartoon and texture components, respectively, proposing a decomposition model based on low-rank texture priors.
This model can obtain cleaner texture extraction results than traditional methods when the image has a good global structure. However, the low-rank model makes strong assumptions about the regularity of textures and thus still has limitations when dealing with complex, aperiodic textures.
Distinct from traditional deep learning methods for images, Ulyanov et al. [18] proposed the Deep Image Prior (DIP) method. This method leverages the inductive bias inherent in the structure of an untrained convolutional neural network as image prior information to solve various image inverse problems, such as denoising, super-resolution, and image inpainting. The corresponding model is formulated as follows:
$$\min_{\theta} \| H \, T_{\theta}(z) - f \|_{2}^{2}, \qquad x = T_{\theta}(z), \qquad (4)$$
where $T_{\theta}$ represents a Convolutional Neural Network (CNN) generator with parameters $\theta$, $H$ is a linear degradation operator, $f$ is the degraded observation, $x$ denotes a natural image, and $z$ is a randomly initialized input vector. The core of this method lies in modeling the image to be reconstructed as a differentiable function of the randomly initialized neural network parameters, fitting only to a single degraded image during the optimization process. This strategy allows the model to operate without relying on large-scale training datasets while still achieving good reconstruction results.
Leveraging the advantages of DIP, Xu et al. [24] proposed a decomposition model combining low-rank priors with DIP. This model uses DIP to generate the cartoon component
, employs the low-rank norm
to constrain the texture part, and introduces a weighted TV regularization term to impose gradient sparsity constraints on the cartoon component:
where
is the image size. This model effectively preserves edge information through an adaptive weighting mechanism, achieving good decomposition results.
It is worth noting that the proposed “wavelet-domain global adaptive weighting” strategy shares conceptual similarities with adaptive representation learning methods that have emerged in other fields. Rezaei et al. [31] applied deep reinforcement learning to image hashing, where an adaptive bit selection mechanism dynamically retains the most informative hash bits and directly optimizes retrieval metrics, an idea similar to our approach of dynamically adjusting the regularization strength based on wavelet coefficient sparsity. Yang et al. [32] addressed the problem of data scarcity in industrial soft sensing by designing a self-modified dynamic domain adaptation framework that adaptively adjusts its strategy to improve cross-condition prediction robustness. Jiang et al. [33] employed pre-trained large language models and multi-modal generative models to synthesize defect images, using large-scale external generative priors to compensate for the lack of data in few-shot scenarios. Although the target tasks of these works differ from our cartoon-texture decomposition, they share a common core idea: relying on adaptive or generative priors to compensate for insufficient learning under limited information. Our DIP-based framework, which requires no pre-training or external data and achieves unsupervised decomposition solely through the network’s inherent structural prior and wavelet-domain adaptive weighting, can be regarded as a concrete instance within this broader paradigm.
3. New Model and Algorithm
The total variation regularization term in the spatial domain of Equation (5) is essentially a first-order sparse constraint on gradients and cannot distinguish scale or direction. It is therefore prone to misidentifying fine textures as structures (leaving residuals) or producing staircasing effects in smooth regions. To address this limitation, this paper leverages two advantages of the wavelet transform: multi-scale frequency separation and directional sparse representation. Furthermore, we propose a global adaptive weighting strategy in the wavelet domain: weight factors are calculated dynamically from the global sparsity of the wavelet coefficients of the current structural component (inversely proportional to the coefficient energy), so that the regularization strength adapts to the image content. Specifically, the $\ell_1$ norm of the wavelet coefficients is used as the new structural regularization term in place of the traditional TV constraint. The wavelet decomposition separates the image into low-frequency approximations and multiple high-frequency detail subbands, naturally decoupling the cartoon and texture components in the transform domain. The sparse distribution of coefficients enhances the difference between structure and texture, while the adaptive weighting suppresses texture residuals and protects structural edges. This achieves a more thorough and flexible decomposition than traditional TV models without the need for pre-trained networks. The new model is as follows:
where
is the observed degraded image,
denotes the wavelet transform,
represents the weight parameters of the deep neural network, and
is the adaptive regularization parameter that can be dynamically adjusted according to the image content.
is the output of the deep network with a fixed random vector
as input, representing the structural component (cartoon part) of the image.
is the texture component to be restored, and
are regularization parameters.
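As an illustration of why a wavelet-domain sparsity term can tell structure from texture, the sketch below applies a one-level orthonormal 2D Haar transform and compares the magnitude of the detail coefficients for two toy images. The Haar wavelet, the single decomposition level, the helper names, and the test images are assumptions chosen for brevity; the wavelet actually used in the model is not specified here.

```python
import numpy as np

def haar_dwt2(x):
    """One-level orthonormal 2D Haar transform; x needs even height and width."""
    r = np.sqrt(2.0)
    a = (x[0::2, :] + x[1::2, :]) / r  # low-pass along rows
    d = (x[0::2, :] - x[1::2, :]) / r  # high-pass along rows
    ll = (a[:, 0::2] + a[:, 1::2]) / r  # approximation subband
    lh = (a[:, 0::2] - a[:, 1::2]) / r  # detail subbands
    hl = (d[:, 0::2] + d[:, 1::2]) / r
    hh = (d[:, 0::2] - d[:, 1::2]) / r
    return ll, lh, hl, hh

def detail_l1(x):
    """Sum of absolute values of the three high-frequency subbands."""
    _, lh, hl, hh = haar_dwt2(x)
    return float(np.abs(lh).sum() + np.abs(hl).sum() + np.abs(hh).sum())

# Hypothetical 8x8 images: a cartoon-like step edge and a checkerboard texture.
step = np.zeros((8, 8))
step[:, 4:] = 1.0
checker = (np.indices((8, 8)).sum(axis=0) % 2).astype(float)

cartoon_detail = detail_l1(step)     # edge aligned with the Haar grid: no detail energy
texture_detail = detail_l1(checker)  # oscillation lands in the diagonal subband
```

In this toy case the step image has no high-frequency coefficients at all, while the checkerboard concentrates its energy in one detail subband, so a penalty on coefficient magnitudes treats the two very differently. Real edges that do not align with the sampling grid would produce a few nonzero detail coefficients.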
Compared with existing frameworks based on traditional TV models and DIP-based weighted TV, the main improvements of the proposed model lie in the following aspects. First, wavelet-domain sparse regularization replaces spatial-domain TV regularization. Traditional TV relies solely on the first-order sparsity of gradients and lacks scale and directional discrimination, which easily leads to texture residuals or staircasing effects. In contrast, the wavelet transform decomposes the image into multi-scale and multi-directional subbands, and its $\ell_1$ norm constraint naturally fits the oscillatory characteristics of textures, thereby achieving a more thorough separation of structure and texture. Second, a wavelet-domain global adaptive weighting strategy is designed. Unlike the fixed regularization parameters in existing methods, this paper dynamically calculates weight factors based on the global sparsity of the wavelet coefficients of the structural component (inversely proportional to the coefficient energy), enabling the regularization strength to adapt to the image content. This suppresses texture residuals while protecting structural edges. Third, the model continues the unsupervised learning paradigm of DIP, requiring no pre-trained networks or paired data. By embedding the wavelet transform as an explicit sparse prior, it compensates for the shortcomings of ordinary DIP networks, whose implicit priors are insufficient for representing complex textures. This enhances the model’s expressive power for multi-scale and directional features, thereby achieving more favorable generalization performance on natural images compared to existing methods.
Next, we present the optimization algorithm for the proposed model. First, by letting
, we obtain the augmented Lagrangian function as follows:
To facilitate the solution of Equation (7), we adopt the ADMM, formulated as follows:
For Equation (8), we use the Adam optimizer [34] for iteration.
For the solution of Equation (9), we first handle the adaptive parameter
. Utilizing the UPEN principle [
35], we dynamically calculate the weight factor in each ADMM iteration based on the global sparsity of the wavelet coefficients of the current structural component, enabling the regularization strength to adaptively adjust according to the image content. Specifically, in the
-th iteration, the weight factor is calculated as follows:
This formula dynamically adjusts the regularization weight by calculating the global energy of the wavelet coefficients of the current reconstructed image. Specifically, when the image
contains rich structures or textures (i.e., the
norm of the wavelet coefficients is large), the denominator increases, causing the weight factor
to decrease. This reduces the regularization strength to protect details. Conversely, when the image tends to be smooth (i.e., the
norm of the wavelet coefficients is small), the weight factor
increases, enhancing noise suppression capabilities and thereby achieving content-adaptive sparse constraints.
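The qualitative behavior described above can be sketched as follows. This is only an illustration under stated assumptions: the weight is taken to be a constant divided by the global $\ell_1$ norm of the coefficients, and the constants `c` and `eps` as well as the function name are hypothetical. The exact UPEN-based expression used in the paper may differ.

```python
import numpy as np

def adaptive_weight(coeffs, c=1.0, eps=1e-8):
    """Weight inversely proportional to the global l1 norm of the coefficients.

    c and eps are hypothetical constants; the l1 norm is an assumed choice.
    """
    return c / (float(np.sum(np.abs(coeffs))) + eps)

# Hypothetical coefficient vectors for a smooth and a detail-rich image.
smooth_coeffs = np.array([0.1, 0.0, -0.05, 0.0])
rich_coeffs = np.array([2.0, -1.5, 0.8, -3.0])

w_smooth = adaptive_weight(smooth_coeffs)  # small norm: stronger regularization
w_rich = adaptive_weight(rich_coeffs)      # large norm: weaker regularization
```

The small `eps` keeps the weight finite when all coefficients vanish. In an iterative scheme the weight would be recomputed from the current structural estimate at every iteration.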
Substituting the adaptive weights into Equation (9) and then applying coefficient-wise soft-thresholding to the wavelet coefficients of
, yields the closed-form solution of Equation (9).
where
is the soft-thresholding operator. In this way, we achieve adaptive weighting of the wavelet coefficients, thereby extracting the structural components of the image more flexibly.
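For reference, coefficient-wise soft-thresholding and its use with an orthonormal transform can be sketched as below. The 4-point Haar matrix `W`, the signal `y`, and the threshold value are assumptions for illustration. The sketch solves the generic problem $\min_x \tau \|Wx\|_1 + \tfrac{1}{2}\|x - y\|_2^2$, which has this closed form only when $W$ is orthonormal, and it is not claimed to match the exact subproblem in Equation (9).

```python
import numpy as np

def soft_threshold(w, tau):
    """Shrink each coefficient toward zero by tau; values within [-tau, tau] become 0."""
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def wavelet_shrink(y, W, tau):
    """Minimize tau * ||W x||_1 + 0.5 * ||x - y||_2^2 for an orthonormal matrix W."""
    return W.T @ soft_threshold(W @ y, tau)

# Orthonormal 4-point Haar matrix (two approximation-like rows, two detail rows).
s = 1.0 / np.sqrt(2.0)
W = np.array([[0.5, 0.5, 0.5, 0.5],
              [0.5, 0.5, -0.5, -0.5],
              [s, -s, 0.0, 0.0],
              [0.0, 0.0, s, -s]])

# Hypothetical signal: two flat segments with a small oscillation on top.
y = np.array([4.0, 4.2, 1.0, 0.8])
x = wavelet_shrink(y, W, tau=0.2)
```

Here the small detail coefficients fall below the threshold and are removed, so the result is piecewise flat while the jump between the two segments is kept. This sketch thresholds every coefficient, including the coarse ones; in practice the approximation subband is often left unthresholded.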
For Equation (10), using the singular value thresholding shrinkage operator [36], the solution is obtained as:
where
is the soft thresholding operator,
, and
is its singular value decomposition.
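A minimal NumPy sketch of singular value thresholding is given below. The function name `svt` and the example matrix are assumptions; the operator itself is the standard proximal map of the nuclear norm, which soft-thresholds the singular values and rebuilds the matrix.

```python
import numpy as np

def svt(m, tau):
    """Singular value thresholding: soft-threshold the singular values of m by tau."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)  # singular values are nonnegative
    return (u * s_shrunk) @ vt           # scales the columns of u, then recombines

# Hypothetical example: a rank-1 pattern plus a weak perturbation.
rng = np.random.default_rng(0)
low_rank = np.outer(np.arange(1.0, 7.0), np.ones(5))
noisy = low_rank + 0.01 * rng.standard_normal((6, 5))
texture_est = svt(noisy, tau=0.5)
```

Singular values below `tau` are set to zero, so the output has lower rank than the input whenever the perturbation is weak relative to the threshold. This is what makes the operator suitable for regular, repetitive texture.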
Based on the above discussion, the algorithm for the new model is summarized in Algorithm 1:
| Algorithm 1. The Algorithm Proposed in This Paper |
| Input: Initial values of and selected parameters |
| for do: |
| Calculate via (8) |
| Calculate via (13) |
| Calculate via (14) |
| Update Lagrange multiplier via (11) |
| Output: Decomposed components and |