Remote Sensing Image Denoising via Low-Rank Tensor Approximation and Robust Noise Modeling

Abstract: Noise removal is a fundamental problem in remote sensing image processing. Most existing methods, however, have not yet attained sufficient robustness in practice, due to more or less neglecting the intrinsic structures of remote sensing images and/or underestimating the complexity of realistic noise. In this paper, we propose a new remote sensing image denoising method by integrating intrinsic image characterization and robust noise modeling. Specifically, we use low-Tucker-rank tensor approximation to capture the global multi-factor correlation within the underlying image, and adopt a non-identical and non-independent distributed mixture of Gaussians (non-i.i.d. MoG) assumption to encode the statistical configurations of the embedded noise. Then, we incorporate the proposed image and noise priors into a full Bayesian generative model and design an efficient variational Bayesian algorithm to infer all involved variables by closed-form equations. Moreover, adaptive strategies for the selection of hyperparameters are further developed to make our algorithm free from burdensome hyperparameter-tuning. Extensive experiments on both simulated and real multispectral/hyperspectral images demonstrate the superiority of the proposed method over the compared state-of-the-art ones.


Introduction
Remote sensing images, such as multispectral images (MSIs) and hyperspectral images (HSIs), provide abundant spatial and spectral information about real scenes and play a central role in many real-world applications, such as urban planning, surveillance, and environmental monitoring. Unfortunately, during the acquisition process, remote sensing images are often corrupted by various kinds of noise, such as Gaussian noise, speckle noise, and stripe noise. Image denoising, a fundamental problem in remote sensing image processing, aims to recover the underlying clean image from its noisy observation. To obtain effective signal-noise separation, denoising methods usually rely on prior assumptions imposed on the image and noise components.
One of the key issues in denoising methods is the rational design of an image prior, which encourages some expected properties of the denoised image. As a significant property of remote sensing images, low-rankness means that high-dimensional image data lie in a low-dimensional subspace, which can also be considered to be sparsity over a learned basis. Methods based on low-rankness along this line can be categorized into two classes: matrix-based and tensor-based ones. Matrix-based methods perform low-rank matrix approximation on the unfolding (tensor matricization) of the noisy image along the spectral mode. To obtain an efficient low-rank solution, low-rank matrix factorization methods factorize the objective matrix into a product of two flat ones [1][2][3][4][5][6][7][8][9]; rank minimization methods penalize some surrogates of the rank function, such as the convex envelope nuclear norm [10][11][12] or tighter non-convex metrics, e.g., log-determinant penalty [13], Schatten p-norm [14,15], γ-norm (Laplace function) [16], and truncated/weighted nuclear norm [17,18]. These matrix-based methods, however, can capture only the spectral correlation but ignore the global multi-factor correlation in remote sensing images, which usually leads to suboptimal results under severe noise corruption. On the other hand, tensor-based methods explicitly model the underlying image as a low-rank tensor, by solving a tensor decomposition model or minimizing the corresponding induced tensor rank [19]. Representative works include CANDECOMP/PARAFAC (CP) decomposition with CP rank [20][21][22], Tucker decomposition with Tucker rank [23][24][25][26][27], tensor singular value decomposition (t-SVD) with tubal rank [28,29], and tensor train (TT) decomposition with TT rank [30][31][32]. 
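As a toy illustration of this matrix-based strategy, the sketch below (ours, not taken from any cited method) denoises a small image tensor by a truncated SVD of its spectral unfolding; practical methods replace the fixed rank r with the regularized surrogates discussed above.

```python
import numpy as np

def denoise_spectral_lowrank(Y, r):
    """Rank-r approximation of the spectral (mode-3) unfolding of a noisy
    (I, J, K) image tensor Y; a bare-bones stand-in for matrix-based
    low-rank denoising."""
    I, J, K = Y.shape
    Y3 = Y.reshape(I * J, K).T                  # K x (I*J) spectral unfolding
    U, s, Vt = np.linalg.svd(Y3, full_matrices=False)
    X3 = (U[:, :r] * s[:r]) @ Vt[:r, :]         # keep the r leading components
    return X3.T.reshape(I, J, K)

# Toy check: a rank-1 spectral structure should survive mild Gaussian noise
rng = np.random.default_rng(0)
clean = np.outer(rng.random(16), rng.random(4)).reshape(4, 4, 4)
noisy = clean + 0.01 * rng.standard_normal(clean.shape)
recovered = denoise_spectral_lowrank(noisy, 1)
```

By the Eckart-Young theorem, the truncated SVD is the best rank-r approximation of the unfolding, so the recovery error is bounded by the noise level when the clean unfolding is truly low-rank.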
Considering that each tensor decomposition represents a specific type of high-dimensional data structure, recent works attempt to combine the merits of different low-rank tensor models, such as the hybrid CP-Tucker model [33] and the Kronecker-basis-representation (KBR)-based tensor sparsity measure [34,35]. By characterizing the correlations across both spatial and spectral modes, the above tensor-based methods have the advantage of preserving the intrinsic multilinear structure of remote sensing images, achieving state-of-the-art denoising performance.
Another critical issue in current denoising methods is the choice of a noise prior, which characterizes the statistical properties of the data noise. This is generally realized by imposing certain assumptions on the noise distribution, leading to specific loss functions between the noisy image and the denoising result. Two traditional noise priors are the Gaussian prior [1,2,21,24] (L2-norm loss) and the Laplacian prior [5,36,37] (L1-norm loss), which are widely used for suppressing dense noise and sparse noise (outliers), respectively. A combination of Gaussian and Laplacian priors [12,27,29,38] (L1 + L2 loss) is commonly considered in mixed noise removal. However, these priors are generally not flexible enough to fit the noise in real applications, whose distribution is much more complicated than Gaussian/Laplacian or a simple mixture of the two. To handle such complex noise, several works model the noise with a mixture of Gaussians (MoG) distribution [3,4,8] (weighted-L2 loss), due to its universal approximation capability for any continuous probability density function [39]. Later, MoG was generalized to a mixture of exponential power (MoEP) distributions [6] (weighted-Lp loss) for further flexibility and adaptivity. Despite the sophistication of the above priors, they all assume that the noise is independent and identically distributed (i.i.d.), which limits their ability to handle realistic noise with non-i.i.d. statistical structure. In remote sensing images, the noise across different bands typically exhibits evident distinctions in configuration and magnitude. To encode such noise characteristics, recent works impose non-i.i.d. assumptions on the noise distribution, such as non-i.i.d. MoG [7] and the Dirichlet process Gaussian mixture [9,40], resulting in better noise fitting capability and higher denoising accuracy.
Some attempts have been made to combine the advantages of recent developments in image characterization and noise modeling. To the best of our knowledge, only a few such studies exist, summarized as follows. Chen et al. [41] proposed a robust tensor factorization method based on CP decomposition and the MoG noise assumption. However, their model does not consider the uncertainty of latent variables, such as the CP factor matrices and the MoG parameters, and is thus prone to overfitting due to point estimation of these variables by optimization-based approaches. To overcome this defect, Luo et al. [42] formulated robust CP decomposition with the MoG noise assumption as a full Bayesian model, in which all latent variables are given prior distributions and inferred under a variational Bayesian framework. Considering that CP decomposition cannot well capture the correlations along different tensor modes, Chen et al. [43] further integrated Tucker decomposition and MoG noise modeling into a generalized robust tensor factorization framework. However, this method also suffers from the overfitting problem and requires some critical hyperparameters to be manually specified, such as the tensor rank and the number of MoG components. Moreover, all the above methods impose an i.i.d. assumption on the data noise, which still underestimates the complexity of realistic noise and thus leaves room for further improvement.
To overcome the aforementioned issues, in this paper we propose a new remote sensing image denoising method that takes into consideration the intrinsic properties of both remote sensing images and realistic noise. The main contributions of this work are summarized below.

• We formulate the image denoising problem as a full Bayesian generative model, in which a low-Tucker-rank image prior is exploited to characterize the intrinsic low-rank tensor structure of the underlying image, and a non-i.i.d. MoG noise prior is adopted to encode the complex and distinct statistical structures of the embedded noise.

• We design a variational Bayesian algorithm for an efficient solution to the proposed model, where each variable can be updated in closed form. Moreover, we develop adaptive strategies for the selection of the involved hyperparameters, to make our algorithm free from burdensome hyperparameter-tuning.

• We conduct extensive denoising experiments on both simulated and real MSIs/HSIs, and the results show the superiority of the proposed method over the compared state-of-the-art ones.
The rest of the paper is organized as follows. Section 2 introduces some notation used throughout the paper. Section 3 describes the proposed model and the corresponding variational inference algorithm. Section 4 presents experimental results and discussions. Section 5 concludes the paper.

Notation
We use boldface Euler script letters for tensors, e.g., A, boldface uppercase letters for matrices, e.g., A, boldface lowercase letters for vectors, e.g., a, and lowercase letters for scalars, e.g., a. In particular, we use I, 0, and 1 for identity matrices, arrays of all zeros, and arrays of all ones, respectively. We use a pair of lowercase and uppercase letters for an index and its upper bound, e.g., i = 1, . . . , I. We use Matlab expressions to denote elements and subarrays, e.g., a(i), A(i, :), and A(i, :, :).
Given a tensor A ∈ R^{I_1×···×I_D} (A reduces to a matrix when D = 2 or a vector when D = 1), the Frobenius norm and the 1-norm of A are, respectively, defined as ‖A‖_F := (Σ_{i_1,…,i_D} A(i_1,…,i_D)^2)^{1/2} and ‖A‖_1 := Σ_{i_1,…,i_D} |A(i_1,…,i_D)|. For a given dimension d ∈ {1, …, D}, the mode-d unfolding of A is denoted as unfold_d(A) or, more compactly, as A_(d), whose size is I_d × (I_1 ⋯ I_{d−1} I_{d+1} ⋯ I_D). The inverse process is denoted as fold_d(A_(d)) := A. More precisely, the tensor element A(i_1, …, i_D) maps to the matrix element A_(d)(i_d, j) with j = 1 + Σ_{k≠d} (i_k − 1) J_k and J_k := Π_{m<k, m≠d} I_m; see [19] for more details. The Tucker rank of A is defined as the vector consisting of the ranks of its unfoldings, i.e., rank(A) := (rank(A_(1)), rank(A_(2)), …, rank(A_(D))).
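The unfolding and folding operators can be sketched as follows; the exact column ordering of the unfolding differs across references [19], and this sketch fixes one consistent convention so that fold_d is the exact inverse of unfold_d.

```python
import numpy as np

def unfold(A, d):
    """Mode-d unfolding A_(d): move axis d to the front and flatten the
    remaining axes, giving an I_d x (prod of the other dims) matrix."""
    return np.moveaxis(A, d, 0).reshape(A.shape[d], -1)

def fold(M, d, shape):
    """Inverse of unfold: rebuild the tensor of the given shape from A_(d)."""
    rest = [s for i, s in enumerate(shape) if i != d]
    return np.moveaxis(M.reshape([shape[d]] + rest), 0, d)

A = np.arange(24).reshape(2, 3, 4)
M1 = unfold(A, 1)                 # a 3 x 8 matrix
```

With this convention, the Tucker rank of A is simply `tuple(np.linalg.matrix_rank(unfold(A, d)) for d in range(A.ndim))`.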
Additional notation is defined where it occurs.

Tucker Rank Minimization with Non-i.i.d. MoG Noise Modeling
This section is divided into three parts. Section 3.1 formulates the denoising problem as a full Bayesian model named Tucker rank minimization with non-i.i.d. MoG noise modeling (NMoG-Tucker). Section 3.2 presents a variational inference algorithm for solving the proposed model. Section 3.3 discusses the selection of hyperparameters involved in our model.

Bayesian Model Formulation
Let Y, X ∈ R^{I×J×K} denote the noisy image and the underlying clean image, respectively. To characterize the low-Tucker-rank prior of remote sensing images, we consider the following low-rank matrix factorization of the mode-d unfoldings of Y (d = 1, 2, 3):

Y_(d) = U_d V_d^T + N_d,   (1)

where U_d ∈ R^{I_d×R_d} and V_d ∈ R^{J_d×R_d} are factor matrices (I_d and J_d denote the row and column sizes of Y_(d), and the rank estimates R_d are allowed to differ, considering that the low-rankness degrees along different modes are generally not the same), and the noise tensor is N_d := fold_d(N_d) ∈ R^{I×J×K}. We assume that each element in the k-th band of N_d follows a MoG distribution

p(N_d(i, j, k)) = Σ_{l=1}^{L_d} Π_d(k, l) N(N_d(i, j, k) | 0, τ_d(l)^{−1}),   (2)

where L_d is the number of Gaussian components, Π_d(k, :) ∈ R^{L_d} is the mixing proportion satisfying Π_d(k, l) > 0 and Σ_{l=1}^{L_d} Π_d(k, l) = 1, and τ_d ∈ R^{L_d} contains the precisions of the Gaussian components. By introducing an indicator variable Z_d ∈ {0, 1}^{I×J×K×L_d}, we rewrite (2) as the following two-level generative process [44]:

N_d(i, j, k) ~ Π_{l=1}^{L_d} N(N_d(i, j, k) | 0, τ_d(l)^{−1})^{Z_d(i,j,k,l)},   (3)
Z_d(i, j, k, :) ~ Multinomial(Z_d(i, j, k, :) | Π_d(k, :)),   (4)

where Z_d(i, j, k, :) ∈ {0, 1}^{L_d} with Σ_{l=1}^{L_d} Z_d(i, j, k, l) = 1 follows a multinomial distribution parameterized by Π_d(k, :). Then, we impose conjugate priors on τ_d and Π_d to obtain a complete Bayesian model:

τ_d(l) ~ Gamma(τ_d(l) | a_0, b_0),  Π_d(k, :) ~ Dirichlet(Π_d(k, :) | α_0 1),   (5)

where Gamma(·|a_0, b_0) denotes the Gamma distribution with parameters a_0 and b_0, and Dirichlet(·|α_0 1) denotes the Dirichlet distribution parameterized by α_0 1 ∈ R^{L_d}. The proposed prior can characterize the following intrinsic properties of realistic noise.
• First, noise in each band exhibits complex statistical properties, which cannot be well captured by simple distributions such as Gaussian or Laplacian. We model the noise in each band by an i.i.d. MoG distribution, which is a universal approximator to any continuous distribution.

• Second, noise across different bands is non-identical in terms of structure and extent, due to sensor malfunctions and atmospheric conditions. This band-noise-distinctness nature is encoded by the band-dependent mixing proportions in MoG, leading to a non-i.i.d. noise distribution.

• Third, there is a strong correlation among the noise distributions of all bands, since real-life noise corruption is generally attributed to only a few main factors. In the proposed prior, the noise correlation is reflected by the fact that the MoG distributions of different bands share the same set of Gaussian components.
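The three properties above can be made concrete by sampling from the assumed generative process; in this sketch (dimensions and parameter values are ours, chosen for illustration), all bands share the precision vector tau, while each band draws its own mixing proportions.

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K, L = 8, 8, 3, 2                    # spatial sizes, bands, components
tau = np.array([1e4, 25.0])                # precisions shared by all bands
Pi = rng.dirichlet(np.ones(L), size=K)     # band-wise mixing proportions, K x L

# Two-level generative process: draw a component indicator per pixel from the
# band's own proportions, then draw zero-mean Gaussian noise at that precision.
Z = np.stack([rng.choice(L, size=(I, J), p=Pi[k]) for k in range(K)], axis=2)
N = rng.standard_normal((I, J, K)) / np.sqrt(tau[Z])
```

Bands with a larger proportion on the low-precision component come out visibly noisier, mimicking the band-noise-distinctness of real sensors.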
Prior of the factor matrices U_d and V_d. Inspired by the sparse Bayesian learning principle [45], we assume that the columns of U_d and V_d are generated from the following Gaussian distributions:

U_d(:, r) ~ N(U_d(:, r) | 0, γ_d(r)^{−1} I),  V_d(:, r) ~ N(V_d(:, r) | 0, γ_d(r)^{−1} I),  r = 1, …, R_d,   (6, 7)

where γ_d ∈ R^{R_d} denotes precisions following the conjugate prior

γ_d(r) ~ Gamma(γ_d(r) | a_0, b_0).   (8)

Prior of the solution X. Given the learned low-rank components along the three modes, we assume that each element in X is generated from the following weighted multiplication of Gaussian distributions:

p(X(i, j, k)) = c Π_{d=1}^{3} N(X(i, j, k) | fold_d(U_d V_d^T)(i, j, k), ξ^{−1})^{w(d)},   (9)

where ξ denotes the precision of the Gaussian distributions, w ∈ R^3 contains the weights of the three modes satisfying w(d) > 0 and Σ_{d=1}^{3} w(d) = 1, and c is a normalization constant.

Full Bayesian model and posterior. We construct a full Bayesian model by combining (1)-(9); the corresponding graphical model is shown in Figure 1. The goal is then to infer the posterior of all involved variables,

p({U_d, V_d, γ_d, τ_d, Z_d, Π_d}_{d=1}^{3}, X | Y).   (10)

In Figure 1, hollow nodes, shadowed nodes, and small solid nodes denote unobserved variables, observed data, and hyperparameters, respectively; a solid arrow from node a to node b indicates the explicit conditional distribution p(b|a); a dashed arrow from node a to node b implies that b is implicitly conditioned on a; the box is a compact representation indicating that there are three sets of variables corresponding to the three tensor modes.
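A useful consequence of the weighted-multiplication prior on X: since the weights sum to one and all factors share the precision ξ, the product is itself Gaussian with precision ξ and mean equal to the weighted average of the three mode-wise reconstructions. The numerical check below (with hypothetical means and weights) verifies that the weighted sum of log-densities differs from the log-density at the averaged mean only by a constant in x, which is absorbed into the normalization constant c.

```python
import numpy as np

def log_gauss(x, m, xi):
    """Log-density of N(x | m, 1/xi)."""
    return 0.5 * np.log(xi / (2 * np.pi)) - 0.5 * xi * (x - m) ** 2

m = np.array([0.3, 0.5, 0.9])            # mode-wise means (hypothetical)
w = np.array([0.2, 0.3, 0.5])            # weights, summing to one
xi = 4.0
x = np.linspace(-1, 2, 7)

lhs = sum(w[d] * log_gauss(x, m[d], xi) for d in range(3))
rhs = log_gauss(x, w @ m, xi)
diff = lhs - rhs                         # should be constant in x
```

Expanding the quadratic terms shows diff = -xi/2 (Σ_d w(d) m_d^2 − (Σ_d w(d) m_d)^2), which does not depend on x.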
Optimization-based interpretation. From an optimization perspective, maximizing the posterior (10) is equivalent to minimizing its negative logarithm, which, up to constants, takes the form

min Σ_{d=1}^{3} { w(d) ξ ‖X_(d) − U_d V_d^T‖_F^2 + ‖unfold_d(H_d)^{1/2} ⊙ (Y_(d) − U_d V_d^T)‖_F^2 + tr(U_d diag(γ_d) U_d^T) + tr(V_d diag(γ_d) V_d^T) } + R({γ_d, τ_d, Z_d, Π_d}_{d=1}^{3}),   (12)

where H_d collects the element-wise noise precision estimations, ⊙ denotes the element-wise multiplication, diag(γ_d) denotes the diagonal matrix with the elements of γ_d on its main diagonal, and R(·) collects the terms induced by the remaining priors. Below we illustrate the origin and the effect of each term in (12).

• The first ℓ2-norm term is derived from the weighted multiplication of Gaussians prior on the solution X (9). It forms X by penalizing the Euclidean distances between the unfoldings X_(d) and the low-rank reconstructions U_d V_d^T.

• The second weighted-ℓ2-norm term is derived from the non-i.i.d. MoG prior on the noise N_d (3). It serves as a spatially varying loss function that suppresses the noise according to the local noise level estimations embedded in the weight matrix unfold_d(H_d).

• The third and the fourth weighted-ℓ2-norm terms are derived from the Gaussian priors on the factor matrices U_d and V_d (6, 7). They promote the joint group sparsity of the corresponding columns of U_d and V_d.

The remaining terms are derived from the priors on the variables {γ_d, τ_d, Z_d, Π_d} and provide them with suitable regularization.
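To make the correspondence concrete, the sketch below evaluates the main term types of (12) for a single mode on random data (a schematic illustration with hypothetical variables, not the full algorithm); note that the trace terms reduce to column-wise weighted norms, which is what drives group sparsity of the factor columns.

```python
import numpy as np

rng = np.random.default_rng(2)
I1, J1, R = 10, 12, 3
Y1 = rng.standard_normal((I1, J1))          # mode-1 unfolding of the noisy image
U = rng.standard_normal((I1, R))
V = rng.standard_normal((J1, R))
X1 = U @ V.T + 0.01 * rng.standard_normal((I1, J1))   # current estimate of X_(1)
H1 = rng.random((I1, J1))                   # element-wise noise precision weights
gamma = rng.random(R) + 0.1                 # column precisions

fidelity = np.sum((X1 - U @ V.T) ** 2)              # l2 term tying X to U V^T
noise_fit = np.sum(H1 * (Y1 - U @ V.T) ** 2)        # weighted-l2 noise term
factor_reg = (np.trace(U @ np.diag(gamma) @ U.T)
              + np.trace(V @ np.diag(gamma) @ V.T)) # priors on the factors
```

The identity tr(U diag(γ) U^T) = Σ_r γ(r) ‖U(:, r)‖^2 shows that a large γ(r) shrinks the r-th columns of both factors jointly, eventually pruning that rank component.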

Approximate Variational Inference
We use the variational Bayesian (VB) method [44] to obtain an approximate inference of the posterior (10), since the exact solution is computationally intractable. Below we briefly introduce the general framework of VB, and then present the inference results for our model.
General framework of VB. Denoting by θ the unobserved variables and by D the observed data, VB seeks a variational distribution q(θ) to approximate the true posterior p(θ|D) by minimizing the Kullback-Leibler (KL) divergence between q and p, i.e.,

min_{q∈C} KL(q(θ) ‖ p(θ|D)),   (13)

where C imposes certain restrictions on q to make the minimization tractable. In general, q is restricted to have the factorization q(θ) = Π_i q_i(θ_i), where {θ_i} are disjoint groups of the variables in θ.
Under this assumption, one can approach the solution to (13) in an iterative way, by alternately minimizing KL(q ‖ p) with respect to each q_i(θ_i) while keeping the others fixed. More precisely, each q_i admits the closed-form solution

q_i(θ_i) ∝ exp( ⟨ln p(θ, D)⟩_{θ\θ_i} ),   (14)

where ⟨·⟩_{θ\θ_i} denotes the expectation with respect to q over all variables except θ_i.

Factorized form of the approximate posterior q. We assume that the approximation of the posterior (10) has the following factorization (the subscripts of q are omitted without confusion):

q(θ) = q(X) Π_{d=1}^{3} q(U_d) q(V_d) q(γ_d) q(τ_d) q(Z_d) q(Π_d).   (15)

According to (14), each factor in (15) stays in the conjugate family of its prior and can be inferred analytically: the rows of U_d and V_d follow Gaussian posteriors with closed-form means and covariance matrices (16, 17); each element of γ_d follows a Gamma posterior with parameters c_{γ_d} ∈ R and d_{γ_d}(r) ∈ R; each element of X follows a Gaussian posterior with mean μ_X(i, j, k) ∈ R (19); each element of τ_d follows a Gamma posterior with parameters a_{τ_d}(l) ∈ R and b_{τ_d}(l) ∈ R; each row of Π_d follows a Dirichlet posterior with parameter α_{Π_d}(k, :) ∈ R^{L_d}; and each mode-4 fiber of Z_d follows a multinomial posterior with parameter ρ_{Z_d}(i, j, k, :) ∈ R^{L_d} (22). The corresponding parameter updates follow from standard conjugate computations.

Pseudo-code and complexity analysis. The pseudo-code of the overall algorithm is summarized in Algorithm 1. The total complexity per iteration is approximately O(Σ_{d=1}^{3} IJK(R_d^2 + L_d)), where the first term is due to calculating the covariance matrices of {U_d, V_d}_{d=1}^{3} (16, 17), and the second term is due to calculating the parameters of {Z_d}_{d=1}^{3} (22). Since, in general, it holds that R_d ≪ min(I_d, J_d), the complexity of our algorithm depends linearly on the size of the input data.
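As an illustration of one such closed-form update, the sketch below computes the multinomial parameters of q(Z) at a single pixel, following the standard conjugate MoG update shape (the exact expectations in our model are analogous); a large residual is assigned to the low-precision, heavy-noise component.

```python
import numpy as np

def responsibility(e, ElnPi, Elntau, Etau):
    """Mean-field update for the indicator posterior q(Z) at one pixel with
    residual e, of the standard conjugate-MoG form
    rho_l ∝ exp(<ln Pi_l> + <ln tau_l>/2 - <tau_l> e^2 / 2),
    with expectations taken under the current q(Pi) and q(tau)."""
    log_rho = ElnPi + 0.5 * Elntau - 0.5 * Etau * e ** 2
    log_rho -= log_rho.max()             # subtract max for numerical stability
    rho = np.exp(log_rho)
    return rho / rho.sum()

# A large residual should be explained by the low-precision component (index 1)
r = responsibility(e=2.0,
                   ElnPi=np.log(np.array([0.5, 0.5])),
                   Elntau=np.log(np.array([100.0, 1.0])),
                   Etau=np.array([100.0, 1.0]))
```

These responsibilities are exactly what populates the weight matrix H_d in the optimization-based interpretation: each pixel's effective precision is the responsibility-weighted average of the component precisions.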

Selection of Hyperparameters
This section is devoted to the selection of hyperparameters involved in our model. We develop adaptive strategies to learn their values using the results of the current iteration, which makes our algorithm free from burdensome hyperparameter-tuning.
Selection of {R_d}_{d=1}^{3}. The hyperparameter R_d is an estimation of the mode-d Tucker rank of the solution. Since the true rank is often unknown in practice, we design an adaptive rank estimation strategy to improve the applicability of our method. The main idea consists of initializing R_d with a large value and then decreasing it gradually by dropping singular values smaller than an adaptive threshold s_d^(t). More precisely, denoting by σ_d the singular values of U_d V_d^T in decreasing order (‖·‖_2 denotes the spectral norm, i.e., the largest singular value), at iteration t we drop the trailing singular values that fall below s_d^(t); the threshold is updated adaptively, bounded from below via e_lower ∈ (0, 1) and constrained so that the dropped singular values carry at most an e_upper ∈ (0, 1) fraction of the total energy (25).

We make some comments on the proposed rank estimation strategy. First, the dropped singular values in each iteration carry at most 1% of the energy of U_d V_d^T under the default setting e_upper = 10^{−2}, leading to a robust rank decreasing process. Second, the threshold tends to decrease if a rank reduction occurs, which avoids underestimating the true rank. Third, the threshold tends to increase if it is too small to trigger a rank reduction, which avoids overestimating the true rank.

Selection of w. The hyperparameter w assigns the relative weights of the three modes in the prior (9) and the posterior (19) of X. We assume a positive correlation between w(d) and the low-rankness degree of X_(d), i.e., the sparser the singular values of X_(d) are, the larger w(d) is. To measure the sparsity of singular values, we use the Gini index [46] (here we take G := 1 − G′, where G′ is the definition of the Gini index in [46]), computed from a vector a ∈ R^I of nonnegative elements sorted in decreasing order. The Gini index takes positive values, and smaller values indicate better sparsity. Then, at the t-th iteration, we choose w^(t)(d) to be inversely related to the Gini index of the singular values of X_(d), normalized by a constant c so that Σ_{d=1}^{3} w^(t)(d) = 1; the normalization makes w measure the relative, rather than absolute, low-rankness degree. Selection of ξ.
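The rank decreasing step can be sketched as follows (function and variable names are ours); singular values below the threshold are dropped only when their total energy respects the e_upper bound.

```python
import numpy as np

def shrink_rank(sigma, threshold, e_upper=1e-2):
    """Drop trailing singular values below `threshold`, but only if their
    energy stays within an e_upper fraction of the total; otherwise keep
    the current rank. A sketch of the adaptive rank estimation above."""
    sigma = np.sort(sigma)[::-1]
    keep = sigma >= threshold
    dropped_energy = np.sum(sigma[~keep] ** 2)
    if dropped_energy > e_upper * np.sum(sigma ** 2):
        return len(sigma)                # dropping would be too aggressive
    return int(keep.sum())

sigma = np.array([10.0, 9.0, 8.0, 0.1, 0.05])
```

Here `shrink_rank(sigma, 1.0)` drops the two tiny singular values (their energy is far below 1% of the total), while `shrink_rank(sigma, 20.0)` refuses to drop anything because that would discard nearly all of the energy.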
The hyperparameter ξ is the precision of the Gaussian distributions in the prior (9) and the posterior (19) of X; it penalizes the distances between X and the low-rank components fold_d(U_d V_d^T). For stability purposes, we initialize ξ with a small value and increase it gradually until the convergence of X. More precisely, at the t-th iteration, ξ^(t) is set from an auxiliary hyperparameter ξ_0, which is itself updated as the iterations proceed.

Selection of {L_d}_{d=1}^{3}. The hyperparameter L_d is the number of Gaussian components in the MoG prior of the noise N_d (2). To adaptively fit the noise distribution, we initialize L_d with a relatively large value and iteratively decrease L_d to L_d − 1 whenever there exist two analogous Gaussian components, i.e., components whose estimated precisions and mixing proportions are sufficiently close.

Selection of {a_0, b_0, α_0}. We simply fix the remaining hyperparameters a_0, b_0, and α_0 to 10^{−6} in a non-informative manner, to minimize their impact on the inference process [44]. Our method performs stably well in all experiments under these simple settings.
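The component-reduction rule for L_d can be sketched as follows, assuming a simple relative-precision similarity test (the paper's exact criterion for "analogous" components may differ); the merged component pools the mixing proportions so that each row of Π_d still sums to one.

```python
import numpy as np

def merge_components(Pi, tau, tol=0.1):
    """Reduce L to L-1 when two zero-mean Gaussian components have close
    precisions (relative difference under tol); the kept component pools
    both mixing proportions. A sketch of the adaptive L_d reduction."""
    L = len(tau)
    for a in range(L):
        for b in range(a + 1, L):
            if abs(tau[a] - tau[b]) / max(tau[a], tau[b]) < tol:
                new_Pi = Pi.copy()
                new_Pi[:, a] += new_Pi[:, b]        # pool proportions into a
                return np.delete(new_Pi, b, axis=1), np.delete(tau, b)
    return Pi, tau                                   # nothing to merge

Pi = np.array([[0.5, 0.3, 0.2]])
tau = np.array([100.0, 98.0, 1.0])
Pi2, tau2 = merge_components(Pi, tau)
```

In this toy call, the first two components (precisions 100 and 98) are analogous and get merged, leaving two components.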

Numerical Experiments
We evaluate the denoising performance of the proposed NMoG-Tucker method on synthetic data, MSIs, and HSIs. Table 1 lists six state-of-the-art competing methods based on low-rank matrix/tensor approximation: the matrix-based methods LRMR [38], MoG-RPCA [4], and NMoG-LRMF [7], and the tensor-based methods LRTA [24], PARAFAC [21], and KBR-RPCA [35]. In Table 1, the tensor notation follows [19], and • denotes the vector outer product. Parameters involved in all competing methods are set to their default values or manually tuned for the best possible denoising performance. All experiments are conducted under Windows 10 and Matlab R2016a (Version 9.0.0.341360) on a desktop with an Intel(R) Core(TM) i7-8700K CPU at 3.70 GHz and 32 GB memory.

Table 1. Competing methods with their data and noise priors.

Competing Method | Data Prior | Noise Prior
LRMR [38] | Matrix rank constraint: rank(X_(3)) ≤ r | Gaussian + sparse
MoG-RPCA [4] | Low-rank matrix factorization | MoG
NMoG-LRMF [7] | Low-rank matrix factorization | Non-i.i.d. MoG
LRTA [24] | Tucker decomposition | Gaussian
PARAFAC [21] | CP decomposition: X = Σ_r λ_r U_1(:, r) • U_2(:, r) • U_3(:, r) | Gaussian
KBR-RPCA [35] | Kronecker-basis-representation | Gaussian + sparse

We conduct both simulated and real denoising experiments. In simulated experiments, the noisy data are generated by adding synthetic noise to the original data, and the denoising performance is evaluated by both quantitative measures and visual quality. In real experiments, the goal is to recover real-world data without knowing the ground truths, and the denoising results are mainly judged by visual quality.
In simulated experiments, we use the following four quantitative measures: relative error (ReErr), erreur relative globale adimensionnelle de synthèse (ERGAS) [47], mean of peak signal-to-noise ratio (MPSNR), and mean of structural similarity (MSSIM) [48]. Denoting by X_res ∈ R^{I×J×K} an estimation of the original data X_ori ∈ R^{I×J×K}, the four measures of X_res with respect to X_ori are defined as follows:

ReErr(X_res, X_ori) := ‖X_res − X_ori‖_F / ‖X_ori‖_F,
ERGAS(X_res, X_ori) := 100 ( (1/K) Σ_{k=1}^{K} MSE_k / μ_k^2 )^{1/2},
MPSNR(X_res, X_ori) := (1/K) Σ_{k=1}^{K} 10 log_10( MAX_k^2 / MSE_k ),
MSSIM(X_res, X_ori) := (1/K) Σ_{k=1}^{K} SSIM(X_res(:, :, k), X_ori(:, :, k)),

where MSE_k and μ_k are the mean squared error and the mean intensity of the k-th band, respectively, MAX_k is the maximum intensity of X_ori(:, :, k), and the details of SSIM can be found in [48]. In general, better denoising results have smaller ReErr and ERGAS values and larger MPSNR and MSSIM values.
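Under the definitions above, ReErr, MPSNR, and ERGAS can be computed as below (SSIM is omitted; see [48]); the ERGAS form here assumes equal weighting of all bands.

```python
import numpy as np

def re_err(X_res, X_ori):
    """Relative error in Frobenius norm."""
    return np.linalg.norm(X_res - X_ori) / np.linalg.norm(X_ori)

def mpsnr(X_res, X_ori, peak=1.0):
    """Mean over bands of PSNR_k = 10 log10(peak^2 / MSE_k)."""
    mse = np.mean((X_res - X_ori) ** 2, axis=(0, 1))
    return float(np.mean(10 * np.log10(peak ** 2 / mse)))

def ergas(X_res, X_ori):
    """100 * sqrt(mean over bands of MSE_k / mu_k^2), mu_k the band mean."""
    mse = np.mean((X_res - X_ori) ** 2, axis=(0, 1))
    mu = np.mean(X_ori, axis=(0, 1))
    return float(100 * np.sqrt(np.mean(mse / mu ** 2)))
```

For a constant-one reference image shifted by 0.1 in every entry, these give ReErr = 0.1, MPSNR = 20 dB, and ERGAS = 10, which matches hand calculation.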

Synthetic Data Denoising
This section presents simulated experiments on synthetic data denoising. The original data are random low-rank tensors generated by the Tucker model with size 50 × 50 × 50 and rank (R_1, R_2, R_3), i.e., X_ori := C ×_1 U_1 ×_2 U_2 ×_3 U_3, where the core tensor C ∈ R^{R_1×R_2×R_3} and each factor matrix U_d ∈ R^{50×R_d} (d = 1, 2, 3) are drawn from the standard Gaussian distribution. We consider two rank settings, (10, 10, 10) and (20, 15, 10). The original data are normalized to have unit mean absolute value, i.e., ‖X_ori‖_1 / 50^3 = 1. We test the following kinds of synthetic noise.
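The synthetic data generation can be sketched as follows; the mode products are implemented with tensordot, and the final scaling enforces the unit mean absolute value used in the text.

```python
import numpy as np

def tucker(C, U):
    """Mode products X = C x1 U1 x2 U2 x3 U3 via successive contractions."""
    X = C
    for d, Ud in enumerate(U):
        # contract Ud's second axis with X's d-th axis, then restore axis order
        X = np.moveaxis(np.tensordot(Ud, X, axes=(1, d)), 0, d)
    return X

rng = np.random.default_rng(3)
ranks, size = (10, 10, 10), 50
C = rng.standard_normal(ranks)
U = [rng.standard_normal((size, r)) for r in ranks]
X = tucker(C, U)
X /= np.mean(np.abs(X))                  # normalize to unit mean absolute value
```

By construction, each mode-d unfolding of X has rank at most R_d, which is exactly the structure the rank estimation strategy is expected to recover.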
• Gaussian + sparse noise: 80% of the entries are mixed with Gaussian noise N(0, 0.1^2) and 20% with additive uniform noise on [−5, 5].

• Mixture noise: 40% of the entries are mixed with Gaussian noise N(0, 0.01^2), 20% with Gaussian noise N(0, 0.2^2), 20% with additive uniform noise on [−5, 5], and 20% are missing (the locations of the missing entries are not given as prior knowledge).

Table 2 reports the ReErr values and execution time of different methods on synthetic data denoising, where every result is an average over ten trials with different realizations of both data and noise. Regarding the denoising accuracy, our method consistently attains comparable or lower ReErr values than the competing methods, and its superiority becomes more significant as the noise complexity increases. Regarding the computational speed, LRTA is generally the fastest method, while our method is the slowest in all cases. The relatively high cost of our algorithm is mainly due to two facts: computing variables corresponding to all three modes requires three times more calculations than those in matrix-based methods, and updating the factor matrices row by row is much slower than updating them as a whole in other tensor-based methods. An acceleration of our implementation is left to future research.

MSI Denoising

Figure 2 shows the average PSNR and SSIM values across all bands of the denoising results by different methods. For easy observation of the details, we plot the differences between our results and the competing ones at larger scales. It can be observed that our method achieves comparable or better performance for most bands, while KBR-RPCA exhibits the best robustness over all bands.

Figure 3 shows two examples of MSI denoising under Gaussian noise and mixture noise. These figures suggest that the results by the competing methods generally retain some noise or alter image details, whereas our results exhibit higher visual quality in both noise removal and detail preservation.
For better visualization, we enlarge a certain patch and show the corresponding error map, which highlights the difference between the denoised patch and the original one. A close inspection reveals that our error maps contain less color information than the competing ones, indicating that our method better recovers the spatial-spectral structures of the original MSIs.

HSI Denoising

We then test on two HSI datasets, DCmall (160 bands) and Cuprite (89 bands) [25]. The intensity range is scaled to [0, 1]. To simulate the degradation scenarios in real-world HSIs, we test the following three kinds of synthetic noise.

• Gaussian noise: all entries are mixed with Gaussian noise N(0, 0.05^2). For DCmall, the SNR value of each band varies from 6 to 20 dB, and the mean SNR value of all 160 bands is 13.79 dB. For Cuprite, the SNR value of each band varies from 16 to 20 dB, and the mean SNR value of all 89 bands is 18.69 dB.
• Speckle noise: all bands are corrupted by non-i.i.d. speckle noise with signal-dependent intensity, which is simulated by multiplicative uniform noise with mean 1 and variance randomly sampled from [0.001, 0.5] for each band. For both DCmall and Cuprite, the SNR value of each band varies from 3 to 30 dB. The mean SNR value of all 160 bands in DCmall is 19.52 dB, and that of all 89 bands in Cuprite is 20.03 dB.

• Mixture noise: all bands are corrupted by non-i.i.d. Gaussian noise with zero mean and band-dependent variances, and the SNR value of each band is uniformly sampled from 10 to 20 dB. Then, we randomly choose 90/50 bands in DCmall/Cuprite to add complex noises: the first 40/20 bands are corrupted by stripe noise with stripe number between [20, 40] and stripe intensity between [−0.25, 0.25]; the middle 40/20 bands are corrupted by deadlines with line number between [5, 15]; 50% to 70% of the entries in the last 40/20 bands are corrupted by speckle noise with mean 1 and variance 0.3. Thus, each band is randomly corrupted by one to three types of noise. For both DCmall and Cuprite, the SNR value of each band varies from 4 to 20 dB. The mean SNR value of all 160 bands in DCmall is 11.62 dB, and that of all 89 bands in Cuprite is 12.04 dB.

Table 4 presents the quantitative performance of different methods on simulated HSI denoising, where every result is an average over five trials with different noise realizations. Compared with the competing methods, our method consistently yields better performance in terms of MPSNR, MSSIM, and ERGAS in all cases.

Figure 4 plots the average PSNR and SSIM values across all bands by different methods, as well as the differences between our results and the competing ones at larger scales. These results suggest that our method achieves leading quantitative performance for most bands. We also observe that the matrix-based competing methods LRMR, MoG-RPCA, and NMoG-LRMF suffer from sharp PSNR and SSIM drops at certain bands in the cases of speckle noise and mixture noise, e.g., bands 40-80 in DCmall and bands 40-60 in Cuprite. In comparison, our method does not exhibit such a phenomenon, which demonstrates its robustness over the entire HSI bands.

Figure 5 shows denoising examples on selected bands of DCmall and Cuprite. The noisy band in DCmall is severely contaminated by a mixture of Gaussian noise, deadlines, and speckle noise; the noisy band in Cuprite is overwhelmed by heavy speckle noise.
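The stripe component of the simulation protocol can be sketched as follows (our reading: a stripe is a constant offset added to a randomly chosen column, while a deadline would instead zero the chosen column out):

```python
import numpy as np

def add_stripes(band, n_stripes, lo=-0.25, hi=0.25, rng=None):
    """Add stripe noise: a constant offset drawn from [lo, hi] on each of
    n_stripes randomly chosen columns of a 2-D band."""
    if rng is None:
        rng = np.random.default_rng()
    out = band.copy()
    cols = rng.choice(band.shape[1], size=n_stripes, replace=False)
    out[:, cols] += rng.uniform(lo, hi, size=n_stripes)
    return out

rng = np.random.default_rng(4)
band = np.zeros((64, 64))
striped = add_stripes(band, 20, rng=rng)
```

Because each stripe is constant along a column, it is exactly the kind of structured, band-specific corruption that a per-band MoG component can absorb.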
We observe that the matrix-based methods LRMR, MoG-RPCA, and NMoG-LRMF, although adopting flexible noise priors, cannot adequately separate the original HSIs from such severe degradations, especially for Cuprite with fewer spectral bands. As for tensor-based methods, LRTA can hardly reduce the noise, while PARAFAC leaves all the deadlines. Their poor performance is due to the Gaussian noise assumption, which is not able to fit complex noise. In comparison, adopting more intrinsic data and noise priors, KBR-RPCA and our method yield satisfactory denoising results in both cases. Compared with KBR-RPCA, our method preserves finer HSI structures with less residual noise, which can be seen from the demarcated patches and the corresponding error maps. Our better performance is mainly attributed to the non-i.i.d. MoG noise prior, which has a better fitting capability than the Gaussian + sparse assumption in the RPCA framework.
Real HSI Denoising

Figure 6 shows a denoising example on band 220 of Indian Pines. One can see that the original band is overwhelmed by noise, with almost no useful information. From the denoising results of the competing methods, we observe that LRTA fails to handle such severe degradation; MoG-RPCA still leaves much noise in the whole image; LRMR, PARAFAC, and KBR-RPCA remove more noise but simultaneously lose tiny image details; NMoG-LRMF yields a visually satisfactory result but seems to produce false edges in the demarcated patches. In contrast, the proposed method outperforms the competing methods in terms of both noise removal and detail preservation.

Figure 7 presents a classification example on Indian Pines. This test provides a task-oriented evaluation of the denoising performance of different methods, from the perspective of the influence on the classification accuracy. In the ground-truth classification result, a total of 10249 samples are divided into 16 classes, and the number of samples in each class ranges from 20 to 2455. To conduct a supervised classification, we randomly choose ten samples from each class as training data and use the remaining samples in each class as testing data. Then, support vector machine (SVM) classification [50] is performed on the noisy image and its denoised versions by different methods, and the classification results are quantitatively evaluated by the overall accuracy (OA). It can be seen that noise corruption significantly limits the classification accuracy, and the classification results of the denoised HSIs are more or less improved since the noise is suppressed. Among all denoising methods, our method leads to the highest OA value, demonstrating its superiority in benefiting SVM classification.

Figure 8 shows a denoising example on band 99 of Urban under slight noise. In this example, the original band is mainly corrupted by several vertical stripes with intensity 0.01~0.02.
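The OA measure used in this test is simply the fraction of correctly classified test samples; the SVM itself is omitted here, and any classifier's predicted labels can be scored the same way:

```python
import numpy as np

def overall_accuracy(pred, truth):
    """OA: fraction of test samples whose predicted class matches the
    ground-truth class."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    return float(np.mean(pred == truth))
```

For example, `overall_accuracy([1, 2, 2, 3], [1, 2, 3, 3])` scores three correct labels out of four.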
To visually evaluate the denoising performance, we show color maps of the noise components estimated by different methods, which should highlight the underlying noise with as few image structures as possible. For better visualization, we also plot the corresponding vertical mean profiles. From these results, we observe that LRTA fails to recognize the stripes, while the other competing methods can detect the stripes but simultaneously remove structural information of the original image. In comparison, our method clearly extracts the stripes with very few image features, indicating a more accurate signal-noise separation.

Figure 9 displays a denoising example on band 206 of Urban under severe noise, including the noisy/denoised bands and the corresponding horizontal mean profiles. One can see that the original band is contaminated by a mixture of stripes, deadlines, and other complex noise, leading to rapid fluctuations in the horizontal mean profile. Regarding the denoising results of different methods, LRTA can hardly suppress the noise; LRMR, MoG-RPCA, PARAFAC, and KBR-RPCA still leave some horizontal stripes, and the corresponding curves show evident fluctuations; NMoG-LRMF removes the noise and produces a smooth curve, but it also introduces spectral distortions in certain regions, such as the red demarcated patch. Comparatively, our method effectively attenuates the noise while revealing the original spatial-spectral information, providing a better trade-off between noise removal and detail preservation.
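The mean-profile diagnostic used in these figures can be sketched as follows: averaging a band along the stripe direction smooths out clean structure and random noise, while stripes survive the averaging and appear as isolated spikes.

```python
import numpy as np

def vertical_mean_profile(band):
    """Column-wise mean of a 2-D band; vertical stripes show up as spikes
    in this profile, while clean image structure varies smoothly."""
    return band.mean(axis=0)

band = np.ones((32, 32))
band[:, 10] += 0.02                 # a faint vertical stripe on column 10
profile = vertical_mean_profile(band)
```

Swapping `axis=0` for `axis=1` gives the horizontal mean profile used for the horizontally striped band in Figure 9.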

Discussion
In Section 3.3, we have developed adaptive strategies for the selection of hyperparameters involved in our model. These strategies themselves also introduce additional hyperparameters, which are fixed as default settings in our experiments. This section discusses the selection of those hyperparameters and tests their effects on the denoising performance.
The selection of ε_upper and ε_lower. The hyperparameters ε_upper and ε_lower are introduced in the update formula (25) of the threshold s_d^(t). In general, a small ε_upper leads to a slow but stable rank-decreasing process; a large ε_upper makes this process fast but aggressive, increasing the risk of underestimating the true rank. On the other hand, ε_lower in (25) controls the lower bound of the threshold s_d^(t), which provides a mechanism for avoiding overestimating the true rank. Roughly speaking, larger values of ε_lower make it easier for our algorithm to reduce the rank. Note, however, that an excessively large ε_lower tends to underestimate the true rank; e.g., if one sets ε_lower = 1, then the rank-decreasing process cannot stop until the rank reduces to zero. Table 5 investigates the effects of ε_upper and ε_lower on the denoising performance of our method. This test is based on synthetic data denoising, and the original data are of size 50 × 50 × 50 with Tucker rank (20, 15, 10). We observe that our method yields rather stable ReErr values with exact estimation of the true rank under a wide range of settings of ε_upper and ε_lower. One exception is the mixture-noise case with ε_upper = 10^-1, where the true rank is underestimated, resulting in an evident increase in ReErr. Since our method is robust within a reasonable range of ε_upper and ε_lower, we choose ε_upper = 10^-2 and ε_lower = 2/3 as their default settings in all experiments.

Table 5. ReErr values and estimated ranks of our method under different settings of ε_upper and ε_lower. This test is based on synthetic data denoising, and the original data are of size 50 × 50 × 50 with Tucker rank (20, 15, 10). The best results are highlighted in bold.

The initialization of {L_d}_{d=1}^3. The hyperparameter L_d controls the number of Gaussian components in the MoG noise prior (2).
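The qualitative role of the lower bound ε_lower can be made concrete with a toy experiment. The sketch below is not the paper's update rule (25); it is a simplified stand-in that estimates the mode-d rank by keeping only singular values of the unfolding above ε_lower times the largest one, so a larger ε_lower prunes more aggressively (and ε_lower = 1 would drive the rank to zero, as noted above):

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic mode-d unfolding: a rank-10 matrix of size 50 x 2500, plus noise
U = rng.standard_normal((50, 10))
V = rng.standard_normal((10, 2500))
X = U @ V + 0.01 * rng.standard_normal((50, 2500))

sigma = np.linalg.svd(X, compute_uv=False)

def estimated_rank(sigma, eps_lower):
    # hypothetical stand-in for the threshold s_d^(t): keep singular values
    # above eps_lower * sigma_max; larger eps_lower prunes more aggressively
    threshold = eps_lower * sigma[0]
    return int(np.sum(sigma > threshold))

print(estimated_rank(sigma, 1e-2))  # moderate bound: recovers the true rank
print(estimated_rank(sigma, 2/3))   # overly large bound: underestimates it
```

The speed-of-decrease effect of ε_upper is not modeled here; this only illustrates why the default ε_lower must be large enough to discard noise-level singular values but well below 1.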
In Section 3.3, we have developed an adaptive strategy to reduce L_d from a large starting point to a value matching the noise complexity. However, it remains a problem to choose an appropriate initialization L_d^(0). Table 6 studies the effects of {L_d^(0)}_{d=1}^3 on the denoising performance of our method. This test is based on synthetic data denoising, and the original data are of size 50 × 50 × 50 with Tucker rank (10, 10, 10). From these results, we have the following two observations. First, as expected, the developed selection strategy can find suitable values of {L_d^(end)}_{d=1}^3 fitting the noise distribution. Second, our method performs poorly when {L_d^(0)}_{d=1}^3 is too small to provide sufficient noise fitting capability, while its performance tends to be stable once each L_d^(0) is larger than a reasonable value, e.g., 8. Therefore, we choose the default setting of {L_d^(0)}_{d=1}^3 as (8, 8, 8) in all experiments, since it is robust enough for most realistic noise.
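The automatic reduction of the number of mixture components can be illustrated with a standard variational mixture. The sketch below is not the paper's non-i.i.d. MoG updates; it uses scikit-learn's BayesianGaussianMixture to show the same effect on a 1-D noise sample: starting from a deliberately large L^(0) = 8, variational inference drives the weights of redundant components toward zero, leaving an effective component count that matches the noise complexity:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# synthetic noise drawn from a two-component Gaussian mixture
noise = np.concatenate([rng.normal(0.0, 0.05, 1500),
                        rng.normal(0.0, 0.50, 500)]).reshape(-1, 1)

# start from a deliberately large number of components, L0 = 8
gmm = BayesianGaussianMixture(
    n_components=8,
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1e-2,  # encourages pruning of unused components
    max_iter=500,
    random_state=0,
).fit(noise)

# components whose mixing weight stays non-negligible
effective = int(np.sum(gmm.weights_ > 0.02))
print(effective)
```

As long as L^(0) is at least as large as the true number of components, the surplus ones are pruned automatically, which is the behavior that makes the (8, 8, 8) default robust.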

Conclusions
We have proposed a new remote sensing image denoising method under the Bayesian framework. To achieve an effective and robust signal-noise separation, we have formulated the denoising problem as a full Bayesian generative model integrated with a low-Tucker-rank image prior and a non-i.i.d. MoG noise prior. The proposed model has the advantage of preserving the intrinsic low-rank tensor structure of remote sensing images, while exhibiting flexible fitting capability to realistic noise. For an efficient solution to the proposed model, we have designed a variational Bayesian algorithm to infer all involved variables by closed-form equations, as well as adaptive strategies for the selection of hyperparameters. Experimental results have shown that the proposed method is highly effective and superior to the competing methods on synthetic data, MSI, and HSI denoising. Future work includes accelerating the numerical implementation and incorporating more advanced image priors, such as nonlocal self-similarity and deep neural networks [51], to enhance the denoising performance.