Hyperspectral Image Restoration under Complex Multi-Band Noises

Zongsheng Yue 1, Deyu Meng 1 , Yongqing Sun 2 and Qian Zhao 1,* 1 Institute for Information and System Sciences and Ministry of Education Key Lab of Intelligent Networks and Network Security, Xi’an Jiaotong University, Xi’an 710049, China; zsy20110806207@stu.xjtu.edu.cn (Z.Y.); dymeng@xjtu.edu.cn (D.M.) 2 NTT Cyber Space Labs, Kanagawa 2360026, Japan; yongqing.sun@lab.ntt.co.jp * Correspondence: timmy.zhaoqian@gmail.com; Tel.: +86-138-9187-3702


Introduction
Hyperspectral images (HSIs) are collected by high spectral resolution sensors, and consist of hundreds of bands ranging from ultraviolet to infrared wavelengths.Due to the richness of spatial and spectral information, this type of image has been widely used in many remote sensing applications [1,2].However, HSIs inevitably suffer from various noise contamination in practice [3], which severely influences the image quality and limits the performance of the subsequent HSI processing tasks, such as unmixing [4], classification [5] and so on.Therefore, as a necessary pre-processing step of certain application tasks, HSI restoration is an important research topic and has attracted much attention in recent decades.
Regarding each band of HSIs as a gray image, we can simply use the traditional image denoising method to reduce noise in a band-by-band fashion, such as the sparse constraint methods [6][7][8][9], local polynomial approximation (LPA) methods [10][11][12] and so on.However, such methods ignore the intrinsic characteristics of HSIs, i.e., the spatial and spectral correlation among different bands, thus often resulting in unsatisfactory performance.To alleviate this issue, many methods have been proposed in order to better encode the spatial and spectral correlation knowledge of HSIs, e.g., [13][14][15][16].Besides, nonlocal similarity [17], anisotropic diffusion [18] and conditional random fields [19], have also been utilized for the HSI restoration task recently.
Inspired by the spectral correlation of HSIs, various methods based on low-rank matrix factorization (LRMF) have been proposed in the past few years.For example, Zhang et al. [20] used a rank-constrained RPCA [21] while Wu et al. [22] and Xie et al. [23] employed weighted nuclear norm minimization (WNNM) [24,25] to enhance the restoration quality.In addition, considering the local similarity within a patch and the non-local similarity across the patches in a group simultaneously, a novel group low-rank representation (GLRR) model was proposed in [26].Besides this kind of LRMF approach, a series of methods regarding HSIs as a 3-D tensor have also been proposed in recent years.Along this line, we can categorize these methods as tensor decomposition-based methods [27][28][29][30][31] and wavlet-based methods [32][33][34][35].
The current HSI restoration methods generally assume similar complexity of noise structures across all bands of an HSI dataset.In fact, most of them only assume simple independent and identical Gaussian/Laplacian noise on all HSIs, which implies that the noises on all bands share a similar noise type and extent.To make the approach better fit real noises with more complexity, some HSI restoration methods have been constructed to consider more complex noise beyond simple Gaussian/Laplacian [36,37].Additionally, He et al. [38] considered the noise as a mixture of Gaussian and Laplacian distribution while Cao et al. [39] modelled the noise using a more general mixture of exponential power (MoEP) distributions.However, these methods all assume the noise in each HSI band as an identical distribution.Recently, Chen et al. [40] proposed modelling the noise distribution of each band as a mixture of Gaussian (MoG) distributions with different parameters sharing with the unique conjugate prior.This method, however, still assumes the same complexity (i.e., the number of components) of MoGs in all HSI bands.
For most practical HSIs, however, the noise across different bands is always evidently distinct, both in terms of distribution type and corruption extent.As shown in Figure 1, the noises in the shown three bands of the Urban (http://lesun.weebly.com/hyperspectral-data-set.html)HSI dataset (Figure 1a) and the Terrain (https://www.agc.army.mil)HSI dataset (Figure 1b To address the aforementioned noise fitting issue, in this paper we propose a new HSI restoration method, which can more sufficiently encode such noise characteristics underlying HSIs.The main contribution of this work can be summarized as follows: 1. The noise in each HSI band is modelled by a Dirichlet process Gaussian mixture model (DP-GMM), in which the Gaussian components of MoG in each band are adaptively determined based on the specific noise complexity of this band.The distinctness of noise structures in different bands is thus able to be faithfully reflected by the model.

2.
By using the hierarchical Dirichlet process (HDP) technique, we correlate the noise of different bands through a sharing strategy, in which the noise parameters of each band share a high-level noise configuration of the entire HSI.

3.
A variational Bayes algorithm is readily designed to solve the model, and each of the involved parameters can be effectively updated in closed-form.
This paper is organized as follows.Section 2 provides some preliminaries on the utilized HDP technique.Section 3 presents the main model against the investigated problem, and the variational Bayes algorithm for solving it is given in Section 4. Experimental results on synthetic and real datasets are demonstrated in Section 5, and finally the paper is rounded up with a conclusion.

Preliminaries
In this section, we review some preliminary knowledge about the Dirichlet process and hierarchical Dirichlet process to set the stage for the presentation of our model later.
Besides, Ferguson [41] proved two important properties of the DP.The first is that any random sample G drawn from DP(γ, H) is a discrete distribution over Θ.The second property considers repeated draws θ 1 , θ 2 , . . ., θ n from this discrete distribution G, i.e., where these θ i s will exhibit a clustering property, i.e., they share repeated values with positive probability.In the DP-GMM, θ i represents the parameters (i.e., mean and variance) of the corresponding Gaussian component to which the i-th data point belongs.Thus, the clustering property of θ i s can adaptively determine the number of Gaussian components in MoG.
Sethuranman [42] provides an equivalent but more explicit representation of DP based on the stick-breaking construction.Considering two infinite collections of independent random variables, . ., the stick-breaking representation of G is as follows: where δ η * i represents the Dirac Delta function located at η * i .The stick-breaking construction depicts the DP from a different point of view, in which the clustering property of DP is realized by the mixing coefficient π k .The parameter γ controls the dispersion level of π k , and we simulate the stick-breaking process under different γ in Figure 2. It is easy to see that this stick-breaking process is able to determine the number of Gaussian components of MoG through adjusting the value of π k . (a)

Hierarchical Dirichlet Process
The hierarchical Dirichlet process (HDP) [43] is originally proposed as a hierarchical Bayesian model for natural language processing, which provides a flexible framework for sharing mixture components among groups of related data.A two-level HDP is a collection of DPs, which is also drawn from a higher-level DP with base distribution H, i.e., where j is an index for the data group.It should be noted that all distributions G j possess different parameters while sharing the distribution G.

Notation Explanation
For convenience of formulating our model, we first introduce some necessary notations to be used in the following sections.We use light lowercase letters, bold lowercase letters and bold uppercase letters to denote scalars, vectors and matrices, respectively.Given a matrix X, we use x •j to denote its j-th column vector, x i• to denote its i-th row vector, and x ij to denote the element in its i-th row and j-th column.For probability distributions, N (µ, Σ) represents the multivariate Gaussian distribution with mean vector µ and covariance matrix Σ, N (µ, ξ −1 ) represents the univariate Gaussian distribution with mean µ and precision ξ, Beta(a, b) represents the Beta distribution with parameters a and b, Gam(c, d) represents the Gamma distribution with parameters c and d, and Multi(π) represents the multinomial distribution with parameter π.

Model Formulation
Now, we give the formulations of our proposed model.Let us present the observed HSI data contaminated by noises as Y ∈ R d×B , where d = hw, h and w are the height and width of the investigated HSI, respectively, and B is the number of bands.By considering a generative model, we can decompose the observed data into where E denotes the noise term and X is the expected restorated HSI.

The DP-GMM Model
In this study, we model the noise of each band as a DP-GMM, which is an advanced version of the non-parametric GMM model capable of adaptively rectifying the number of Gaussian components of MoG based on the data [44,45].The noise complexity of each band can be adjusted by varying the number of components of MoG.What is more, in order to simultaneously encode the correlation among different bands, we associate the corresponding DP-GMM of each band at a higher level and conduct a two-level HDP.
Specifically, we consider the noise located at the i-th row and j-th column of the noise term E. Following the idea of [40,46], we can model it as an element-wise Gaussian mixture distribution: which can be equivalently reformulated as a two-level generative process: where π j is a vector of length K, 1 ≤ z ij ≤ K.Then, we place a hierarchical Dirichlet distribution prior on π j to associate the noise of different bands, i.e., Besides, ξ k is also generated from its conjugate prior, i.e., To allow the model to be flexible, we further assume K → +∞.Teh et al. [47] proved that the limit of Equations ( 5)-( 7) is the following HDP noise model: where H = Gam(a 0 , b 0 ).
The noise model of Equation ( 8) is too metaphysical to understand, so we reformulate it equivalently but more explicitly with two coupled latent variables z ij and c jt in a stick-breaking constructive manner, i.e., It is worth noting that the ξ in Equation ( 9) and the ψ in Equation ( 8) have the following relation: We can explain the hierarchical structure of the noise model (8) using Figure 3a intuitively.In the first level, we attempt to fit the noise of the entire HSI data by a DP-GMM.In the second level, we also fit the noise of each band by a DP-GMM, but each of its components is shared from the first level.This sharing strategy associates the noise of different bands, and the DP-GMM, which allows an MoG to have a limited but unbounded number of Gaussian components, adjusts the noise complexity adaptivity for each band.
Note that we adopt the truncation strategy [48] for the DP-GMMs in the variational inference of Section 4, which means that the number of Gaussian components of DP-GMMs in the first and second levels are fixed as K and T (K and T is large enough, T ≤ K) as shown in Figure 3a.Based on such a truncation strategy, z ij (1 ≤ z ij ≤ T) denotes the index of the Gaussian component from which e ij is generated in the DP-GMM of band j (1 ≤ j ≤ B), and c jt (1 ≤ c jt ≤ K, 1 ≤ t ≤ T) denotes the index of the Gaussian component of DP-GMM in the first level, with which the t-th Gaussian component of DP-GMM of band j in the second level is shared (c jt is marked in Figure 3a).
Finally, the concentration parameters α and γ in Equation ( 8) mainly affect the number of Gaussian components of the second level MoG in each band and the first level MoG for the entire HSIs dataset, respectively.In order to enhance the adaptability of the model complexity, we can place the Gamma distribution on them [49], i.e., γ ∼ Gam(c 0 , d 0 ), α j ∼ Gam(e 0 , f 0 ).(10) In the above formulations, a 0 , b 0 , c 0 , d 0 , e 0 , f 0 are hyperparameters of prior distributions, all of which can be easily set as small values to make the prior non-informative.

LRMF Model
As the conventional HSI restoration model, we assume that the expected HSI X in Equation ( 3) has a low-rank property.Based on such an assumption, we formulate X ∈ R d×B as the product of two smaller matrices U ∈ R d×l and V ∈ R B×l , i.e., where l ≤ B is fixed pre-manually.Furthermore, U and V can be generated from a multivariate zero-mean Gaussian distribution by the following method: where I d denotes the d × d identity matrix.Furthermore, the corresponding conjugate prior on each precision variable λ r is:

The Entire Graphical Model
Combining Equations ( 3), ( 9)-( 13), we can construct a full Bayesian model for HSI restoration.In brief, we call our model DP-GMM.The corresponding graphical model of our method is shown in Figure 3b.The goal turns to infer the posteriors of all involved variables: ξ, C, Z, β, π, λ, α, γ|Y). (

Variational Inference
Directly computing the posterior of Equation ( 14) under an HDP mixture prior is intractable, and thus approximate inference methods need to be designed.Although the Markov chain Monte Carlo (MCMC) sampling method can provide very accurate approximation to the posterior [50], this method is limited for massive scale data due to its high computational cost and difficulty to detect convergence.As a more efficient and deterministic alternative to MCMC, we adopt a Variational Bayes (VB) method [51] to solve our model in this paper.Given the observed data Y, maximizing the marginal log-likelihood p(Y) = p(Y|θ)p(θ)dθ directly is computationally intractable.VB attempts to seek a lower bound of the log-likelihood and maximize it instead, i.e., ln p(Y) = q(θ) ln p(Y|θ)p(θ) q(θ) dθ + KL(q(θ)||p(θ|Y))

L(q)
where KL(•||•) denotes the Kullback-Leibler (KL) divergence of two distributions.Assuming q(θ) = ∏ i q i (θ i ), we can obtain the closed-form solution to q j (θ j ) with other factors fixed through optimizing L(q) with respect to q j (θ j ), i.e., where E x [•] denotes the expectation over the variable x, and θ\θ j represents the set of θ with θ j removed.
The entire inference can be accomplished by alternatively computing Equation ( 16) with respect to all the involved variables.
Based on the stick-breaking construction and the truncation strategy [48], we can approximate the posterior distribution of Equation ( 14) with the following factorized form: , V , ξ, C, Z, β, π, λ, α, γ Then we can analytically infer all the factorized distributions involved in Equation ( 17) as below.The computational details are provided in the Supplementary Material.
Update Noise Component: The parameters involved in the noise components are ξ, C, Z, π, β.We use the stick-breaking procedure by setting a large enough value of K and T for truncated approximation, and then get the following updating equations on Z and C: where As for ξ, based on its conjugate priors, we have where Similarly, for π, β, we have where ϕ jts + γ.

Update the Concentration Parameters:
According to the conjugate priors of α and γ in Equation (10), we can obtain the following updating equations: q(α j ) = Gam(e j , f j ), q(γ) = Gam(c, d), where Update the LRMF Components: The parameters involved in the LRMF components are U, V , λ.For each row of U, using the factorization in Equation ( 11), we can obtain with mean µ u i• and covariance matrix Σ u i , given by where Similarly, for each column of V , we have and the mean µ v j• and the covariance matrix Σ v j are calculated by For parameter λ, we have q(λ r ) = Gam(p r , q r ), where

Settings of Hyperparameters:
We set all the hyperparameters involved in our model in a non-informative manner to minimize their impacts on the inference of the posterior distribution [51].Specifically, throughout our experiments, we easily set a 0 , b 0 , c 0 , d 0 , e 0 , f 0 , p 0 , q 0 as 10 −6 .Our proposed method behaves stably throughout all the experiments with this simple initialization choice.

Experimental Results
To demonstrate the effectiveness of the proposed DP-GMM model, we perform both synthetic and real HSI data experiments, and evaluate the experimental results quantitatively and visually.In order to comprehensively evaluate the performance of the proposed method, we compared the results with eight popular HSI restoration methods: the traditional image denoising method SVD [52], LRMF-based methods LRMR [20], LRTV [38], WNNM [24] and WSNM [23], tensor-based methods TDL [30] and BM4D [35], and mixture noise based-method NMoG [33].These methods represent the state-of-the-art for the HSI restoration task.The parameters for these compared methods were manually adjusted according to their default strategies.In addition, to facilitate the numerical calculation and visualization, all the bands of the HSI datasets are normalized into [0,1].

Simulated Data Experiments
Two HSI datasets were employed in this experiment: Washington DC Mall (http://lesun.weebly.com/hyperspectral-data-set.html) with size 1208 × 307 × 191 and RemoteImage (http://peterwonka.net/Publications/code/LRTC_Package_Ji.zip) with size 205 × 246 × 96.Some bands of these two HSI datasets are heavily contaminated by noise and cannot be regarded as clean ground truth, and thus have to be deleted [33,36].Due to the computer's memory limitation, we also cropped the subimage of the HSI and finally obtained two synthetic datasets with size 200 × 200 × 160 and 200 × 200 × 89, respectively.These two HSI datasets are considered as clean data in this simulated experiment.
To simulate noisy HSIs, we added six types of noises to the clean HSI datasets to test the performance of all the compared methods.The details of the added noise are listed as follows: • Case 1: For different bands, the noise intensity was equal in this case.Furthermore, the same zero-mean Gaussian noise N (0, σ 2 ) with σ = 0.05 was added to all the bands • Case 2: Entries in all bands were corrupted by zero-mean Gaussian noise but with different intensity.Furthermore, the standard deviation of Gaussian noise that was added to each band of HSIs was uniformly selected from 0.01 to 0. To fairly compare the overall performance of different methods, we set the rank of the expected HSIs as 7 in the DC Mall and RemoteImage dataset for all the compared methods.Furthermore, three measurements-the mean peak signal-to-noise ratio (MPSNR), mean structural similarity index (MSSIM) [53] and Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS) [54]-are utilized to evaluate the restoration quality: where PSNR i and SSIM i denote the PSNR and SSIM values for the i-th band, and Ref i and Res i denote the i-th band of the reference image and the restorated image, respectively.Tables 1 and 2 present the restoration results by all the compared methods in terms of the aforementioned three measurements.The best results for each quality measurement are highlighted in bold.It is clear that our proposed method provides the highest values in both MPSNR and MSSIM, and lowest ERGAS in most of the cases, which validates the superiority of the proposed method over the other methods.For Case 1 with the simple independent and identically distributed (i.i.d.) Gaussian noise, most of the methods have comparable performance, which confirms the effectiveness of these methods to some extent, especially for TDL.When the noise becomes more complex, e.g., the intensity of noise varies by bands, more stripes, more deadlines and more impulse noises are added, the advantages of our proposed method become more evident from Case 2 to Case 6.This advantage substantiates the robustness of our proposed method against complex noise, owing to its adaptability to distinguish noise complexities in different HSI bands in our model.
Figures 4 and 5 present some typical bands of the DC Mall and RemoteImage HSI dataset before and after restoration under Case 3, Case 4 and Case 6, respectively.By comparing the restoration results, it can be clearly seen that DP-GMM performs best, effectively removing all types of noise while preserving more detailed information.In Case 3, the restored HSIs of almost all of the compared methods contain some evident deadlines.Similarly, some stripes still exist in the restoration results of most of the compared methods in Case 4.Under the mixture noise of Case 6, LRTV, NMoG and DP-GMM all have relatively better results than the other methods.In conclusion, our proposed method can have stable and better performance in all cases, which verifies the effectiveness of the proposed method with better capability in fitting a much wider range of noises than current methods.
In order to further compare the performance of all the restoration methods, we also show the spectral signatures of some pixels before and after restoration; for example, Figure 6 shows the spectrum of pixels (55, 110) and (110, 184) for the DC Mall HSI dataset under Case 6, and Figure 7 shows similar results for the RemoteImage HSI dataset.It is easy to see that our proposed method provides the best spectral signature among all of the compared methods, which is closest to the original one.Besides, the NMoG method, which adopts non-i.i.d.MoG to model the noises of HSIs, also has comparable performance.These advantages of DP-GMM and NMoG indicate the necessity of giving more consideration to the complex noise structure of HSIs.

HYDICE Urban Dataset:
The original image is 307 × 307 × 210 in size, and we feed the entire dataset into the denoising algorithms directly.Some bands of Urban are seriously polluted by the atmosphere, water absorption and other unknown noises, such as bands 104-108, 139-151 and 206-210.Additionally, some other bands are contaminated by stripes, deadlines and Gaussian noise, such as bands 1-2, 76 and 198.In this first real data experiment, the rank is set as 7 for all the competing methods.
Figure 8 shows bands 76 (top), 104 (middle) and 139 (bottom) of the restored HSIs of the Urban dataset, and these three bands are typically polluted to varying degrees by different types of noise as analyzed in Figure 1a.Band 76 is relatively clean with several deadlines, and most of the competing methods can remove these deadlines except TDL and BM4D.As for bands 104 and 139, however, some stripes still exist in the restored images for almost all the competing methods.This is because these bands are heavily corrupted and can provide little useful information, and these methods lack noise adaptivity to deal with these complex cases.In contrast, our proposed DP-GMM method is able to completely suppress the stripes and effectively preserves the detailed information.What is more, the images restored by our proposed method are smoother than those restored by other compared methods.These visual results show that our proposed DP-GMM method can remove more complicated noise embedded in HSIs than other competing methods.
Figure 9 shows the vertical mean profiles of bands 76, 104 and 139, which is a quantitative comparison of the restorated results shown in Figure 8.The horizontal axis in Figure 8 represents the column number, and the vertical axis represents the mean digital number value of each column.Band 76 is almost noise-free with only one deadline.There is thus no visible change between the mean profiles before and after the restoration.When the noise becomes more complicated, the curves fluctuate rapidly in the original images for bands 104 and 139.After the restoration processing, the fluctuations are more or less suppressed.Obviously, the DP-GMM methods perform best and obtain the flattest curve without fluctuations.These results are in accordance with the visual results of Figure 8.

AGC Terrain Dataset:
The second real data experiment employed the Terrain dataset as the test dataset, and it has 500 × 307 pixels and 210 bands.The full Terrain image is mainly corrupted by deadlines, atmosphere, water absorption and others, especially the bands 104-109, 138-152 and 204-210.We also set the rank as 7 in this experiment for all the methods.
We selected three typical bands 2, 139 and 151 to display the performance of all the compared methods in Figure 10.These three bands exhibit evident noise distinctness, as shown in Figure 1b, both in terms of distribution types and corruption extent.In band 2, there are some deadlines clustered together, as shown in the amplified area, and SVD, BM4D, WNNM, WSNM, LRMR, LRTV and our proposed DP-GMM can all remove these deadlines completely.As for bands 139 and 151, which contain many stripes and other mixed noise, it is easy to see that the DP-GMM removes more noise, especially stripes, compared with the other methods, owing to its noise adaptivity for different bands.These visual results show that our proposed DP-GMM method is more robust when dealing with the complicated real noise of HSI data.
Figure 11 shows the vertical mean profiles comparison before and after restoration for each competing method, corresponding to the visualization of Figure 10.The horizontal axis in this figure represents the column number, and the corresponding vertical axis represents the mean digital number value of each column.The curves obtained by different methods in band 2 appear to be slightly smoother to different degrees, as compared with that of the original images.However, as for bands 139 and 151, due to the interference of the mixed noise, especially stripes and deadlines, there are sharp fluctuations in the curve for the original images.After the restoration, the fluctuations are suppressed by all of the competing methods to some extent.Obviously, the curves obtained by our proposed method are smoother than others, which indicates the better restoration results of DP-GMM.These quantitative results are also in accordance with the visual results of Figure 10.

Figure 3 .
Figure 3. (a): Illustration of the hierarchical DP-GMM model for HSI restoration.The ellipses with different colors denote the Gaussian component with parameters (µ k , σ 2 k ).The green dotted rectangles, representing the Gaussian compounds of each band, share Gaussian components from a top-level MoG in the red dotted rectangle.The number of Gaussian components of MoG at the two levels are dynamically adjusted during the HSI restoration until it converges.(b): The entire graphic model of DP-GMM.

1 . • Case 3 : 4 : 5 :
All the bands were corrupted by Gaussian noise as Case 2. Besides, 40 bands of the DC Mall dataset (20 bands of the RemoteImage dataset) were randomly chosen to add deadlines, and the number of deadlines in each band is from 5 to 15 with width 1 or 2.•Case All the bands were contaminated by Gaussian noise as Case 2. Besides, 40 bands of the DC Mall dataset (20 bands of the RemoteImage) were randomly selected to add stripes, and the number of stripes is from 15 to 40 with width 1 or 2.•Case All the bands were corrupted by Gaussian noise as Case 2. In addition, different percentages of impulse noises which were uniformly selected from 0 to 0.15 were added to each band.•Case6: Each band was contaminated by Gaussian and impulse noise as Case 5. Besides, 20 bands of the DC Mall dataset (10 bands of the RemoteImage dataset) were randomly selected to add deadlines as Case 3, and another 20 bands of the DC Mall dataset (10 bands of the RemoteImage dataset) were randomly selected to add stripes as Case 4.

Table 1 .
Quantitative Evaluation of different methods on the simulated DC Mall dataset in different cases of noise.

Table 2 .
Quantitative Evaluation of different methods on the simulated RemoteImage dateset in different cases of noise.