Hierarchical Sparse Learning with Spectral-Spatial Information for Hyperspectral Imagery Denoising

During the acquisition process, hyperspectral images (HSI) are inevitably corrupted by various noises, which greatly degrade their visual quality and hamper subsequent applications. In this paper, a novel Bayesian approach integrating hierarchical sparse learning and spectral-spatial information is proposed for HSI denoising. Based on structural correlations, spectral bands with similar and continuous features are grouped into the same band-subset. To exploit local similarity, each subset is then divided into overlapping cubic patches. Each patch is modeled as the sum of a clean image component, a Gaussian noise component, and a sparse noise component. The first term is depicted by a linear combination of dictionary elements, where a Gaussian process with a Gamma distribution is applied to impose spatial consistency on the dictionary. The last two terms fully characterize the noise. Furthermore, the sparseness of the model is adaptively manifested through a Beta-Bernoulli process. Inferred by a Gibbs sampler, the proposed model can directly estimate the noise and the dictionary without prior information about the degraded HSI. Experimental results on both synthetic and real HSI demonstrate that the proposed approach suppresses the various existing noises and preserves structural and spectral-spatial information better than the compared state-of-the-art approaches.


Introduction
As three-dimensional (3D) data integrating both spectral and spatial information, hyperspectral imagery (HSI) can provide more reliable and accurate information about observed objects. It has numerous applications in remote sensing, diagnostic medicine, mineralogy, etc., with encouraging results [1][2][3]. An HSI can be considered a set of gray-scale images over wavelength, whose entries are the spectral responses. Due to limited light, imaging-system imperfections, photon effects, and calibration errors, hyperspectral images are inevitably contaminated by annoying noises with different statistical properties [4,5]. These noises greatly reduce the quality of the HSI and increase the difficulty of subsequent processing, such as target detection, agriculture assessment, mineral exploitation and ground-cover classification. Therefore, noise reduction is an essential research topic in hyperspectral image analysis.
Over the past two decades, various approaches have been developed for the noise reduction of HSI, which aim to restore the original HSI from its noisy observations as accurately as possible. Traditional denoising algorithms in the two-dimensional (2D) image domain have been widely utilized to remove noise from HSI [6][7][8][9][10], such as the wavelet transform, total variation, nonlocal means (NLM), and K-singular value decomposition (K-SVD).
A noisy patch x_i can be modeled as x_i = D a_i + n_i (Equation (1)). The first term on the right side represents the clean component of the i-th patch, where D is the underlying dictionary and a_i are the sparse coefficients with respect to the dictionary elements. The second term n_i is the zero-mean Gaussian noise component with standard deviation σ. Generally, the noiseless signal can be estimated by solving the l_p-minimization problem of Equation (2):

â_i = argmin_a ||x_i − D a||_2^2 + λ ||a||_p

In Equation (2), ||x_i − D a_i||_2^2 is the data-fidelity term, which measures the fidelity between the observed noisy image and the original clean image, and ||a_i||_p is the regularization term, which encodes prior knowledge about noiseless images. λ ≥ 0 is a regularization parameter that trades off fidelity against the sparseness of the coefficients; a larger λ usually yields a sparser solution a.
Denoising approaches aim to restore the noise-free patch x̂_i from the degraded data x_i by computing x̂_i = D â_i. For the sparse prior, p is usually set to zero or one. When p = 0, ||a_i||_0 counts the nonzero elements of a_i; when p = 1, ||a_i||_1 is the sum of the absolute values of its elements. The l_0 norm is not convex, so it cannot guarantee a unique solution; moreover, l_0-minimization is in general an NP-hard combinatorial search problem, which is usually tackled by greedy algorithms. The l_1 norm is a convex function, and the uniqueness of the solution can be guaranteed by replacing the l_0 norm with the l_1 norm [35]. Traditional convex optimization methods solve the problem by alternating minimization over D and a_i. According to the type of dictionary basis, methods can be broadly classified into two groups [36]: (1) transformation-based methods, in which a mathematical model provides the dictionary, e.g., the discrete cosine transformation (DCT), wavelet, curvelet, contourlet and shearlet; (2) learning-based methods [9,37], in which learning techniques construct the dictionary from training data. Learned dictionaries have shown better performance in many applications. Once D and â_i are given, the ideal image x_i can be obtained as x̂_i = D â_i. Bayesian methods have been widely used for learning the sparse dictionary [25,38] and have achieved competitive restoration performance.
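As a concrete illustration (not the paper's own solver), the l_1 case of Equation (2) can be handled by iterative soft-thresholding (ISTA); the sketch below assumes a fixed, externally supplied dictionary D, whereas the methods discussed here learn D as well:

```python
import numpy as np

def ista_sparse_code(x, D, lam, n_iter=200):
    """Minimize 0.5 * ||x - D a||_2^2 + lam * ||a||_1 over a by iterative
    soft-thresholding (ISTA): a gradient step on the fidelity term followed
    by a shrinkage step induced by the l1 penalty."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        u = a - D.T @ (D @ a - x) / L      # gradient step on 0.5 * ||x - Da||^2
        a = np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)  # soft threshold
    return a
```

Alternating this coding step with a dictionary-update step gives the classical alternating minimization of D and a_i mentioned above.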
All these conventional methods achieve good performance on denoising 2D images, but for more complicated data such as hyperspectral images, conventional 2D methods are usually less able to capture structural or textural details, which significantly limits the representational power and the application range of the learned dictionary. In this case, hierarchical dictionary learning methods have been introduced to exploit the latent characteristics of images more deeply [37][38][39], with promising results.

HSI Denoising with Spectral-Spatial Information and Hierarchical Sparse Learning
To restore the HSI well, we extend the 2D-image denoising framework into a hierarchical dictionary learning framework in a Bayesian manner. Let Y be an HSI of size l_m × l_n × l_λ, where l_m × l_n is the number of spatial pixels and l_λ is the size of the spectral domain. Generally, the spectral correlations of HSI are of more importance than the spatial characteristics, and they have recently been reported to enhance performance in tracking, classification, target detection, etc. The proposed method consists of two main stages: spatial-spectral data extraction and noise-free estimation based on hierarchical sparse learning. Firstly, according to the structural similarity index measure (SSIM), a band-subset partition is introduced to segment the HSI into one or more band-subsets, each consisting of continuous bands with high structural correlation; a local homogeneous region can then be extracted for each spectral pixel. Secondly, a hierarchical sparse learning model, composed of a clean image term, a Gaussian noise term and a sparse noise term, is constructed to suppress the various noises. To effectively capture the latent spatial information of the HSI, a Gaussian process with Gamma constraints is imposed on the dictionary in the clean image term. The second and third terms are used to infer the statistical characteristics of the existing noises, e.g., Gaussian noise, Poisson noise, dead pixel lines, stripes, or a mixture of them. By solving this framework in a Bayesian manner, the restored result for each band-subset is obtained. The framework of the proposed method is shown in Figure 1.

Spatial-Spectral Data Extraction
In HSI, neighboring bands are acquired under relatively similar sensor conditions, and they are strongly correlated in the spectral domain. Based on this, SSIM is utilized to measure the spectral correlation between adjacent bands [40]. Suppose B_j and B_{j+1} represent the 2D images lying in the j-th and (j+1)-th bands, respectively. The structural similarity between them is defined by Equation (3):

SSIM(B_j, B_{j+1}) = [(2 μ_{B_j} μ_{B_{j+1}} + c_1)(2 σ_{B_j B_{j+1}} + c_2)] / [(μ_{B_j}^2 + μ_{B_{j+1}}^2 + c_1)(σ_{B_j}^2 + σ_{B_{j+1}}^2 + c_2)]

In Equation (3), μ_{B_j} and σ_{B_j}^2 are the mean and variance of band B_j, μ_{B_{j+1}} and σ_{B_{j+1}}^2 are the mean and variance of band B_{j+1}, and σ_{B_j B_{j+1}} is their covariance; the predefined constants c_1 and c_2 stabilize the division when the denominator is weak. Normally, the closer SSIM(B_j, B_{j+1}) is to one, the stronger the structural correlation between the j-th and (j+1)-th spectral bands. Letting S_c(j) = SSIM(B_j, B_{j+1}), the structural correlation curve S_c can be generated. Figure 2 shows the structural correlation curves of different hyperspectral images.
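A minimal sketch of the band-correlation curve, using the global (single-window) form of SSIM; the constants c1 and c2 below are illustrative defaults, not values taken from the paper:

```python
import numpy as np

def ssim_global(b1, b2, c1=1e-4, c2=9e-4):
    """Global SSIM between two bands (Equation (3)); c1 and c2 stabilize
    the division when the denominators are weak. Pixel values in [0, 1]."""
    mu1, mu2 = b1.mean(), b2.mean()
    v1, v2 = b1.var(), b2.var()
    cov = ((b1 - mu1) * (b2 - mu2)).mean()   # covariance of the two bands
    return ((2 * mu1 * mu2 + c1) * (2 * cov + c2) /
            ((mu1 ** 2 + mu2 ** 2 + c1) * (v1 + v2 + c2)))

def correlation_curve(cube):
    """S_c(j) = SSIM(B_j, B_{j+1}) for an l_m x l_n x l_lambda cube."""
    return np.array([ssim_global(cube[:, :, j], cube[:, :, j + 1])
                     for j in range(cube.shape[2] - 1)])
```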

Based on the curves in Figure 2, it can be seen that the correlation coefficients across adjacent bands vary considerably for different hyperspectral images. The curve in Figure 2a displays a relatively steady trend. Obvious drops exist in Figure 2b,c, which means that some adjacent spectral bands in the Urban data and the Indian Pines data have much lower structural similarity; meanwhile, the continuous spectral bands between two neighboring drops show a relatively stable trend. However, most previous studies neglect this property of the spectral bands: they either restore all bands directly, or sequentially denoise band-subsets built by partitioning the spectral bands with a fixed size. To make the best use of the correlations across neighboring bands, this paper seeks the optimal partition of the spectral domain by estimating the drops in the curve S_c. The detailed procedure of spatial-spectral data extraction is as follows. Firstly, form the structural correlation curve S_c using Equation (3). Secondly, detect the local segmentation points S_c(j) in the curve S_c that meet the condition denoted by Equation (4), in which η is a predefined threshold applied to avoid local disturbances caused by noise in the curve S_c. The beginning and ending points of S_c are treated as intrinsic local partition points. To better exploit useful spectral characteristics and suppress the noise [40], a fused image is generated as the average of the spectral bands between neighboring segmentation points. Based on Equations (3) and (4), it is then determined whether adjacent fused images are to be merged or not. In this way, the segmentation points in the curve S_c are identified.
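Since Equation (4) is only described in words here, the sketch below encodes one plausible reading of the drop condition: a point is a segmentation candidate when the curve falls more than η below both of its neighbors, so that small noise wiggles are ignored. The helper names and the exact test are assumptions:

```python
import numpy as np

def detect_drops(s_c, eta=0.05):
    """Candidate segmentation points in the correlation curve S_c:
    keep index j when S_c(j) sits more than eta below both neighbours
    (one plausible reading of Equation (4); eta suppresses noise wiggles)."""
    return [j for j in range(1, len(s_c) - 1)
            if s_c[j - 1] - s_c[j] > eta and s_c[j + 1] - s_c[j] > eta]

def band_subsets(n_bands, drops):
    """Split band indices 0..n_bands-1 into contiguous subsets at the drops.
    A drop at S_c index j separates band j from band j+1."""
    bounds = [0] + [j + 1 for j in drops] + [n_bands]
    return [list(range(bounds[i], bounds[i + 1]))
            for i in range(len(bounds) - 1)]
```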
As shown in Figure 2a, the correlations across all neighboring bands have a relatively stable trend, and there is no need for spectral band segmentation in this case; the HSI itself can then be considered a special band-subset. As shown in Figure 2b,c, the HSI can be divided into non-overlapping band-subsets based on the segmentation points. Let C denote the number of band-subsets. The noisy data can thus be reshaped as X = {X_1, . . . , X_c, . . . , X_C}, c = 1, . . . , C, where X_c is the c-th band-subset.
Finally, to effectively preserve the local spatial details of the HSI [41], we use cubic patches instead of 2D patches during the denoising process. Each band-subset is divided into many overlapping cubic patches. The size of each cubic patch is fixed as l_x × l_y × l_c, where l_x × l_y is the spatial size and l_c is the number of spectral bands in the c-th band-subset. Note that various material categories may coexist within the same cubic patch, so spectral-signature contamination (mixing/blurring) increases as the spatial size of the cubic patches grows; larger l_x and l_y may therefore destabilize classification accuracy. Conversely, if the neighborhood size is too small, the spatial information cannot be well exploited. In this paper we balance stability against the exploitation of spatial information, and set l_x = l_y = 4 in the experiments.
After reshaping the cubic patches into vectors, we obtain the c-th band-subset X_c = {x_1^c, . . . , x_i^c, . . . , x_M^c}, where M = (l_m − l_x + 1)(l_n − l_y + 1) is the number of cubic patches and x_i^c ∈ R^P, with P = l_x × l_y × l_c, is the vector generated by the i-th cubic patch of the c-th band-subset.
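The patch extraction just described can be sketched as follows (stride-1 overlapping cubic patches, vectorized column by column into a P × M matrix):

```python
import numpy as np

def extract_cubic_patches(subset, lx=4, ly=4):
    """Divide an l_m x l_n x l_c band-subset into overlapping cubic patches
    of size lx x ly x l_c (stride 1) and reshape each into a vector of
    length P = lx * ly * l_c, giving M = (l_m - lx + 1) * (l_n - ly + 1)
    columns in total."""
    lm, ln, lc = subset.shape
    patches = [subset[i:i + lx, j:j + ly, :].reshape(-1)
               for i in range(lm - lx + 1)
               for j in range(ln - ly + 1)]
    return np.stack(patches, axis=1)  # P x M matrix X^c
```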

Hierarchical Sparse Learning for Denoising Each Band-Subset
In this subsection, the hierarchical sparse learning framework is constructed with prior and hyper-prior distributions to restore the noisy HSI [27,37,42]; it integrates a Gaussian process and a Gamma distribution into the dictionary atoms so as to simultaneously explore the spatial consistency of the HSI. A prior consisting of a prior distribution and a hyper-prior distribution is often called a hierarchical prior. Notice that a sparse learning framework with multiple hierarchical priors can be regarded as a special case of deep learning. Considering the dataset X = {X_1, . . . , X_c, . . . , X_C}, c = 1, . . . , C, the suggested approach sequentially depicts the characteristics of each subset to recover the HSI by utilizing independent draws. For the c-th band-subset X_c = {x_1^c, . . . , x_i^c, . . . , x_M^c}, the hierarchical denoising model can be written as Equation (5):

X_c = D_c A_c + N_c + Q_c • S_c

where the symbol • represents element-wise multiplication. The first term on the right side of Equation (5) is the ideally noiseless estimate of X_c, represented as a linear combination of dictionary atoms. The columns of D_c = [d_1^c, . . . , d_k^c, . . . , d_K^c] ∈ R^{P×K} are the dictionary atoms to be learned, where K is the number of atoms. In the context of HSI, the signature of x_i^c is highly spatially consistent with the samples located in its neighboring region. To better learn the HSI [43], this prior knowledge is explicitly imposed on the dictionary atom d_k^c through a Gaussian process (GP) constrained by a Gamma distribution. Additionally, spectral pixels with similar characteristics in a local region are very likely to share the same dictionary atoms, which is consistent with visual and structural cognition. In the prior specification of the model, I_K denotes the K × K identity matrix, and Bern and Beta refer to the Bernoulli and Beta distributions, respectively.
The noise-free data D_c A_c can then be represented as in Equation (6), where ξ_1 is drawn from a Gamma distribution and ξ_2 is a predefined constant; both are applied to capture the smoothness between neighboring patches x_i^c and x_j^c. If ||x_i^c − x_j^c||_2^2 is small, the corresponding components of x_i^c and x_j^c are regarded as having strong spatial consistency. In this way, both the spectral correlation and the spatial consistency of the c-th band-subset X_c are exploited simultaneously. The indicator z_ki^c denotes whether the atom d_k^c is utilized to represent x_i^c, with probability π_k^c. When K→∞, the expectation E(π_k^c) = K^{−1}a_0/[K^{−1}a_0 + K^{−1}(K−1)b_0] approaches zero. Accordingly, most elements of {z_ki^c}_{k=1,...,K} are equal to zero, and sparsity is reasonably imposed on the vector a_i^c. Clearly, a_i^c consists of a few non-zero values and numerous zeros; in this sense, the constraint constructed by the Beta-Bernoulli process can be regarded as a special type of l_0 norm. Notice that the noiseless estimate of each x_i^c uses a specific subset of the dictionary atoms {d_k^c}_{k=1,...,K}, specified by the sparse vector {z_ki^c}_{k=1,...,K}; z_ki^c = 1 indicates that the atom d_k^c is employed for representing the coefficient a_ki^c. By analyzing the number and the locations of the non-zero elements in {z_ki^c}_{k=1,...,K}, the size of the dictionary can be learned adaptively, including atom selection and prediction. The matrix N_c = [n_1^c, . . . , n_i^c, . . . , n_M^c], of size P × M, defines the zero-mean Gaussian noise; n_i^c is drawn from a zero-mean Gaussian with precision γ_n, as displayed in Equation (7). Additionally, Poisson noise is signal-dependent, meaning that pixels with higher intensity are more strongly corrupted.
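The sparsity mechanism can be illustrated numerically: drawing π_k ~ Beta(a_0/K, b_0(K−1)/K) and z_ki ~ Bern(π_k) yields mostly zero indicators, matching the expectation formula above. The function below is a standalone sketch, not the paper's sampler:

```python
import numpy as np

def beta_bernoulli_usage(K=512, a0=1.0, b0=1.0, M=1000, seed=0):
    """Sketch of the Beta-Bernoulli construction: pi_k ~ Beta(a0/K, b0(K-1)/K)
    gives E(pi_k) = a0 / (a0 + (K-1) b0), which shrinks toward zero as K grows,
    so each indicator z_ki ~ Bern(pi_k) is mostly zero and only a few
    dictionary atoms are used per patch."""
    rng = np.random.default_rng(seed)
    pi = rng.beta(a0 / K, b0 * (K - 1) / K, size=K)  # atom-usage probabilities
    Z = rng.random((K, M)) < pi[:, None]             # binary usage indicators
    return pi, Z

pi, Z = beta_bernoulli_usage()
sparsity = Z.mean()   # fraction of active (k, i) pairs; close to E(pi_k)
```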
To remove this dependence of the noise variance on the signal, a variance-stabilizing transform (VST) is applied to convert the Poisson noise into approximately Gaussian noise before denoising [14,44]. After recovery, the corresponding inverse transform is applied to obtain the final HSI restoration result.
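As one common choice of VST (the specific transform used in [14,44] is not spelled out here, so this is an assumption), the classical Anscombe transform maps Poisson data to data with approximately unit-variance Gaussian noise:

```python
import numpy as np

def anscombe(x):
    """Forward Anscombe transform: maps Poisson counts to data whose noise
    is approximately Gaussian with unit variance."""
    return 2.0 * np.sqrt(x + 3.0 / 8.0)

def inverse_anscombe(y):
    """Simple algebraic inverse; unbiased closed-form inverses exist but
    are longer (Makitalo & Foi)."""
    return (y / 2.0) ** 2 - 3.0 / 8.0
```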
The third term denotes the sparse noise, such as dead pixel lines, and exploits the Beta-Bernoulli process to manifest its sparseness, as displayed in Equation (8). The matrix Q_c = [q_1^c, . . . , q_i^c, . . . , q_M^c], of size P × M, represents the intensity of the sparse noise, and S_c = [s_1^c, . . . , s_i^c, . . . , s_M^c] ∈ R^{P×M} depicts the locations where sparse noise may exist.
When s_pi^c = 1, the p-th element of x_i^c is polluted by sparse noise with amplitude q_pi^c. The expectation E(θ_i^c) = a_θ/(a_θ + b_θ) can be made close to zero by adjusting the shape parameters a_θ and b_θ of the Beta distribution. Each element of s_i^c is drawn i.i.d. from a Bernoulli distribution governed by θ_i^c, which captures the arbitrariness of the positions of the sparse noise. γ_w^c, γ_n^c and γ_v^c are interpretable as non-informative hyper-parameters, which regulate the precision of w_i^c, n_i^c and q_i^c. To flexibly solve the model through the posterior PDF, Gamma distributions are placed on these hyper-parameters, as shown in Equation (9). The negative logarithm of the posterior density of the above model (applied jointly to all data X_c = {x_i^c}_{i=1,...,M}) is given by Equation (10). According to Equation (10), the full posterior of all parameters can be obtained, rather than the point estimate commonly produced by maximum a posteriori (MAP) inference. All parameters needed to solve the model are learned through the prior and hyper-prior distributions, so the proposed hierarchical sparse learning framework is more robust and accurate. A graphical representation of the complete model is shown in Figure 3. Additionally, as there are several solutions for the same spectral pixel due to the use of overlapping cubic patches, the final restored result is constructed by averaging all overlapping cubic patches.
For all band-subsets X = {X 1 ,…,X c ,…,X C }, c = 1,…,C, the whole HSI after denoising can be obtained by sequentially performing this operation.
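The final overlap-averaging step can be sketched as follows, assuming the denoised patch vectors are ordered as produced by stride-1 extraction (row-major over the top-left corners):

```python
import numpy as np

def average_overlapping_patches(patch_vecs, shape, lx=4, ly=4):
    """Rebuild an l_m x l_n x l_c band-subset from denoised patch vectors
    (a P x M matrix) by averaging all overlapping cubic patches that cover
    each pixel."""
    lm, ln, lc = shape
    acc = np.zeros(shape)   # sum of patch contributions per pixel
    cnt = np.zeros(shape)   # number of patches covering each pixel
    m = 0
    for i in range(lm - lx + 1):
        for j in range(ln - ly + 1):
            acc[i:i + lx, j:j + ly, :] += patch_vecs[:, m].reshape(lx, ly, lc)
            cnt[i:i + lx, j:j + ly, :] += 1.0
            m += 1
    return acc / cnt
```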


Experimental Results and Discussion
To evaluate the performance of the proposed approach, six state-of-the-art denoising methods were selected for comparison on both simulated and real data: K-SVD [9], BM3D [10], ANLM3D [13], BM4D [19], LRMR [31] and DDL3+FT [34]. The necessary parameters of the K-SVD, BM3D, ANLM3D and DDL3+FT methods were finely tuned or selected automatically to generate the optimal experimental results, as the references suggest. In BM4D, the noise variance is chosen from the set {0.01, 0.03, 0.04, 0.05, 0.07, 0.09, 1.1}. In LRMR, the rank of the noiseless matrix is selected from {4, 5, 6, 7}, and the cardinality of the sparse term is chosen from the set {0, 500, 1000, 1500, 2000, 3000, 4000, 5000}. For the proposed method, K is set to a relatively small value to reduce computational time; here we choose K = 128. The parameters of the hyper-priors in the Gaussian process are set to β = ρ = 10^{−6} and ξ_2 = 200, and the number of Gibbs sampling iterations is set to 100. For the remaining hyper-prior parameters, two cases are considered: (1) the HSI is contaminated by Gaussian noise, dead pixel lines, or a mixture of them; (2) the HSI is contaminated by a mixture of Gaussian and Poisson noise, or by a mixture of Gaussian noise, Poisson noise and dead pixel lines. Once the type of noise is identified, the hyper-prior parameters are set accordingly and need no further tuning. With these hyper-parameters, the proposed framework can efficiently learn the latent information and suppress the various noises in the input HSI by sampling the infinite prior space. For one Gibbs sampling iteration, the computational complexity of the proposed method is approximately O(K(P + M) + PM). It should be pointed out that the proposed framework requires considerably more time than the six compared algorithms.

Experiment on Simulated Data
An image of Pavia University, acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) over Pavia, northern Italy, is used for the experiments on simulated data. It consists of 103 spectral bands of size 610 × 340 each, covering the spectral reflectance characteristics collected from 430 nm to 860 nm. Before the simulated experiments, the gray values of the Pavia University data are scaled to the interval [0, 1].
For comparison, two kinds of evaluation are presented: (1) visual appearance, to qualitatively assess the results before and after denoising, including spatial images at selected bands and the spectral signatures of a few pixels; (2) two commonly used metrics, the peak signal-to-noise ratio (PSNR) and the feature similarity index measure (FSIM), to quantitatively assess the denoising results. PSNR measures the gray-level similarity between the restored image and the reference image via the MSE. FSIM estimates the restored result by integrating the gradient magnitude feature with the phase congruency feature, from the viewpoint of human perceptual consistency [17,47]. Higher PSNR and FSIM values indicate better denoising performance.
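For images scaled to [0, 1], the PSNR metric reduces to a one-liner; FSIM is considerably more involved and is omitted here:

```python
import numpy as np

def psnr(reference, restored, peak=1.0):
    """PSNR in dB between a restored band and its reference, for pixel
    values in [0, peak]; higher is better."""
    mse = np.mean((reference - restored) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```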
To estimate the effectiveness of the proposed approach, three types of noise are considered for the Pavia data: (1) zero-mean Gaussian noise; (2) Poisson noise, parameterized through the expression X_Poisson = X × peak, where X_Poisson refers to the HSI corrupted by Poisson noise, X is the reference image, and peak denotes the Poisson noise intensity; (3) dead pixel lines, which are added at the same positions of the selected bands of the HSI, with a width varying from one to three lines. In the simulated experiments the noises are added to the Pavia data in the following three cases. Case 1: the noise standard deviation σ for each band of the HSI is randomly selected from 0.02 to 0.15. Figure 4 displays the PSNR and FSIM values of each band before and after recovery, as the quantitative evaluation for the Pavia data. In Figure 4, the noisy-HSI curves fluctuate strongly across the spectral bands, which is caused by σ changing with the bands; the noise information is therefore very important for substantially improving the denoising performance. By adaptively predicting the noise and fully exploiting the spectral-spatial information, the proposed method obtains higher PSNR and FSIM values than its competitors in most bands. KSVD and BM3D restore the HSI with a predefined fixed noise variance and do not learn σ during the simulation; moreover, they are applied to the noisy HSI band by band and thus ignore the strong spectral correlations, so they show the lowest values in both Figure 4a,b. ANLM3D and BM4D suppress the noise by exploiting both spatial and spectral information, which yields better results than KSVD and BM3D, although ANLM3D shows quite unstable performance, as shown in Figure 4. LRMR takes advantage of the low-rank property of HSI and attains a better FSIM value by retaining the latent features of the HSI well.
By exploiting hierarchical deep learning and fine-tuning, DDL3+FT shows comparable PSNR values, as shown in Figure 4a. LRMR and DDL3+FT transform the HSI into a 2D matrix during the recovery process, so neither can effectively exploit the spatial consistency of the HSI. The curves obtained by the proposed approach clearly have a more stable trend than those of the six compared approaches, which demonstrates the effectiveness and robustness of the proposed method in reducing zero-mean Gaussian noise with σ varying across bands.
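For reference, the three simulated degradations can be sketched as below; the paper's exact Poisson parameterization is abbreviated, so the usual scale-draw-rescale convention is assumed here, and the dead lines are one pixel wide for simplicity (the paper uses widths of one to three):

```python
import numpy as np

def simulate_noise(clean, sigma_range=(0.02, 0.15), peak=None,
                   dead_bands=(), seed=0):
    """Degrade a clean HSI cube with values in [0, 1]: per-band Gaussian
    noise with sigma drawn from sigma_range, optional Poisson noise at the
    given peak intensity (scale up, draw counts, scale back), and a
    1-pixel-wide dead line in each listed band."""
    rng = np.random.default_rng(seed)
    noisy = clean.copy()
    if peak is not None:
        noisy = rng.poisson(noisy * peak) / peak       # signal-dependent noise
    for b in range(noisy.shape[2]):
        sigma = rng.uniform(*sigma_range)              # band-specific sigma
        noisy[:, :, b] += rng.normal(0.0, sigma, noisy.shape[:2])
    for b in dead_bands:
        col = rng.integers(0, noisy.shape[1])          # random dead column
        noisy[:, col, b] = 0.0
    return noisy
```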
In terms of visual comparison, Figure 5 shows the denoised results of band 101 obtained by the different approaches. It can be clearly seen that the proposed method works well in suppressing the noise while preserving the detailed local structure. This is further depicted by the amplified area in the restored images of all competing methods. KSVD shows very poor performance and loses useful structural information. BM3D smooths out some important objects and also performs poorly on recovery. ANLM3D can effectively utilize the high nonlocal self-similarity to balance smoothing against structural preservation, but it still fails to recover the profiles of local targets. The denoised results obtained by BM4D and DDL3+FT lose some fine objects. LRMR obtains results comparable to those of the proposed method, but the Gaussian noise is not well reduced, as shown in Figure 5g.
Clearly, these visual assessment results are fully consistent with the numerical evaluations above.
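The per-band PSNR curves referred to above (e.g., Figure 4a) follow the standard definition; a minimal sketch for a cube of shape (rows, cols, bands) with intensities in [0, peak] could look like this:

```python
import numpy as np

def psnr_per_band(clean, restored, peak=1.0):
    """Peak signal-to-noise ratio (dB) of every spectral band
    of a (rows, cols, bands) cube, using the standard definition."""
    mse = np.mean((clean - restored) ** 2, axis=(0, 1))
    # Guard against division by zero on perfectly restored bands.
    return 10.0 * np.log10(peak ** 2 / np.maximum(mse, 1e-12))

# A uniform error of 0.1 gives MSE = 0.01, i.e. a PSNR of 20 dB for peak = 1.
ref = np.zeros((8, 8, 2))
est = ref.copy()
est[:, :, 1] += 0.1
print(psnr_per_band(ref, est))  # band 0: very high (error-free), band 1: 20 dB
```

FSIM, the other metric used in the paper, is structurally more involved (phase congruency and gradient magnitude) and is not reproduced here.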
Case 2: mixed noise consisting of zero-mean Gaussian noise with standard deviation σ = 0.2 and dead pixel lines. For Case 2, the restored results at band 95 calculated by the different methods, including KSVD, BM3D, ANLM3D, BM4D, LRMR, DDL3+FT and the proposed approach, are shown in Figure 6. One area of interest is amplified in the recovered images of all methods. As shown in Figure 6a-h, the result obtained by the proposed method is much closer to the clean HSI in visual appearance. The dead pixel lines in Figure 6c-h are still obvious, which means that the six compared methods fail to suppress them. Moreover, the amplified area shows that our restoration method can effectively restore the homogeneous regions while preserving the edges. KSVD can partly improve the noisy image quality, but it destroys the structure of the sparse coefficients in the dictionary learning process, which results in the loss of edges and other structural details, as shown in Figure 6c. BM3D reduces the noise by utilizing the statistics of similar neighboring patches and achieves a much better visual impression than KSVD, but it smooths out structural information and its result is blurred.
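The dead-pixel-line corruption of Case 2 can be simulated by forcing a few image columns of each band to a constant value. This is a hedged sketch of one plausible simulation, not necessarily the paper's exact procedure; the number of lines and the dead value are hypothetical.

```python
import numpy as np

def add_dead_pixel_lines(hsi, n_lines=3, value=0.0, seed=1):
    """Simulate dead pixel lines by setting a few randomly chosen columns
    of each band to a constant value (hypothetical Case-2-style corruption)."""
    rng = np.random.default_rng(seed)
    out = hsi.copy()
    rows, cols, bands = out.shape
    for b in range(bands):
        dead_cols = rng.choice(cols, size=n_lines, replace=False)
        out[:, dead_cols, b] = value
    return out

# Example: a constant cube, so the dead columns are the only zeros afterwards
cube = np.full((16, 16, 4), 0.5)
corrupted = add_dead_pixel_lines(cube)
```

Because the dead lines are sparse and structured, they are naturally absorbed by the sparse noise component of the model rather than the Gaussian one.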
ANLM3D can effectively utilize the high nonlocal self-similarity to better balance smoothing and structural preservation; however, it fails to preserve the edges and details.
BM4D achieves visual improvements by utilizing the 3D nonlocal self-similarity of the data cube; however, it removes some fine objects and oversmooths the HSI. As shown in Figure 6g, the blurred black lines indicate that LRMR removes only part of the dead pixel lines, while the blurry white dots indicate that LRMR cannot efficiently remove the heavy Gaussian noise. Clearly, the DDL3+FT method fails to restore the mixed noisy image, as can be seen from the blurry edges and obvious dead pixel lines in Figure 6h.
From Figures 7 and 8, it can be observed that the profiles produced by the proposed method are the closest to those of the original HSI, giving the best removal of Gaussian noise and dead pixel lines. The curves in Figures 7b-g and 8b-g deviate from those in Figures 7a and 8a, reflecting the limited denoising ability of the compared methods, which strongly supports the analysis above.
Spectral signatures of the clean image and the restored images at pixel (559, 150) are displayed in Figure 9; these are very important for the classification and identification of the HSI. KSVD and BM3D denoise the HSI band by band and destroy the spectral-spatial correlation. As shown in Figure 9b,c, there are recognizable artifacts in the signatures restored by KSVD and BM3D. With ANLM3D and BM4D, the restored signatures are much closer to the original spectral reflectance curve owing to the exploitation of spectral-spatial information, but they fluctuate strongly compared with the clean spectral signature. The signatures restored by LRMR and DDL3+FT have similar trends and shapes, but the detailed information is not well preserved. Clearly, the spectral signature calculated by the proposed method is the best, which demonstrates the superiority of the proposed method in reducing Gaussian noise and dead pixel lines.
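Spectral fidelity of a restored signature can be quantified; the paper compares signatures visually, but a common numerical measure (introduced here as an illustration, not one of the paper's metrics) is the spectral angle between the clean and restored curves:

```python
import numpy as np

def spectral_angle(sig_a, sig_b):
    """Spectral angle (radians) between two spectral signatures.
    0 means identical shape; larger values mean stronger spectral distortion.
    A common fidelity measure for restored spectra (not used in the paper)."""
    a, b = np.asarray(sig_a, float), np.asarray(sig_b, float)
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

clean_sig = np.array([0.2, 0.4, 0.6, 0.8])
scaled    = 2.0 * clean_sig                               # same shape
distorted = clean_sig + np.array([0.3, -0.1, 0.2, -0.2])  # shape changed
print(spectral_angle(clean_sig, scaled))     # ≈ 0
print(spectral_angle(clean_sig, distorted))  # > 0
```

The angle is invariant to a global scaling of the signature, which makes it well suited to detecting the shape distortions visible in Figure 9b,c.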
One area of all the listed images in Figure 10 is magnified for a clear comparison. KSVD shows poor performance, as shown in Figure 10c, and BM3D oversmooths and loses some useful objects. There are also obvious dead pixel lines in Figure 10c,d, which indicates that the KSVD and BM3D methods fail to remove them. The ANLM3D and BM4D algorithms can only reduce part of the dead pixel lines, as presented in Figure 10e,f, and neither works well in preserving the fine objects. The restored image of LRMR still contains dead pixel lines and some Gaussian noise. From Figure 10h, it can be seen that DDL3+FT fails to suppress the dead pixel lines. As presented in Figure 10i, our approach can effectively remove the Gaussian noise, Poisson noise and dead pixel lines while preserving local details such as edges and textures; it clearly performs best among the compared methods.
Figures 11 and 12 display the vertical and horizontal profiles of band 90 at pixel (399, 290) in the simulated experiment of Case 3, respectively. The spectral reflectance curves of all competing approaches at location (399, 290) before and after denoising are shown in Figure 13. The rapid fluctuations of the results in Figure 10 are caused by the dead pixel lines. A visual comparison is made based on the differences in shape and amplitude in Figures 11-13. KSVD is inferior in suppressing the mixed noise and preserving the spectral information, as can be clearly seen in Figures 11b, 12b and 13b. According to Figures 11c, 12c and 13c, BM3D oversmooths while preserving the structure and introduces some recovery artifacts. From Figures 11d, 12d and 13d, it can be seen that ANLM3D can partly reduce the mixed noise, as evidenced by the reduction of the rapid fluctuations in Figure 12d. As shown in Figures 11e,f, 12e,f and 13e,f, the curves obtained by BM4D and LRMR are much closer to the original curves than those of KSVD, BM3D, ANLM3D and DDL3+FT, but they fail in detail preservation. Comparing the regions marked by the red rectangles shows that our method achieves the best approximation to the intrinsic patterns of the clean HSI, which is fully consistent with the analysis above.

Experiment on Real Data
Two well-known real data sets, the Urban data and the Indian Pines data, are adopted to evaluate the proposed method. Visual impressions are presented to qualitatively assess the results before and after restoration. For real HSI, there is no reference image with which to compute numerical evaluations such as PSNR and FSIM. Hence, classification accuracies on the Indian Pines data are used to quantitatively assess the denoising performance.

Denoising for Urban Data
The Urban data, with an original size of 307 × 307 × 210, were acquired by the HYDICE sensor. Due to detector-to-detector differences, they contain intensity stripes and mixed noise that vary across bands. After removing bands 104-108, 139-151 and 207-210 of the Urban data, which are polluted by the atmosphere and water absorption, a subset of size 150 × 150 × 188 is used in the following experiments. Figure 14 presents the recovered images of band 186 obtained by the different methods. To facilitate the visual comparison, yellow arrows mark the obvious stripes in Figure 14, and Figure 15 gives enlarged details of the red rectangle in Figure 14. KSVD performs poorly in reducing the stripes and preserving the structures, as shown in Figures 14b and 15b. Obviously, BM3D and BM4D smooth out important texture and fine targets, and both fail to remove the stripes. According to Figures 14d,g and 15d,g, ANLM3D and DDL3+FT can partly suppress the stripes, and DDL3+FT preserves texture and edge information better than ANLM3D. LRMR shows better recovery performance than the other five compared methods in noise reduction and structure preservation; however, from Figure 14f,h it can be clearly observed that LRMR is inferior to the proposed method in removing the stripes. Overall, our restoration method can efficiently restore the Urban data and convincingly outperforms the six compared denoising approaches.

In addition, Figure 16 presents the results at band 104 before and after denoising. As shown in Figure 16a, there are many heavy stripes in the original band 104. The false-color images of the Urban data before and after restoration, composed of bands 1, 104 and 135, are presented in Figure 17. KSVD updates the dictionary atoms one by one and destroys the structure of the sparse coefficients.
As shown in Figures 16b and 17b, it blurs the image structures and shows very poor performance. By exploiting block matching and 3D collaborative filtering, BM3D greatly enhances the quality of the original images, but obvious stripes remain in Figures 16c and 17c. According to the regions marked by the red ellipses, ANLM3D, BM4D, LRMR and DDL3+FT can partly remove the stripes, while LRMR is less effective at restoring images polluted by serious noise. Obviously, the proposed method shows superior performance in denoising the flat regions with better edge preservation. In particular, comparing the regions marked by the blue rectangle in Figure 16 shows that the proposed method restores the targets well and retains the edges while efficiently reducing the stripes and mixed noise. Above all, the restored images obtained by the proposed method present better visual results than those of KSVD, BM3D, ANLM3D, BM4D, LRMR and DDL3+FT, with satisfying detail preservation.

Experimental Results on Indian Pines Data
The second data set, Indian Pines, was recorded by the NASA AVIRIS sensor over the Indian Pines region in 1992 and contains considerable random noise in some bands introduced during the acquisition process. It comprises 220 spectral bands, and the spatial dimension of each band is 145 × 145 pixels. The ground truth of the Indian Pines data has 16 land-cover classes and a total of 10,366 labeled pixels.

As shown in Figures 18d and 19d, the image quality improvement obtained by ANLM3D is very small and can be neglected. BM4D oversmooths and loses texture information. Observing the regions marked by the blue rectangles in Figures 18-20, our denoising method preserves edges and structures better than LRMR and DDL3+FT; meanwhile, LRMR and DDL3+FT are less effective at removing the random noise, as shown in the regions marked by the red ellipses in Figures 18-20. According to Figures 18-20, our algorithm is clearly superior to the six compared methods for seriously corrupted images, which is consistent with the above analysis: it performs best in removing random noise from the Indian Pines data while effectively improving the quality of the noisy HSI and restoring the texture and structure details.
For the classification-based evaluation, two cases are considered according to the testing data: (1) classifying the 20 heavily corrupted bands of the Indian Pines data, including bands 104-108, 150-163 and 220; (2) classifying the Indian Pines data with the 20 heavily corrupted bands removed. Similar to the setting in [48], the number of training samples for the small classes "alfalfa", "grass/pasture mowed" and "oats" is set to 15 per class, and the number for the remaining classes is set to 50. A support vector machine (SVM) is utilized as the classifier. As usual, the commonly used overall accuracy (OA) and kappa coefficient (κ) are selected as evaluation metrics, and the resulting classification maps are used for visual assessment. Table 1 lists the OA and κ of the results for the heavily corrupted bands, and Table 2 shows those for the remaining bands. After restoring the testing data, the values of OA and κ are obviously enhanced, as shown in Tables 1 and 2, which demonstrates the necessity of HSI denoising before classification. Compared with the other algorithms, the proposed method obtains the best OA and κ in both Tables 1 and 2, which means that our denoising method can well restore the structure information (essential for classification) in both the seriously polluted bands and the remaining 200 bands. We note that the ANLM3D method obtains an OA of 25.17% and a κ of 0.2135 in Table 1, which are just 9.43% and 0.1223 higher than those of the initial HSI. This low classification accuracy and kappa coefficient are fully in line with its poor denoising performance displayed in Figure 19d.
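The two metrics used above are standard; a minimal sketch of computing OA and Cohen's kappa from predicted versus true labels (self-contained, without relying on an external library) could look like this:

```python
import numpy as np

def oa_and_kappa(y_true, y_pred):
    """Overall accuracy and Cohen's kappa coefficient from label vectors."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    oa = float(np.mean(y_true == y_pred))
    # Expected agreement by chance, from the marginal class frequencies.
    classes = np.union1d(y_true, y_pred)
    pe = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in classes)
    kappa = (oa - pe) / (1.0 - pe)
    return oa, float(kappa)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
oa, kappa = oa_and_kappa(y_true, y_pred)
print(oa, kappa)  # OA = 5/6 ≈ 0.833, kappa = 0.75
```

Kappa corrects the raw accuracy for chance agreement, which is why it is reported alongside OA in Tables 1 and 2.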
The classification maps of the different algorithms are displayed in Figure 21, where the first row shows the results for the 20 heavily corrupted bands and the second row shows the results for the remaining 200 bands before and after restoration. According to Figure 21, the result of the proposed method presents a better visual effect than those of the six compared algorithms.

Threshold Parameter η
The value of the parameter η in Equation (4) is selected from 0 to 1. Quantitative evaluations of different η values were carried out by comparing the OA metric on the Indian Pines data. Noting that, for the data consisting of the 20 heavily polluted bands, the SSIM values of adjacent bands are very similar to each other, as shown in Figure 2c, we only analyze the remaining 200 bands here. Figure 22 presents the corresponding SVM classification accuracies. The highest classification accuracies occur for η ∈ (0.4, 0.5) and the lowest for η ∈ (0, 0.1). In particular, η = 0 means that the segmentation points are selected according to the local dropping points, which results in the absence of sample information for some band-subsets and decreases the classification accuracy. When η = 1, the bands are not divided at all. Observing the curve in Figure 22, an appropriate η can markedly enhance the classification accuracy, which also demonstrates the necessity of the band-subset segmentation.
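Equation (4) itself is not reproduced in this section, so the following is only an illustrative sketch of threshold-based band segmentation: adjacent-band correlation is used as a simple stand-in for the paper's SSIM-based criterion, and a cut is made wherever the similarity drops below η.

```python
import numpy as np

def segment_bands(hsi, eta=0.45):
    """Split the spectral bands of a (rows, cols, bands) cube into contiguous
    subsets, cutting wherever adjacent-band similarity falls below eta.
    Correlation is a stand-in for the SSIM criterion of Equation (4)."""
    bands = hsi.shape[2]
    flat = hsi.reshape(-1, bands)
    cuts = [0]
    for b in range(1, bands):
        sim = np.corrcoef(flat[:, b - 1], flat[:, b])[0, 1]
        if sim < eta:
            cuts.append(b)
    cuts.append(bands)
    return [list(range(cuts[i], cuts[i + 1])) for i in range(len(cuts) - 1)]

# Bands 0-2 share one spatial pattern, bands 3-4 another, so one cut is made.
rng = np.random.default_rng(0)
base, other = rng.standard_normal((16, 16)), rng.standard_normal((16, 16))
cube = np.stack([base, base + 0.01 * rng.standard_normal((16, 16)), base,
                 other, other], axis=-1)
print(segment_bands(cube, eta=0.45))  # [[0, 1, 2], [3, 4]]
```

With η = 1 every adjacent pair falls below the threshold only if perfectly dissimilar bands never occur, so in practice a large η produces many small subsets and η near 0 produces a single subset, matching the trade-off discussed above.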

Computational Complexity
The number of band subsets, C, is determined only by the parameter η. For an HSI with C band-subsets, the computational complexity of one iteration is approximately O(∑_{c=1}^{C} [l_x l_y l_c (K_c + M) + K_c M]). The values of l_x, l_y and M are fixed in our experiments, so the computational complexity of the proposed method is determined by the parameters l_c and K_c. To simplify the analysis, we consider the special case in which the number of dictionary atoms, K_c, is set to the same value K' for all band-subsets. The computational complexity can then be rewritten as O(l_x l_y (K' + M)(l_1 + ... + l_C) + CK'M). Note that (l_1 + ... + l_C) is the number of bands in the HSI and M is much larger than K', so the computational complexity is mainly dominated by the term CK'M. Due to the strong spectral correlations, C is usually smaller than 10, as shown in Figure 2. When C changes from 1 to 10, the empirical value of K' decreases rapidly, and the value of CK'M shows a downward trend, which implies a lower computational complexity.
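The algebraic simplification of the per-iteration cost under K_c = K' can be sanity-checked numerically; the symbol values below are arbitrary placeholders, not the experimental settings.

```python
# With K_c = K' for every band-subset,
#   sum_c [l_x * l_y * l_c * (K' + M) + K' * M]
# collapses to
#   l_x * l_y * (K' + M) * (l_1 + ... + l_C) + C * K' * M.
lx, ly, M, Kp = 5, 7, 400, 32   # arbitrary patch size, patch count, atoms
l = [10, 20, 15]                # hypothetical band counts per subset
C = len(l)

per_subset = sum(lx * ly * lc * (Kp + M) + Kp * M for lc in l)
collapsed = lx * ly * (Kp + M) * sum(l) + C * Kp * M
print(per_subset == collapsed)  # True
```

The identity holds because the (K' + M) factor is common to every subset, so the per-subset band counts simply sum.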

The Sparse Noise Term
To evaluate the effectiveness of the sparse noise term, we compared the restored images obtained by the proposed method with the sparse noise term disabled and enabled. The Urban dataset is applied for the simulation, and the variant with the sparse noise term disabled is denoted SNT-DIS. Figure 23 shows the visual results for band 133 obtained by SNT-DIS and the proposed method. Compared with the corrupted image in Figure 23a, both algorithms greatly improve the image quality, as displayed in Figure 23b,c. However, SNT-DIS removes only a small part of the stripes, as presented in Figure 23b, and it also fails to preserve fine objects. As presented in Figure 23c, the proposed method effectively removes heavy stripes and mixed noise while preserving local details such as edges and textures. The clear superiority of the proposed method demonstrates that the denoising performance can be greatly improved by introducing the sparse noise term.

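To give intuition for this comparison, the following toy sketch contrasts a Gaussian-only smoother with a variant that first estimates a sparse per-column component on synthetic striped data. This is not the Bayesian Beta-Bernoulli model of this paper, merely a minimal median-based illustration, under arbitrary sizes and thresholds, of why modeling sparse noise separately helps remove stripes that Gaussian-style smoothing leaves behind.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic band: smooth ramp + mild Gaussian noise + sparse column stripes.
h, w = 64, 64
clean = np.tile(np.linspace(0.0, 1.0, w), (h, 1))
stripe_cols = [5, 20, 40]                  # hypothetical stripe locations
noisy = clean + 0.02 * rng.standard_normal((h, w))
noisy[:, stripe_cols] += 0.5               # strong, sparse column artifacts

def column_mean_smooth(img):
    """Crude Gaussian-noise reducer: 3-column moving average."""
    padded = np.pad(img, ((0, 0), (1, 1)), mode="edge")
    return (padded[:, :-2] + padded[:, 1:-1] + padded[:, 2:]) / 3.0

# Baseline in the spirit of SNT-DIS: treat everything as Gaussian noise.
baseline = column_mean_smooth(noisy)

# Sparse-aware variant: robustly estimate a per-column sparse offset first.
col_median = np.median(noisy, axis=0)      # robust per-column profile
padded = np.pad(col_median, 2, mode="edge")
neighbors = np.stack([padded[2 + s: 2 + s + w] for s in (-2, -1, 1, 2)])
expected = np.median(neighbors, axis=0)    # what each column "should" be
sparse_est = np.where(np.abs(col_median - expected) > 0.2,
                      col_median - expected, 0.0)
destriped = column_mean_smooth(noisy - sparse_est[np.newaxis, :])

err_baseline = np.abs(baseline - clean).mean()
err_sparse = np.abs(destriped - clean).mean()
print(f"mean abs error, Gaussian-only smoothing: {err_baseline:.4f}")
print(f"mean abs error, with sparse estimate:    {err_sparse:.4f}")
```

In this toy setting the sparse estimate is nonzero exactly at the stripe columns, and removing it before smoothing yields a clearly lower reconstruction error, echoing the SNT-DIS comparison above.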

Conclusions
In this paper, a novel hierarchical framework combining structural correlations in the spectral domain with sparse learning is developed for HSI denoising. First, with the suggested band-subset partition, spectral-spatial cubic patches can be effectively extracted, where each patch has strong structural correlations between adjacent bands and high spatial similarity within the cube. Second, noise-free estimates for each band-subset are generated by solving a sparse problem constructed via a Beta-Bernoulli process. A Gaussian process with a Gamma distribution is placed as the prior on the dictionary elements to fully exploit the spatial consistency in HSI. By utilizing spectral-spatial information and Gibbs sampling, the suggested framework can effectively learn the latent detail and structure information in HSI. It has the advantage of adaptively predicting the noise in a data-driven manner, which enables the framework to better characterize the noise present in HSI, and it can automatically infer the size and the atoms of the dictionary. Compared with KSVD, BM3D, ANLM3D, BM4D, LRMR and DDL3+FT, the proposed framework achieves superior performance both numerically and visually, which has been verified on simulated and real HSI. It can suppress various noises in HSI, such as Gaussian noise, Poisson noise, dead pixel lines, stripes, or a mixture of them, with better image-quality improvement and structure preservation. Further research will be directed toward exploiting the appropriate hidden structure and reducing the computational time for the analysis of HSI.