Deep Nonnegative Dictionary Factorization for Hyperspectral Unmixing

As a powerful blind source separation tool, Nonnegative Matrix Factorization (NMF) with effective regularizations has shown significant superiority in spectral unmixing of hyperspectral remote sensing images (HSIs) owing to its good physical interpretability and data adaptability. However, the majority of existing NMF-based spectral unmixing methods adopt only single-layer factorization, which is unfavorable for exploiting the complex and structured representation relationships of endmembers implied in HSIs. To overcome this issue, we propose a novel two-stage Deep Nonnegative Dictionary Factorization (DNDF) approach with a sparseness constraint and self-supervised regularization for HSI unmixing. Beyond simply extending one-layer factorization to multiple layers, DNDF employs fuzzy clustering to tackle the mixed endmembers of HSIs. Moreover, self-supervised regularization is integrated into the DNDF model to impose an effective constraint on the endmember matrix. Experimental results on three real HSIs demonstrate the superiority of DNDF over several state-of-the-art methods.


Introduction
Hyperspectral remote sensing images (HSIs) capture hundreds of narrow spectral channels, providing detailed spectral information and spatial distribution status about the target ground objects. With their rich spectral information for distinguishing different objects, HSIs have been successfully applied in many areas, such as segmentation [1], classification [2], target detection [3], tracking [4], and recognition [5]. Due to the low spatial resolution of sensors and the complexity of the terrain, HSIs generally comprise so-called "mixed pixels" [6], in which the spectral information of different materials is present within a single pixel [7]. The large spectral variance of mixed pixels makes them difficult to interpret [8], negatively impacting the quantitative analysis of HSIs.
To tackle this problem, hyperspectral unmixing (HU) plays a fundamental role in extracting a set of constituent signatures (i.e., endmembers) from the target HSIs and estimating their corresponding proportions (i.e., abundances) in each mixed pixel [6]. Nonnegative Matrix Factorization (NMF), as an effective tool for blind source separation, has drawn much attention in HU applications owing to its good interpretability and data adaptability [9,10]. To improve the unmixing performance of basic NMF, introducing effective constraints based on problem-dependent knowledge has proven to be an effective strategy, such as the manifold constraint [11], sparseness constraint [12], low-rank constraint [13], smoothness constraint [14], as well as the local spectral similarity preserving constraint [15]. Unfortunately, the majority of existing NMF-based unmixing methods adopt only single-layer factorization, which limits their ability to learn the hierarchical structure information of complex and mixed HSI data. Recently, inspired by the success of deep learning, deep NMF has been developed and has shown promising unmixing performance owing to its multi-layer learning mechanism [16][17][18][19]. However, existing deep NMF-based methods commonly focus on factorizing the coefficient matrix to explore the abstract features of the data [17], which is unfavorable for efficiently utilizing the complex hierarchical and multi-layer structured representation information between the endmembers and the mixed pixels included in HSIs.
To overcome the above issues, we propose a novel Deep Nonnegative Dictionary Factorization (DNDF) model in this paper, in which single-layer NMF is applied in succession to decompose the learned dictionary layer by layer in a pre-training stage, and a fine-tuning stage is then employed to further optimize all the matrix factors obtained in the previous stage by reconstructing the original data via the product of these factors. As a result, a set of dictionaries that can be mutually represented in a hierarchical relationship is learned in a deep learning fashion, by which the multi-layer structured representation property of HSIs is expected to be fully utilized to accurately extract the target endmembers. In addition, we introduce an L_{1/2} sparseness regularization and a self-supervised regularization into the DNDF model. Specifically, the sparseness regularization enforces the sparseness of the coefficient matrices during each factorization, while the self-supervised regularization keeps the atoms of the dictionaries similar to the approximate endmember signatures learned by conducting fuzzy C-means clustering on the target HSI. The performance of the proposed method is evaluated on three real HSI datasets, where our DNDF demonstrates its effectiveness and superiority over several state-of-the-art algorithms. For the sake of clarity, the major contributions of this paper are highlighted as follows.

•
We propose a novel Deep Nonnegative Dictionary Factorization method for hyperspectral unmixing. By decomposing the learned dictionary layer by layer in our method, the multi-layer structured representation information implied in HSIs is expected to be captured and used to more accurately extract the target endmembers.

•
Our method incorporates the L_{1/2} sparseness constraint and self-supervised regularization into the deep factorization framework, by which the abundance sparseness prior, the sparse-representation property of dictionaries, as well as the endmember signature information present in the data are jointly exploited to guide the unmixing.

•
Our method consists of two stages: the pre-training stage is designed to decompose the learned basis matrix layer by layer, and the fine-tuning stage is implemented to further optimize all the matrix factors obtained in the previous stage by reconstructing the original data via the product of these factors. Consequently, a set of dictionaries that capture the hierarchical representation relationship of the data is learned in a deep learning fashion.

Related Work
Here we briefly review HU methods based on constrained NMF and deep NMF. Then, the differences between our method and the existing methods are highlighted.
NMF has been demonstrated to be a promising method for HU under the linear spectral mixture model (LMM) [6]. It decomposes a nonnegative HSI matrix into a product of nonnegative endmember and abundance matrices, such that each pixel in an HSI can be represented as a linear combination of the endmember spectra in the endmember matrix, weighted by its fractional abundances in the abundance matrix [20]. This enables NMF not only to simultaneously estimate the endmembers and the corresponding abundances, but also to ensure the nonnegativity of the estimates, so that the unmixing results have good interpretability [12]. However, basic NMF is prone to getting trapped in local minima and also suffers from the problem of non-unique solutions [21].
In order to enhance the performance of basic NMF for HU, many improvements have been made by developing effective regularizations for NMF, such as geometry-based and sparseness-based regularizations. For example, minimum volume constrained NMF (MVCNMF) [22] introduces a geometry-based regularization in an attempt to minimize the volume of the simplex formed by the candidate endmembers. Aiming to exploit the sparse prior of abundances, L_{1/2} sparsity-constrained NMF (L_{1/2}-NMF) [20] employs an effective L_{1/2} regularizer. On this basis, graph regularized L_{1/2}-NMF (GLNMF) [23] introduces a graph structure-based regularization that leverages the intrinsic manifold structure of HSIs. Recently, spatial group sparsity regularized NMF (SGSNMF) was presented, utilizing spatially group-structured prior information of HSIs [24]. Although existing works have improved the performance of NMF-based unmixing methods, these methods adopt only single-layer factorization, which is unfavorable for efficiently utilizing the multi-layer structured information implied in HSIs [17].
Recently, deep NMF-based methods have shown their potential for unmixing [19,25,26]. Unlike single-layer NMF, which decomposes the original data into the product of two matrix factors, deep NMF learns hierarchical and abstract features of the original data by factorizing the coefficient matrix layer by layer [16]. As a result, more complex abstract features of the data can be captured and used to improve unmixing performance [19]. However, existing deep NMF-based unmixing methods mainly focus on factorizing the coefficient matrix, neglecting the multi-layer representation relationship among basis matrices.
Unlike these methods, we propose a two-stage deep dictionary factorization to effectively exploit the hierarchical representation relationship of dictionaries in a deep learning fashion, and design effective regularizations for the hyperspectral unmixing purpose, by which the multi-layer structured information included in HSIs, the abundance sparsity property of the data, as well as the endmember signature information implied in the data are expected to be jointly utilized to more accurately estimate the endmembers and abundances.

Notation and Problem Definition
In this paper, we use boldface uppercase letters to represent matrices, and boldface lowercase letters to denote vectors. For a matrix X ∈ R^{m×n}, X^T denotes the transpose of X, the boldface lowercase x_n stands for the n-th column vector of X, and [X]_{mn} represents the entry of X in the m-th row and n-th column. We use ||X||_F to denote the Frobenius norm of X. X ≥ 0 signifies that each entry of X is equal to or larger than zero. For l ∈ N, we use I_l to denote the l × l identity matrix, and 1_l to represent the all-one l × l matrix. The symbols ⊗ and ⊘ are used to represent the element-wise multiplication and division operations on matrices, respectively.
Logically, an HSI can be regarded as an image cube X ∈ R^{M×C×B}, where B denotes the number of bands involved in each pixel, whilst M and C denote the number of rows and columns of pixels, respectively. For computational convenience, this cube can be represented in matrix form as Y = [y_1, y_2, ..., y_N], where Y ∈ R^{B×N}, N = M × C, and y_n ∈ R^B denotes the n-th pixel spectrum. According to the LMM, given that Y is known, the HU problem can be solved from the perspective of blind source separation, by which Y is approximately reconstructed as Y ≈ AS, where A = [a_1, a_2, ..., a_K] ∈ R^{B×K} is the nonnegative endmember matrix with each column a_k denoting an endmember, and S = [s_1, s_2, ..., s_N] ∈ R^{K×N} represents the nonnegative abundance matrix with s_n = (s_{n1}, s_{n2}, ..., s_{nK})^T standing for the abundance vector of the pixel spectrum y_n. In other words, we need to estimate both A and S given that only Y is known in actual HU applications.
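As a concrete illustration of the LMM setup above, the following numpy sketch builds a synthetic mixed-pixel matrix Y from a nonnegative endmember matrix A and sum-to-one abundances S; all dimensions and data are illustrative, not taken from any real HSI.

```python
import numpy as np

# Illustrative sketch of the linear mixture model Y ~= A S from the
# problem definition. All sizes and random data are made up.
rng = np.random.default_rng(0)
B, K, N = 50, 3, 100               # bands, endmembers, pixels

A = rng.random((B, K))             # nonnegative endmember matrix (B x K)
S = rng.random((K, N))
S /= S.sum(axis=0, keepdims=True)  # abundances: nonnegative, sum-to-one columns

Y = A @ S                          # mixed-pixel matrix (B x N)
assert Y.shape == (B, N) and (Y >= 0).all()
```

In an actual HU application only Y would be observed, and A and S would be the unknowns to estimate.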

DNDF Model
The unmixing performance of constrained NMF-based methods is limited by their single-layer factorization ability. Although existing deep NMF-based unmixing methods have overcome this issue to some extent by decomposing the coefficient matrix in succession, this manner of factorization is unfavorable for exploiting the complex multi-layer representation relationship between the endmembers and the mixed pixels included in HSIs. To deal with this challenge, the proposed DNDF model decomposes the basis matrix layer by layer instead of the coefficient matrix. Specifically, the data matrix Y is first factorized into A_1 and S_1, then A_1 is further decomposed into A_2 and S_2, and the factorization procedure continues until the L-th layer is reached. This can be formally expressed as

A_{l−1} ≈ A_l S_l, l = 1, 2, ..., L, with A_0 = Y, (1)

where A_l and S_l respectively represent the basis and coefficient matrices in the l-th layer. Based on this multi-layer factorization schema, the data matrix Y can be approximately reconstructed by

Y ≈ A_L S_L S_{L−1} ⋯ S_1, (2)

where A_L is regarded as the endmember matrix, and S_L S_{L−1} ⋯ S_1 is used to generate the abundances. As shown in Figure 1, our proposed DNDF unmixing model consists of two parts: a pre-training model and a fine-tuning model. With the pre-training model, we aim to learn multi-layer dictionaries by decomposing the basis matrix layer by layer, so that the endmembers can be better learned from the dictionaries extracted from the HSI data in a deep learning fashion. Under the guidance of the fine-tuning model, the matrix factors obtained by the previous model are expected to be optimized jointly by reconstructing the original data as faithfully as possible.
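The layer-by-layer factorization schema described above can be sketched as follows, using a plain multiplicative-update NMF as a stand-in for the paper's regularized updates; the layer sizes, iteration counts, and random data are illustrative assumptions.

```python
import numpy as np

# Sketch of the pre-training schema: factorize the basis matrix layer
# by layer (Y ~= A1 S1, A1 ~= A2 S2, ...), then reconstruct
# Y ~= A_L S_L ... S_1. Plain (unregularized) MUR-NMF is used here
# only as a stand-in for the paper's constrained updates.
def nmf(X, k, iters=200, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.random((X.shape[0], k))
    S = rng.random((k, X.shape[1]))
    for _ in range(iters):
        S *= (A.T @ X) / (A.T @ A @ S + eps)   # standard MUR for S
        A *= (X @ S.T) / (A @ S @ S.T + eps)   # standard MUR for A
    return A, S

Y = np.abs(np.random.default_rng(1).random((60, 200)))
layer_sizes = [10, 5, 3]          # K_1 > K_2 > K_3 (final endmember count)

A_prev, S_layers = Y, []
for k in layer_sizes:
    A_prev, S = nmf(A_prev, k)    # decompose the current basis matrix
    S_layers.append(S)            # S_layers = [S_1, S_2, S_3]

# Reconstruction Y ~= A_L (S_L ... S_1)
S_total = S_layers[0]
for S in S_layers[1:]:
    S_total = S @ S_total         # builds S_L S_{L-1} ... S_1
Y_hat = A_prev @ S_total
```

Note how the basis matrix (not the coefficient matrix, as in prior deep NMF) is what gets decomposed at each layer, which is the distinguishing design choice of DNDF.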
Formally, the objective functions of the pre-training stage, Q_p, and the fine-tuning stage, Q_f, can be respectively expressed as

Q_p = (1/2)||A_{l−1} − A_l S_l||_F^2 + λ J_ss(A_l, E_l) + β ||S_l||_{1/2}, (3)

Q_f = (1/2)||Y − A_L S_L S_{L−1} ⋯ S_1||_F^2 + λ J_ss(A_L, E_L) + β Σ_{l=1}^{L} ||S_l||_{1/2}, (4)

where Y denotes the known HSI matrix; A_{l−1} is the basis matrix of the (l−1)-th layer, with A_{l−1} = Y in the case of l = 1, while A_l and S_l represent the basis and coefficient matrices of the l-th layer, respectively; J_ss(·,·) denotes the self-supervised regularization term; and λ and β are regularization parameters. Specifically, the first term in Q_p plays the same role as single-layer NMF in decomposing the dictionary A_{l−1}, whereas the first term in Q_f reconstructs Y using the product of the matrix factors learned based on Q_p. The second and third terms in both Q_p and Q_f are the self-supervised regularization and the sparseness regularization, respectively. Since weakly supervised information about the basis matrix helps improve the performance of NMF [27], we employ a self-supervised fashion to learn effective weakly supervised information as guidance for extracting accurate endmembers. Specifically, before each factorization in layer l, where l = 1, 2, ..., L, the Fuzzy C-means (FCM) clustering method is conducted on Y, and the obtained cluster centers are used to construct the approximate endmember matrix E_l, which functions as the weakly supervised information for the endmember matrix. Note that the number of clusters of FCM equals the number of column vectors in the basis matrix of the l-th layer; E_l = [e_1, e_2, ..., e_{K_l}] denotes the approximate endmember matrix, with each e_k, k = 1, 2, ..., K_l, representing a cluster center given by FCM, whereas Ā_k and Ē_k are the auxiliary matrices entering the denominator of J_ss. It is worth pointing out that the numerator of the proposed self-supervised regularization term keeps the learned A_l as similar as possible to the approximate endmember matrix E_l, whilst the denominator makes each estimated endmember a_k maximally different from the cluster centers other than e_k.
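A minimal fuzzy C-means routine (numpy only) suffices to illustrate how the approximate endmember matrix E_l can be obtained from cluster centers; the fuzzifier m, iteration count, and random initialization below are illustrative choices, not the paper's exact settings, and a library implementation could be substituted.

```python
import numpy as np

# Minimal FCM sketch: cluster the pixel spectra (columns of Y) and
# return the cluster centers as the approximate endmember matrix E.
# Hyperparameters here are assumptions for illustration only.
def fcm_centers(Y, K, m=2.0, iters=50, seed=0, eps=1e-9):
    # Y: (B, N) pixel spectra in columns; returns E: (B, K)
    rng = np.random.default_rng(seed)
    U = rng.random((K, Y.shape[1]))
    U /= U.sum(axis=0, keepdims=True)            # fuzzy memberships, cols sum to 1
    for _ in range(iters):
        Um = U ** m
        E = (Y @ Um.T) / (Um.sum(axis=1) + eps)  # membership-weighted centers
        # distances from every pixel to every center: shape (K, N)
        d = np.linalg.norm(Y[:, None, :] - E[:, :, None], axis=0) + eps
        U = d ** (-2.0 / (m - 1.0))              # standard FCM membership update
        U /= U.sum(axis=0, keepdims=True)
    return E

Y = np.random.default_rng(2).random((20, 300))
E = fcm_centers(Y, K=4)          # K = number of atoms in the layer's basis matrix
assert E.shape == (20, 4)
```

In DNDF, K would be set to the number of column vectors in the basis matrix of the current layer, and E would serve as the weakly supervised target E_l.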
To learn a sparse representation between dictionaries, a sparsity constraint is also imposed on the coefficient matrix in the factorization of each layer. The effectiveness of this measure lies in the fact that the sparsity of the abundance distribution is an intrinsic property of HSIs [20,23,24], and that sparse representation drives the learned endmember signatures closer to the real endmembers. Previous studies have demonstrated that L_{1/2} regularization is superior at enforcing the sparseness of the decomposed abundance matrix. Thus, the L_{1/2} regularizer ||S||_{1/2} = Σ_{i,j} (s_{ij})^{1/2} is introduced into Q_p and Q_f to impose the sparsity constraint on S_l.

Figure 1. Illustration of the proposed deep nonnegative dictionary factorization model, which includes two parts: the pre-training and fine-tuning stages. In the pre-training stage, we aim to learn a set of dictionaries A_1, ..., A_L with a hierarchical representation relationship by decomposing the basis matrix layer by layer, so that endmembers can be better learned from the dictionaries extracted from the HSI data in a deep learning fashion. Under the guidance of the fine-tuning stage, the matrix factors A_L, S_L, ..., S_1 obtained in the previous stage are further optimized jointly by reconstructing the HSI matrix Y as faithfully as possible.
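The L_{1/2} regularizer can be evaluated directly from its definition; the toy matrices below merely illustrate why minimizing it promotes sparsity.

```python
import numpy as np

# The L_{1/2} regularizer used for the sparseness constraint,
# ||S||_{1/2} = sum_{i,j} s_ij^{1/2}, computed from its definition.
def l_half(S):
    return np.sqrt(np.maximum(S, 0.0)).sum()

S_sparse = np.array([[1.0, 0.0],
                     [0.0, 1.0]])
S_dense  = np.array([[0.5, 0.5],
                     [0.5, 0.5]])

# Sparser matrices with the same total mass score lower, which is why
# minimizing this term drives the abundances toward sparsity.
assert l_half(S_sparse) < l_half(S_dense)
```

Here l_half(S_sparse) = 2 while l_half(S_dense) = 4·√0.5 ≈ 2.83, so the penalty indeed favors the sparse solution.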

Optimization and Implementation
Since the objective functions in Equations (3) and (4) are both non-convex w.r.t. two or more matrix factors simultaneously, we deduce the multiplicative update rules (MURs) of the proposed unmixing model by alternating optimization, in which each matrix factor is updated with the others fixed.
(i) MURs for the pre-training stage. Considering the objective function Q_p in Equation (3), when A_l remains unchanged, the MUR of S_l can be given via [20] as

S_l ← S_l ⊗ (A_l^T A_{l−1}) ⊘ (A_l^T A_l S_l + (β/2) S_l^{−1/2}), (5)

where the power S_l^{−1/2} is taken element-wise. As for the MUR of A_l, we first give the gradient of Q_p w.r.t. A_l:

∇_{A_l} Q_p = A_l S_l S_l^T − A_{l−1} S_l^T + λ (H(A_l, E_l) − G(A_l, E_l)),

where the matrix W = 1_{K_l} − I_{K_l} enters the self-supervised term, and the functions G and H respectively collect the nonnegative numerator and denominator components of the gradient of that term. Based on the gradient descent principle, the additive update rule for A_l can be given by

A_l ← A_l − η ⊗ ∇_{A_l} Q_p. (6)

Setting η = A_l ⊘ (A_l S_l S_l^T + λH(A_l, E_l)) in Equation (6), the MUR for A_l can be deduced as

A_l ← A_l ⊗ (A_{l−1} S_l^T + λG(A_l, E_l)) ⊘ (A_l S_l S_l^T + λH(A_l, E_l)). (7)

(ii) MURs for the fine-tuning stage. For the objective function Q_f in Equation (4), when S_L, ..., S_1 are kept fixed and regarded as a whole matrix S̃ = S_L S_{L−1} ⋯ S_1, the MUR of A_L can be obtained following the deduction procedure of the MUR in Equation (7). It can be expressed as

A_L ← A_L ⊗ (Y S̃^T + λG(A_L, E_L)) ⊘ (A_L S̃ S̃^T + λH(A_L, E_L)). (8)

Similarly, when Θ = A_L S_L S_{L−1} ⋯ S_2 is considered as a fixed matrix, the MUR of S_1 can be given as

S_1 ← S_1 ⊗ (Θ^T Y) ⊘ (Θ^T Θ S_1 + (β/2) S_1^{−1/2}). (9)

For the derivation of the MUR of S_l, where l = 2, 3, ..., L, Q_f in Equation (4) can be re-expressed with Φ = A_L S_L ⋯ S_{l+1} and Ψ = S_{l−1} ⋯ S_1, so that its data fidelity term becomes (1/2)||Y − Φ S_l Ψ||_F^2. (10) The additive update for S_l can be given as

S_l ← S_l − η ⊗ ∇_{S_l} Q_f, (11)

and the gradient of Q_f w.r.t. S_l can be expressed as

∇_{S_l} Q_f = Φ^T Φ S_l Ψ Ψ^T − Φ^T Y Ψ^T + (β/2) S_l^{−1/2}. (12)

By substituting Equation (12) into Equation (11) and setting η = S_l ⊘ (Φ^T Φ S_l Ψ Ψ^T + (β/2) S_l^{−1/2}), the MUR of S_l can be derived as

S_l ← S_l ⊗ (Φ^T Y Ψ^T) ⊘ (Φ^T Φ S_l Ψ Ψ^T + (β/2) S_l^{−1/2}). (13)

Based on the above analysis, the pseudo-code of the proposed DNDF algorithm for HU is summarized in Algorithm 1. It mainly consists of two parts: the pre-training part (Lines 2-10) and the fine-tuning part (Lines 12-18). In the pre-training part, A_l is initialized from the cluster centers given by the FCM method, while the initial values of S_l are provided by the fully constrained least squares (FCLS) algorithm [28]. The fine-tuning part alternately updates A_L by Equation (8), S_l (l = 2, ..., L) by Equation (13), and S_1 by Equation (9) until the stopping criteria are satisfied, and finally returns A = A_L and S = S_L ⋯ S_1.
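The pre-training updates for a single layer can be sketched as below. This is a hedged sketch that keeps only the data-fidelity and L_{1/2} terms (the self-supervised term, which requires the FCM centers and the G/H split, is omitted), and beta and the iteration count are illustrative assumptions.

```python
import numpy as np

# Simplified single-layer MUR sketch: minimizes
# (1/2)||X - A S||_F^2 + beta ||S||_{1/2}, i.e. the pre-training
# objective without the self-supervised term. Illustrative only.
def mur_layer(X, k, beta=0.1, iters=300, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.random((X.shape[0], k))
    S = rng.random((k, X.shape[1]))
    for _ in range(iters):
        # S update: S <- S * (A^T X) / (A^T A S + (beta/2) S^{-1/2}),
        # with eps added to keep the element-wise power finite
        S *= (A.T @ X) / (A.T @ A @ S + 0.5 * beta * (S + eps) ** -0.5 + eps)
        # A update without the self-supervised gradient terms G and H
        A *= (X @ S.T) / (A @ S @ S.T + eps)
    return A, S

X = np.abs(np.random.default_rng(3).random((30, 80)))
A, S = mur_layer(X, k=4)
assert (A >= 0).all() and (S >= 0).all()   # nonnegativity is preserved
```

Because every factor is updated by multiplying with a nonnegative ratio, nonnegativity of A and S is maintained automatically, which is the key property of MURs that the full DNDF algorithm also relies on.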

Experiment
Here, we first briefly introduce the datasets used in the experiments, the baseline methods and algorithm settings, as well as the performance metrics. Then, the quantitative evaluation, factor exploration, and parameter analysis are addressed.

Experiment Setting
Datasets. Table 1 summarizes the statistics of the three real datasets used in the experiments, and Figure 2 shows their pseudo-color images. (1) Samson comprises 952 × 952 pixels and has 156 bands covering wavelengths from 0.401 to 0.889 µm. For computational convenience, a subimage of 95 × 95 pixels containing three types of endmembers (Soil, Tree, and Water) is chosen from the original data and used in the experiment [19,29]. (2) Jasper Ridge was collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor. The original data has 512 × 614 pixels, each recorded with 224 bands over wavelengths ranging from 0.380 to 2.5 µm. Following the setting of [30], a subimage with 100 × 100 pixels is adopted. The target scene mainly has four kinds of endmembers: Soil, Water, Road, and Tree. To eliminate the impact of dense water vapor and atmospheric effects, spectral bands 1-3, 108-112, 154-166 and 220-224 are removed from the data, leaving 198 bands. (3) Washington DC Mall was captured by the airborne HYperspectral Digital Imagery Collection Experiment (HYDICE) sensor over the Washington DC Mall area, USA.
The original data consists of 210 spectral bands with wavelengths from 0.4 to 2.5 µm. A subimage with 150 × 150 pixels from the original HSI is used in the experiment [31]. There are five types of endmembers in the selected subimage: Grass, Tree, Street, Roof, and Water. We removed the bands seriously corrupted by noise, as well as the water vapor absorption bands (bands 103-106, 138-148 and 207-210), leaving 191 bands for use.
Baseline Methods. We compare the proposed DNDF with four kinds of NMF-based unmixing methods: (1) two sparsity-constrained NMF-based algorithms, L_{1/2}-NMF [20] and SGSNMF [24]; (2) an NMF-based method with a spectral geometry-related constraint, i.e., MVCNMF [22]; (3) a representative graph-regularization-based NMF, GLNMF [23]; and (4) a state-of-the-art deep NMF-based unmixing approach, i.e., sparsity-constrained deep NMF with total variation (SDNMF-TV) [19]. The algorithm parameters for all baseline methods adopt the same values as in their original works. To make a fair comparison, all the algorithms initialize the endmember and abundance matrices by vertex component analysis (VCA) [32] and FCLS [28], respectively, except for SGSNMF, which initializes the endmember matrix with the help of superpixel segmentation. Additionally, we test our method with the number of factorization layers set to 1 and 2, and refer to the resulting algorithms as DNDF-L1 and DNDF-L2, respectively. The maximum number of iterations is set to 500 for the pre-training stage and 4000 for the fine-tuning stage. We set the regularization parameters to the empirical values λ = 0.1 and β = 0.1. All the algorithms are run 20 times and the average performance is reported.
Performance Metrics. To evaluate the performance of our method, the Spectral Angle Distance (SAD) [19,20], which measures the similarity between an estimated endmember and the reference one, is adopted. Given that â_k and a_k respectively denote the estimated and the reference endmember, SAD is defined as

SAD(â_k, a_k) = arccos( (â_k^T a_k) / (||â_k|| ||a_k||) ),

where ||·|| denotes the Euclidean norm of a vector. Since SAD represents the angular distance between two vectors, a smaller SAD value indicates better performance. According to its definition, the SAD measure has the unit of radians.
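The SAD metric can be computed directly from its definition; note that, being an angle, it is invariant to the scaling of either spectrum.

```python
import numpy as np

# Spectral Angle Distance between an estimated endmember and its
# reference, in radians, following the arccos-of-normalized-inner-
# product definition given above.
def sad(a_hat, a):
    cos = np.dot(a_hat, a) / (np.linalg.norm(a_hat) * np.linalg.norm(a))
    # clip guards against tiny floating-point excursions outside [-1, 1]
    return np.arccos(np.clip(cos, -1.0, 1.0))

a = np.array([1.0, 2.0, 3.0])
assert np.isclose(sad(a, 2.0 * a), 0.0)    # scale-invariant: angle is zero
```

Scale invariance is what makes SAD well suited to unmixing evaluation, since NMF-based methods recover endmembers only up to a positive scaling.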

Performance Evaluation
To test the performance of our method, three experiments are conducted on different datasets. The performance evaluation is presented from three aspects: quantitative evaluation, factor exploration, and algorithm parameter analysis.
Quantitative Evaluation. The quantitative comparisons on the three datasets are reported in Table 2, including the SAD values for each endmember and the averages over all endmembers for each algorithm. Each row corresponding to an individual endmember shows the performance of all methods w.r.t. that endmember, whereas the rows labeled "Average" give the average SAD values over all endmembers of a specific dataset. Red values denote the best performance, whilst blue values indicate the second best. As the average SAD values show, the proposed DNDF-L2 performs best on all three datasets. It reduces the mean SAD by 9.9% on the Samson dataset compared with the second-best method, i.e., DNDF-L1. On the Jasper Ridge and Washington DC Mall datasets, DNDF-L2 decreases the mean SAD by 9.7% and 8.6%, respectively, compared with the second-best method, i.e., GLNMF. As for the performance on individual endmembers, DNDF-L2 has the largest number of second-best instances on Samson, and provides one best and one second-best instance among the four endmembers on Jasper Ridge. On Washington DC Mall, DNDF-L2 shows its superiority by giving one best and two second-best instances among the five endmembers.

Table 2. Performance comparison w.r.t. the metric of spectral angle distance (SAD) on three datasets (in radians), where a smaller SAD value indicates better performance, signifying better similarity between the estimated endmember and the reference one. Red bold and blue italic denote the best and second-best results, respectively.

Factor Exploration. To provide an intuitive evaluation of DNDF, we present some visual results. Figure 3 shows an example of the hierarchical dictionaries learned on Samson, given by the basis matrices A_1 and A_2 obtained in the pre-training stage of DNDF-L2.
The atoms of the dictionary learned by the factorization of the first layer are plotted in Figure 3a-c according to their categories, whereas Figure 3d-f show the atoms of the dictionary of the second layer. Here, the red solid lines denote the reference endmember signatures while the dotted lines represent the atoms. As can be seen in Figure 3a-c, the learned atoms have good similarity to the reference signatures, except for one learned signature shown in yellow in Figure 3c. It can also be seen from Figure 3d-f that the learned atoms have good consistency with the corresponding reference signatures, revealing the effectiveness of our method. In order to qualitatively interpret the abundance estimations, we illustrate some instances of the abundance maps unmixed by all the methods in Figure 4. All the abundance maps are represented as heatmaps, where pixels with warmer colors denote larger presence proportions of the corresponding endmember.

Next, we investigate the effectiveness of each term in our proposed model. Table 3 shows the performance comparison on the three datasets, where DNDF denotes the proposed method without any regularization, while DNDF-SP, DNDF-SS, and DNDF-L2 represent the proposed method with the sparse regularization, the self-supervised regularization, and both regularizations, respectively. We can see that the integrated model outperforms its partial variants, which demonstrates the effectiveness of the joint work of the different terms.
Parameter Analysis. Here, we analyze the impact of the two algorithm parameters. We first fix β = 0.1 and choose values for λ from the set {1 × 10^{-4}, 1 × 10^{-3}, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5}. Figure 5a shows the performance of DNDF-L2 on the Jasper Ridge dataset. As can be seen, the performance of our algorithm improves at first, achieving better SAD values when λ ∈ [0.01, 0.2], and degrades when λ ≥ 0.3; λ = 0.1 is the optimal selection according to the SAD criterion. Next, the impact of β is tested with λ fixed to 0.1, selecting values for β from the same collection as λ. As Figure 5b shows, our algorithm performs better when β ∈ [0.01, 0.3], and β = 0.1 is the best choice.
Running Time. To evaluate the running time of our methods, we tested all seven methods on the three datasets. Their average running times are reported in Table 4. From Table 4, it can be seen that DNDF-L2 needs less time than SDNMF-TV and MVCNMF on all the datasets. On the Samson dataset, the running time of DNDF-L1 is the least, and SDNMF-TV needs the most time. On the Jasper Ridge dataset, L_{1/2}-NMF consumes the least time, followed by DNDF-L1. On the Washington DC Mall dataset, L_{1/2}-NMF performs best, and DNDF-L1 is faster than SGSNMF, DNDF-L2, SDNMF-TV, and MVCNMF.

Conclusions
We proposed a novel Deep Nonnegative Dictionary Factorization framework with an L_{1/2} sparseness constraint and self-supervised regularization for the hyperspectral unmixing task, which integrates the pre-training and fine-tuning stages into a unified model. By imposing sparse constraints on the coefficient matrices, DNDF jointly exploits the sparseness prior of abundances and the sparse-representation property of dictionaries. In addition, the self-supervised regularization, guided by fuzzy clustering, helps learn hierarchical dictionaries that capture the latent endmembers. Experiments conducted on three real-world datasets demonstrated the effectiveness and superiority of our approach compared with several state-of-the-art methods based on single-layer NMF and deep NMF.