Unsupervised and Supervised Feature Extraction Methods for Hyperspectral Images Based on Mixtures of Factor Analyzers

This paper proposes three feature extraction (FE) methods based on density estimation for hyperspectral images (HSIs). The methods are a mixture of factor analyzers (MFA), deep MFA (DMFA), and supervised MFA (SMFA). MFA extends the Gaussian mixture model to allow a low-dimensionality representation of the Gaussians. DMFA is a deep version of MFA consisting of a two-layer MFA, i.e., samples from the posterior distribution at the first layer are input to an MFA model at the second layer. SMFA consists of a single-layer MFA and exploits label information to extract features of the HSI effectively. Based on these three FE methods, the paper also proposes a framework that automatically extracts the most important features for classification from an HSI. The overall accuracy of a classifier is used to automatically choose the optimal number of features, and hence dimensionality reduction (DR) is performed before HSI classification. The performance of the MFA, DMFA, and SMFA FE methods is evaluated and compared with five different types of unsupervised and supervised FE methods using four real HSI datasets.


Introduction
Hyperspectral images (HSIs) provide abundant spectral information about a scene [1]. In general, an HSI contains hundreds of spectral bands with high spectral resolution [2][3][4]. This wealth of spectral information makes it possible to discriminate different materials within a scene by using a classifier [5][6][7][8]. However, the high dimensionality of HSIs makes processing computationally costly and memory-intensive. To achieve acceptable classification accuracy on images of high dimensionality, many conventional HSI processing methods require a large number of training samples [9][10][11]. This is known as the Hughes phenomenon, or the curse of dimensionality [12]. Thus, when the number of training samples is limited, there is a trade-off between classification accuracy and the number of spectral bands [13][14][15][16][17][18][19][20][21]. Dimensionality reduction (DR) is a very effective way to address this problem [22][23][24][25][26][27][28][29][30][31][32]. Dimensionality-reduced data should be a good representation of the original data. In addition, both the computing time and the number of training samples required decrease when the data dimensionality is lower. Therefore, DR is an important pre-processing step for HSI classification [33][34][35][36][37][38][39]. In general, DR can be divided into feature selection (FS) and feature extraction (FE). In this paper, we focus on FE. Several classical and novel statistical FE methods in the literature have been used in HSI processing. FE methods are either unsupervised or supervised. Principal component analysis (PCA) [40] is a classical unsupervised FE method. PCA projects the original data onto a lower-dimensional linear subspace of the original data space and can also be expressed as the maximum likelihood solution of a probabilistic latent variable model [41].
This reformulation of PCA is called probabilistic principal component analysis (PPCA) [42] and is an example of a linear-Gaussian framework, in which all of the marginal and conditional distributions are assumed to be Gaussian. Factor analysis (FA) [41,43] is also a linear-Gaussian latent variable model, closely related to PPCA. In FA, the conditional distribution of the observed variables given the latent variable has a diagonal rather than an isotropic covariance matrix. In addition to these classical unsupervised FE methods, several novel unsupervised FE methods exist in the literature, such as orthogonal total variation component analysis (OTVCA) [44], edge-preserving filtering [45], Gaussian pyramid based multi-scale feature extraction (MSFE) [46], sparse and smooth low-rank analysis (SSLRA) [47], etc. For supervised FE methods, the new features should contain the most discriminative information derived from the labeled samples. Several supervised FE methods exist, such as linear discriminant analysis (LDA) [48], nonparametric weighted feature extraction (NWFE) [49], manifold-learning based HSI feature extraction [50], low-rank representation with the ability to preserve the local pairwise constraints information (LRLPC) [51], etc. Supervised methods are usually better than unsupervised methods for HSI classification [52][53][54], since they have access to labeled data. However, their effectiveness depends on how well the labeled dataset represents the whole original dataset.
For both PPCA and FA, all the marginal and conditional distributions of the HSI are assumed to be Gaussian. However, in practice, most HSIs cannot be assumed to obey a Gaussian distribution. To overcome this problem, we propose the mixture of factor analyzers (MFA), deep MFA (DMFA), and supervised MFA (SMFA) FE methods for HSIs. We also propose an image segmentation method based on the Gaussian mixture model for MFA, DMFA, and SMFA to address the problem of a non-normal distribution. MFA assumes a low-dimensionality representation of the Gaussians in the Gaussian mixture model. DMFA consists of a two-layer MFA, in which the samples from the posterior distribution at the first layer are input to an MFA model at the second layer. SMFA is a supervised FE method that uses labeled samples to extract features of an HSI. Based on these three FE methods, a framework for HSI classification is also proposed in this paper. While the dimensionality of the desired features needs to be selected by the user in conventional DR methods, the proposed framework automatically determines the dimensionality of the features according to the classification accuracy, without prior supervision by the user. The main contributions of the paper are the three FE methods (MFA, DMFA, and SMFA) and the framework that automatically extracts the optimal features for HSI classification. The paper is organized as follows. Section 2 briefly describes the three FE methods and the framework. Section 3 presents the experimental results and their analysis. Finally, Section 4 concludes the paper.

Proposed FE Methods And Framework
MFA

The MFA model can be defined as

p(x|z, m) = N(x; W_m z + µ_m, Ψ), p(z) = N(z; 0, I), p(m) = π_m,

where N(z; 0, I) means that z is a Gaussian vector with zero mean and the d × d identity matrix I as its covariance matrix. The parameters of the m-th factor analyzer include a mixing proportion π_m, a mean µ_m, a D × d factor loading matrix W_m, and a D × D diagonal matrix Ψ which represents the independent noise variances for each band. The parameters z, µ_m, W_m, and Ψ of MFA are estimated (trained) by using the expectation-maximization (EM) algorithm [55]. An example demonstrating how MFA works is shown in Figure 1a,b, and the schematic of the MFA is shown in Figure 2. The performance of MFA for classification can be improved by increasing the dimensionality d of the latent factors per component of the mixture of factor analyzers or the number M of mixture components. However, for high-dimensional data, this approach quickly leads to overfitting. Below, we discuss a cross-validation scheme to select d while avoiding overfitting. Figure 1c shows a case where the posteriors have a non-normal distribution; to solve this problem, the DMFA model is proposed. Instead of a simple standard normal prior, the DMFA model uses a more powerful prior over the first-layer factors, which is itself an MFA.
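To make the mixture density concrete, the following sketch evaluates the component responsibilities p(m|x) implied by the model above, using the fact that marginalizing out z leaves each component with a Gaussian density of covariance W_m W_m^T + Ψ. This is a minimal numpy illustration with toy parameters of our own choosing, not the EM training procedure of [55]:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mfa_responsibilities(X, pis, mus, Ws, Psi):
    """Posterior p(m | x) under an MFA: component m is a Gaussian with
    mean mu_m and low-rank-plus-diagonal covariance W_m W_m^T + Psi."""
    M = len(pis)
    logp = np.empty((X.shape[0], M))
    for m in range(M):
        cov = Ws[m] @ Ws[m].T + np.diag(Psi)  # D x D component covariance
        logp[:, m] = np.log(pis[m]) + multivariate_normal.logpdf(X, mus[m], cov)
    logp -= logp.max(axis=1, keepdims=True)   # stabilize before exponentiating
    R = np.exp(logp)
    return R / R.sum(axis=1, keepdims=True)

# toy example: D = 5 bands, d = 2 latent factors, M = 2 components
rng = np.random.default_rng(0)
pis = np.array([0.5, 0.5])
mus = [np.zeros(5), np.full(5, 4.0)]
Ws = [rng.standard_normal((5, 2)) for _ in range(2)]
Psi = 0.1 * np.ones(5)
X = np.vstack([mus[0], mus[1]])               # one sample at each component mean
R = mfa_responsibilities(X, pis, mus, Ws, Psi)
```

A sample placed at a component's mean is assigned almost entirely to that component, which is the behavior EM exploits when re-estimating the per-component parameters.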

DMFA
where θ_m^(2) is the model parameter of the second layer; the superscript emphasizes that the new MFA's parameters belong to the second layer and are specific to component m of the first-layer MFA, while the first-layer parameters are held fixed. Thus, training the DMFA is equivalent to fitting component-specific second-layer MFAs with vectors drawn from p(z, m | x; θ_m^(1)) as data, where θ_m^(1) is the model parameter of the first layer. We use p(k_m | m) = π_{k_m}^(2) to denote the second-layer mixing proportion of mixture component k_m, and K_m to denote the total number of factor analyzers in the second layer for a specific component m of the first layer. For convenience, we index all possible second-layer mixture components by s = 1, ..., S, where S = ∑_{m=1}^{M} K_m. The mixing proportions are π_s^(2) = p(m(s)) p(k_m(s) | m(s)), where m(s) and k_m(s) are the first- and second-layer mixture components m and k_m to which s corresponds.
The same scheme can be extended to train third-layer MFAs, but in this paper, we only consider the two-layer DMFA model.
The DMFA model can be trained by using a greedy layer-wise algorithm. The first layer of DMFA is trained as described above in Section 2.1. When training the second layer, the first-layer parameters are frozen, and the sampled first-layer factor values {z_n^(1)}_m for each mixture component are treated as training data for the second layer. The DMFA training procedure is summarized in Algorithm 1, and illustrations of the DMFA are shown in Figures 1c and 3.
Algorithm 1 DMFA
Step 1: Input HSI X.
Step 2: Train the first layer of DMFA on X with M mixture components and d^(1)-dimensional latent factors using the EM algorithm.
Step 3: Use the first-layer latent factor dataset Y_m = {z_n^(1)}_m for each of the M mixture components as training data for the second layer of DMFA.
Step 4: Train the second layer of DMFA on Y_m with d^(2)-dimensional latent factors and K_m mixture components using the EM algorithm.
Step 5: Output the DR results Z = {z_n^(2)}.

SMFA

SMFA is a supervised FE method. Let y denote an output value (label) for each D-dimensional labeled spectral vector x. The SMFA model can be defined as

p(y|z, m) = N(y; W_ym z + µ_ym, Ψ_y), (13)

where the parameters of the m-th factor analyzer include the mean µ_m, a D × d factor loading matrix W_m, and a D × D diagonal matrix Ψ which represents the independent noise variances for each band; W_ym, µ_ym, and Ψ_y are defined similarly.
The parameters z, µ_m, W_m, Ψ, W_ym, µ_ym, and Ψ_y of SMFA are also estimated by using the EM algorithm [55]. The schematic of the SMFA is shown in Figure 4.
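The greedy layer-wise idea behind DMFA can be illustrated with a deliberately simplified stand-in: hard assignments from a Gaussian mixture replace the MFA responsibilities, and scikit-learn's FactorAnalysis replaces the per-component factor analyzers, with each component's first-layer factors used as training data for a second-layer analyzer. This is only a structural sketch of the two-layer procedure under those substitutions, not the paper's EM-based DMFA; all names and the toy data are ours:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import FactorAnalysis

def greedy_two_layer_features(X, M=3, d1=4, d2=2, seed=0):
    """Crude two-layer sketch: hard-assign samples to M mixture components,
    then fit per-component factor analyzers layer by layer."""
    labels = GaussianMixture(n_components=M, random_state=seed).fit_predict(X)
    Z2 = np.zeros((X.shape[0], d2))
    for m in range(M):
        idx = labels == m
        # layer 1: d1-dimensional factors for component m (stand-in for MFA EM)
        z1 = FactorAnalysis(n_components=d1, random_state=seed).fit_transform(X[idx])
        # layer 2: a second factor analyzer trained on the layer-1 factors
        Z2[idx] = FactorAnalysis(n_components=d2, random_state=seed).fit_transform(z1)
    return Z2

# toy data: three well-separated spectral clusters in 10 "bands"
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(100, 10)) for c in (0.0, 5.0, 10.0)])
Z2 = greedy_two_layer_features(X)
```

The key structural point carried over from Algorithm 1 is that the second layer never sees the raw pixels: it is trained only on the first-layer factors, with the first-layer parameters held fixed.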

Framework
Traditionally, in DR, the dimensionality of the desired features has to be set by the user. In this paper, we propose a framework that automatically selects the optimal dimensionality of the desired features for an HSI. We use the classification accuracy of a classifier on validation samples to automatically determine the dimensionality of the features. Different classifiers, such as maximum likelihood (ML), support vector machine (SVM), and random forest (RF), can be used in this framework. The frameworks based on MFA/SMFA and on DMFA for HSI classification are summarized in Algorithms 2 and 3, respectively.

Algorithm 2 Framework based on MFA and SMFA
Step 1: Input HSI X, training samples;
Step 2: Automatically select the optimal number of features d̂ and mixture components M̂:
for each candidate M do
for each candidate d do
Run MFA (or SMFA);
Five-fold cross-validation of SVM (ML, or RF) on the training samples;
Save the cross-validation (CV) score CV_{M,d};
end
end
Return the M̂ and d̂ corresponding to the best CV score;
Step 3: Run MFA (or SMFA) with M̂ and d̂;
Step 4: Run SVM (ML, or RF) classification;
Step 5: Output classification results.
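Step 2 of the framework can be sketched with scikit-learn, using FactorAnalysis as a stand-in for the MFA/SMFA extractor and an RBF-SVM scored by five-fold cross-validation; the candidate grid and the toy data are our assumptions, not values from the paper:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import FactorAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# toy stand-in for labeled HSI pixels: 300 samples, 20 "bands", 3 classes
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

# Step 2: grid over candidate dimensionalities, keep the best 5-fold CV score
scores = {}
for d in (2, 4, 8):
    pipe = make_pipeline(FactorAnalysis(n_components=d, random_state=0),
                         SVC(kernel="rbf"))
    scores[d] = cross_val_score(pipe, X, y, cv=5).mean()
d_best = max(scores, key=scores.get)

# Step 3: refit the extractor with the selected d on all training data
Z = FactorAnalysis(n_components=d_best, random_state=0).fit_transform(X)
```

The same loop extends to a joint grid over (M, d) once an MFA implementation is plugged in; only the pipeline's first stage changes.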

Algorithm 3 Framework based on DMFA
Step 1: Input HSI X, training samples;
Step 2: Automatically select the optimal number of features in the first layer d^(1) and in the second layer d^(2), and the number of mixture components in the first layer M and in the second layer K_m, using five-fold cross-validation as in Algorithm 2;
Step 3: Run DMFA with the selected parameters;
Step 4: Run SVM (ML or RF) classification;
Step 5: Output classification results.

Experiments and Results
The experiments were done using the ML, RF, and SVM classifiers, but since ML and RF gave inferior or only slightly inferior results compared to SVM, only the results of the SVM classifier are reported.

Experimental Datasets
The Indian Pines dataset was collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Indian Pines test site. The image contains 145 × 145 pixels with a spatial resolution of 20 m and 220 spectral bands covering 400 nm to 2500 nm. In the experiments, noisy bands and atmospheric vapor absorption bands are excluded, leaving 200 spectral bands. Figure 5 shows the false-color composite of the Indian Pines image and the corresponding ground reference map. The nine largest classes are considered for classification [24,56]. For the Indian Pines, University of Pavia, and Salinas datasets, 10% of the labeled samples of each class are randomly selected as training samples, and the remaining 90% are used as the test set. Tables 1 and 2 list the corresponding class and sample information.

The Houston dataset was provided by the IEEE Geoscience and Remote Sensing Society (GRSS) for the 2013 Data Fusion Contest. This image covers the University of Houston campus and the neighboring urban area. The dataset has 349 × 1905 pixels with a spatial resolution of 2.5 m and 144 spectral bands covering 380 nm to 1050 nm. This HSI contains fifteen classes of interest. Figure 6 shows the false-color composite of the Houston image and the corresponding ground reference map.

The University of Pavia dataset was captured by the Reflective Optics System Imaging Spectrometer sensor over the city of Pavia, Italy. This image has 610 × 340 pixels with a spatial resolution of 1.3 m and 115 spectral bands covering 0.43 µm to 0.86 µm. In the experiments, the data contains nine classes of interest and 103 spectral bands after removing 12 noisy bands. Figure 7 shows the false-color composite of the University of Pavia image and the corresponding ground reference map.

The Salinas dataset was acquired by the AVIRIS sensor over Salinas Valley, California. This image has 512 × 217 pixels with a spatial resolution of 3.7 m and 204 spectral bands after removing 20 water absorption bands. It contains sixteen classes of interest. Figure 8 shows the false-color composite of the Salinas image and the corresponding ground reference map.

Experimental Setup
An SVM classifier is used to evaluate the performance of the proposed methods. SVM is a supervised classification method that uses a kernel to map the data with a non-linear transformation into a higher-dimensional space, where it tries to find a linear hyperplane separating the different classes. In the experiments, the LibSVM toolbox for MATLAB was used with a radial basis function (RBF) kernel [57,58]. Five-fold cross-validation is used to find the best SVM parameters, i.e., the kernel parameter and the regularization parameter. The evaluation metrics used are overall accuracy (OA), average accuracy (AA), and the Kappa coefficient (KC), along with the standard deviation (STD). To further evaluate the performance of the proposed algorithms, the following statistical DR methods are compared: PCA, PPCA, FA, LDA, NWFE, MFA, DMFA, and SMFA. Each experiment is run ten times, and the average over these ten runs is reported. We also assessed the effect of the tuning parameters (M and d) on the performance of the proposed methods. Since we are interested in classification accuracy, we investigated the effect of the number of mixture components M and the dimensionality of the latent factors d on the OA. Figure 9a,b shows the three-dimensional surface of the OA of MFA and SMFA with respect to the parameters M and d for the SVM classifier, respectively. It can be seen that the OA gradually increases with d at first, then remains stable with slight fluctuations, and finally decreases slightly once d exceeds a certain value. It can also be observed that the OA is insensitive to M. For the DMFA algorithm, we need to estimate the numbers of mixture components M and K_m and the dimensionalities of the latent factors d^(1) and d^(2) in the first and second layers, respectively.
In the experiments, M is the same as in MFA, and with K_m ∈ {2, ..., M} we set K_m = 2 for all m; d^(1) is the same as d, and d^(2) ∈ {3, 4, ..., d^(1)}. Five-fold cross-validation can be used to obtain the optimal parameters. To analyze the impact of the number of mixture components M and the dimensionality of the latent factors d^(1) in the first layer on the performance of DMFA, the output of the first layer of DMFA was used for HSI classification. Figure 9c shows the OAs with respect to the parameters M and d^(1); similar trends to those of MFA and SMFA in Figure 9a,b can be observed.
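The evaluation metrics named above (OA, AA, and KC) can all be computed from a single confusion matrix. The following self-contained sketch shows one common way to do so; the row/column convention (rows: reference, columns: predicted) is our assumption:

```python
import numpy as np

def oa_aa_kappa(conf):
    """Overall accuracy, average accuracy, and Kappa coefficient
    from a confusion matrix (rows: reference, cols: predicted)."""
    conf = np.asarray(conf, dtype=float)
    n = conf.sum()
    oa = np.trace(conf) / n                                  # fraction correct
    aa = np.mean(np.diag(conf) / conf.sum(axis=1))           # mean per-class recall
    pe = np.sum(conf.sum(axis=0) * conf.sum(axis=1)) / n**2  # chance agreement
    kappa = (oa - pe) / (1 - pe)                             # agreement beyond chance
    return oa, aa, kappa

conf = [[40, 10],
        [ 5, 45]]
oa, aa, kc = oa_aa_kappa(conf)  # oa = 0.85, aa = 0.85, kappa = 0.70
```

Unlike the OA, the AA weights every class equally regardless of its size, and the KC discounts agreement that would occur by chance, which is why all three are reported together.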

Classification
The first experiment was performed on the Indian Pines dataset. Figure 10 shows the classification maps obtained by the different methods. From the figures, it can be seen that the classification maps obtained by PCA, PPCA, FA, LDA, and NWFE are not very satisfactory, since they contain a lot of visible noise. By contrast, MFA, DMFA, and SMFA give much better classification maps; all of them have a smoother appearance and preserve more detail at the edges. Beyond the visual comparison, Table 5 presents the quantitative classification results for all methods, where it can be observed that MFA, DMFA, and SMFA achieve much higher classification accuracies than PCA, PPCA, and FA, respectively. These results imply that the performance of DR can be improved by considering a Gaussian mixture model. Moreover, DMFA and SMFA clearly outperform MFA, and the performance of DMFA and SMFA is similar; both give the highest OA, AA, and KC and achieve most of the top classification accuracies for the individual classes. This indicates that MFA, DMFA, and SMFA can extract more useful information for classification from a complicated HSI. Moreover, SMFA outperforms LDA and NWFE, which shows that SMFA is an effective supervised DR method. Using SVM, MFA, DMFA, and SMFA improve the OA by 11.11%, 13.38%, and 13.42%, respectively, compared with the other methods in the experiment. It is interesting to note that all DR methods based on FA (SMFA, DMFA, MFA, and FA) performed better than PCA and PPCA. The reason for this is that the noise can be distributed inconsistently across components in a real HSI. Table 5 also gives the STDs of the classification results for the different DR methods on the Indian Pines dataset. It can be seen that all methods give similar and stable classification results. Table 6 compares the CPU processing time (in seconds) used by the different DR methods for the Indian Pines dataset.
All methods were implemented in MATLAB R2019a on a computer with an Intel(R) Core(TM) i7-6700 processor (3.40 GHz), 8.00 GB of memory, and the 64-bit Windows 10 operating system. The running times for MFA, DMFA, and SMFA were 3.85, 18.07, and 0.12 s, respectively. It is worth noting that the running time of the supervised methods (LDA, NWFE, and SMFA) is affected considerably by the number of labeled (training) samples used, while that of the unsupervised methods is affected by the total size of the dataset.

The second experiment was performed on the Houston dataset. Figure 11 shows the classification maps obtained by the different methods. From the figures, we can see that the proposed MFA, DMFA, and SMFA algorithms again outperform the other algorithms. Table 7 presents the quantitative classification results of the different DR methods. As shown in Table 7, the performance of DMFA and SMFA is better than that of MFA and much better than those of PCA, PPCA, FA, LDA, and NWFE. MFA, DMFA, and SMFA improve the OA by 4.60%, 6.67%, and 7.35%, respectively, compared with the other methods in the experiment. Table 7 also presents the STDs of the classification results. It can be seen that FA and LDA give the most stable results, while MFA, DMFA, and SMFA fluctuate slightly between runs but still give relatively stable results.

The third and fourth experiments were performed on the University of Pavia and Salinas datasets. It should be noted that the NWFE method does not work for the Salinas dataset; therefore, no NWFE results are reported for Salinas. Figures 12 and 13 show the classification maps obtained by the different methods on the University of Pavia and Salinas datasets, respectively. From Figures 12 and 13, it can be seen that the classification maps obtained by MFA, DMFA, and SMFA are much better than those of PCA, PPCA, FA, LDA, and NWFE. Tables 8 and 9 present the quantitative classification results.
As shown in Tables 8 and 9, the classification accuracies of the proposed MFA, DMFA, and SMFA methods are much better than those of the PCA, PPCA, and FA methods. These results further demonstrate that, instead of using a single Gaussian distribution, the performance of DR can be improved by considering a Gaussian mixture model. For the University of Pavia dataset, MFA, DMFA, and SMFA improve the OA by 6.05%, 7.02%, and 7.02%, respectively, compared with the other methods used in the experiment. For the Salinas dataset, MFA, DMFA, and SMFA improve the OA by 5.11%, 5.49%, and 5.62%, respectively. Moreover, DMFA and SMFA are clearly better than MFA; they give similar and the highest OA, AA, and KC and also achieve most of the top classification accuracies for the individual classes. This further indicates that MFA, DMFA, and SMFA can extract more effective information for classification from a complicated HSI. Tables 8 and 9 also give the STDs of the classification results for the different DR methods on the University of Pavia and Salinas datasets, respectively. It can be seen that all methods give similar and relatively stable classification results, which further demonstrates that MFA, DMFA, and SMFA are stable and effective DR methods.

Conclusions
In this paper, MFA, DMFA, and SMFA were proposed for feature extraction of HSIs and were then used for their classification. MFA, DMFA, and SMFA are probabilistic DR methods; instead of assuming that a whole HSI obeys a single Gaussian distribution, they use a Gaussian mixture model to extract more effective information for DR. In MFA, the Gaussian mixture model is extended to allow a low-dimensionality representation of each Gaussian. DMFA, a two-layer MFA, feeds the samples from the posterior at the first layer into an MFA model at the second layer. MFA and DMFA are unsupervised DR methods, particularly suitable for DR of HSIs with a non-normal distribution and unlabeled samples. SMFA is a supervised DR method that uses labeled samples to extract features; it can be effectively used for DR of HSIs with a non-normal distribution and labeled samples.
Based on the three DR methods, we also proposed a framework for HSI classification, in which the overall accuracy of a classifier on validation samples is used to automatically determine the optimal number of features for HSI classification. This framework can automatically extract the most effective features for HSI classification. To validate the performance of the DR methods, we conducted experiments with SVM classification on four real HSIs. The experimental results show that MFA, DMFA, and SMFA give better results than the statistical DR methods used for comparison.
In future work, the proposed methods will be validated on other applications, e.g., hyperspectral unmixing and target detection.

Conflicts of Interest:
The authors declare no conflict of interest.