A Novel Tri-Training Technique for the Semi-Supervised Classification of Hyperspectral Images Based on Regularized Local Discriminant Embedding Feature Extraction

Depin Ou 1,† , Kun Tan 1,2,†,* , Qian Du 3, Jishuai Zhu 1,4 , Xue Wang 1 and Yu Chen 1,* 1 Key Laboratory for Land Environment and Disaster Monitoring of NASG, China University of Mining and Technology, Xuzhou 221116, China; tb17160017b2@cumt.edu.cn (D.O.); zhujishuai@charmingglobe.com (J.Z.); tb16160015b2@cumt.edu.cn (X.W.) 2 Key Laboratory of Geographic Information Science (Ministry of Education), East China Normal University, Shanghai 200241, China 3 Department of Electrical and Computer Engineering, Mississippi State University, Starkville, MS 39762, USA; du@ece.msstate.edu 4 Chang Guang Satellite Technology Co. Ltd., Changchun 130033, China * Correspondence: tankun@cumt.edu.cn (K.T.); chenyu@cumt.edu.cn (Y.C.); Tel.: +86-051683591309 (K.T.) † These authors contributed equally to this work.


Introduction
Hyperspectral sensors have hundreds of spectrally contiguous bands, which can provide abundant spectral information [1]. Due to the high spectral resolution, hyperspectral images (HSIs) have been widely used in applications such as agricultural mapping [2], water quality analysis [3], and mineral identification [4]. A key component in these applications is classification. Conventional supervised classifiers can offer satisfactory classification performance, but this performance depends on both the quantity and quality of the training samples. However, labeled training samples can be costly, difficult, and time-consuming to obtain, and it is difficult for traditional supervised classifiers to perform well when the number of labeled training samples is limited [5]. Although deep learning based methods have now been developed for HSI classification, including convolutional neural networks (CNNs) [6][7][8], 3D convolutional neural networks (3D-CNNs) [9,10], and long short-term memory (LSTM) networks [11,12], these problems still exist. Therefore, how to use unlabeled samples to improve the classification performance has become a hot research topic. The use of unlabeled samples to improve the classification performance is known as semi-supervised learning [13]. Common semi-supervised learning algorithms include multi-view learning algorithms [14], self-learning algorithms [15], tri-training algorithms [16], graph-based approaches [17], and the transductive support vector machine (TSVM) algorithm [18]. High-dimensional data processing requires more storage and computation time [19,20]. In addition, the spectral bands in an HSI are highly correlated, and the classification performance deteriorates as the dimensionality increases (the Hughes phenomenon) when the training samples are limited [21,22].
Therefore, in order to reduce the time consumption and improve the classification performance, it is necessary to extract the useful spectral information before performing classification.
The basic technique of spectral information extraction is dimension reduction, the goal of which is to embed the high-dimensional data in a low-dimensional space containing the crucial information [23,24]. Research into dimension reduction has developed rapidly in recent years. Linear dimension reduction methods obtain the spectral information in the low-dimensional space by building a linear model. Typical methods include principal component analysis (PCA) [25], linear discriminant analysis (LDA) [26], direct linear discriminant analysis (DLDA) [27], and the maximum margin criterion (MMC) [28]. These methods are simple to operate, efficient, and have a strong generalization ability on linear datasets. However, they cannot obtain satisfactory performance on nonlinear datasets. Therefore, nonlinear dimension reduction methods have been proposed for use with nonlinear datasets [29]. Common nonlinear dimension reduction methods include kernel based approaches [30,31] and manifold learning algorithms [32]. In [33], kernel PCA was first proposed to solve the sparsity and dimensionality problems of nonlinear datasets. In [34], a new nonlinear dimension reduction method combining a kernel function with Fisher discriminant analysis was used in the classification of HSIs. In [35,36], Song et al. proposed models that learn a set of robust hash functions to map the high-dimensional data points into binary hash codes by effectively utilizing the local structural information. However, the selection of a suitable kernel function lacks a theoretical basis.
The manifold learning algorithms depict the intrinsic structure of high-dimensional data by constructing a representation of the data lying in a low-dimensional manifold [31]. Tenenbaum [37] tried to preserve the geodesic distances based on multi-dimensional scaling, and proposed the isometric feature mapping (Isomap) method. In [38], locally linear embedding (LLE) was used to embed data points in a low-dimensional space by finding the optimal linear reconstruction in a small neighborhood. He et al. [39] subsequently proposed the neighborhood preserving embedding algorithm based on LLE, and regarded the error minimization as the objective function. In [40], the local discriminant embedding (LDE) algorithm was used to extend global LDA to a local version, so as to perform the local discriminant embedding in a graph embedding framework. However, the aforementioned manifold learning algorithms suffer from singularity and cannot preserve the data diversity in the case of limited training samples. Therefore, in this paper, we propose a new feature extraction method, regularized local discriminant embedding (RLDE), to preserve the local feature information and overcome the singularity when training samples are limited. In order to make full use of the unlabeled samples, we select the semi-supervised tri-training algorithm. We also use an active learning method to select the unlabeled samples and use ensemble learning to improve the classification result.

Spatial Mean Filtering and Feature Extraction
Let X = [x_1, x_2, · · · , x_m] ∈ R^(n×m) denote the training dataset with n-dimensional feature vectors; Y = [y_1, y_2, · · · , y_m] represents the corresponding labels; m is the number of training samples; and the whole dataset is denoted as {x_i}_(i=1)^l ∈ R^n, where l is the total number of samples.

Spatial Mean Filtering
To reduce noise and smooth the homogeneous regions, we first use spatial mean filtering to preprocess the HSIs. The spatial mean filtering of a labeled pixel X_i is denoted as:

X̄_i = ( X_i + Σ_(k=1)^s v_k X_ik ) / ( 1 + Σ_(k=1)^s v_k )    (1)

where w is the width of the neighborhood window; s = w^2 − 1 is the number of neighbors of X_i; X_ik denotes the k-th neighbor of X_i; v_k = exp(−γ_0 ||X_i − X_ik||^2) is the similarity weight derived from the spectral distance between the neighboring pixel and the central pixel; and γ_0 controls the degree of filtering.
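As an illustration, the spatial mean filtering described above can be sketched as follows. This is a minimal NumPy implementation under the assumption that the filter is a similarity-weighted mean over the w × w window (the central pixel receives weight exp(0) = 1), with the window clamped at image borders; the function name `spatial_mean_filter` is ours, not from the paper.

```python
import numpy as np

def spatial_mean_filter(img, w=3, gamma0=0.9):
    """Similarity-weighted mean filter for a hyperspectral cube.

    img: array of shape (rows, cols, bands).
    Each pixel is replaced by the weighted mean of its w x w window,
    where a neighbour's weight decays with its spectral distance to
    the central pixel: v_k = exp(-gamma0 * ||X_i - X_ik||^2).
    """
    r = w // 2
    rows, cols, bands = img.shape
    # replicate border pixels so every window is fully populated
    padded = np.pad(img, ((r, r), (r, r), (0, 0)), mode="edge")
    out = np.empty_like(img, dtype=float)
    for i in range(rows):
        for j in range(cols):
            win = padded[i:i + w, j:j + w].reshape(-1, bands)
            center = img[i, j]
            v = np.exp(-gamma0 * ((win - center) ** 2).sum(axis=1))
            out[i, j] = (v[:, None] * win).sum(axis=0) / v.sum()
    return out
```

On a homogeneous region the filter is the identity, and on a noisy region it shrinks the per-pixel variation, which is exactly the smoothing behavior the preprocessing step relies on.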

Local Discriminant Embedding (LDE)
LDE is a nonlinear supervised dimension reduction method. The local information of homogeneous and heterogeneous samples is preserved by defining between-class graphs and within-class graphs [41,42]. The basic idea is to simultaneously attain between-class separation and within-class local structure preservation. The objective function of LDE is denoted as:

max_V Σ_(i,j) ||V^T X_i − V^T X_j||^2 ω'_(i,j),  s.t. Σ_(i,j) ||V^T X_i − V^T X_j||^2 ω_(i,j) = 1    (2)

where V is the optimal projection matrix, and ω' and ω are the weight matrix of the heterogeneous neighboring sample points and the weight matrix of the nearest-neighbor (homogeneous) sample points, respectively, which are defined as:

ω'_(i,j) = exp(−||X_i − X_j||^2 / t), if X_i ∈ N(X_j) or X_j ∈ N(X_i) and y_i ≠ y_j; 0 otherwise    (3)
ω_(i,j) = exp(−||X_i − X_j||^2 / t), if X_i ∈ N(X_j) or X_j ∈ N(X_i) and y_i = y_j; 0 otherwise    (4)

where t is a constant parameter whose value is the square of the mean value of the Euclidean distances between the sample points, and N(X) denotes the k neighborhood samples of training sample X. The numerator of Equation (2) can be converted into:

Σ_(i,j) ||V^T X_i − V^T X_j||^2 ω'_(i,j) = 2 tr(V^T X(D' − W')X^T V)    (5)

After the same conversion of the constraint, we can obtain:

Σ_(i,j) ||V^T X_i − V^T X_j||^2 ω_(i,j) = 2 tr(V^T X(D − W)X^T V)    (6)

Thus, the objective function can be written as follows:

max_V tr(V^T X(D' − W')X^T V) / tr(V^T X(D − W)X^T V)    (7)

where D and D' are diagonal matrices with diagonal elements D_(i,i) = Σ_j ω_(i,j) and D'_(i,i) = Σ_j ω'_(i,j). W and W' are affinity weight matrices, which are sparse and symmetric, as computed by Equations (4) and (3), respectively. The optimal LDE projection is obtained by finding the eigenvectors corresponding to the largest nonzero eigenvalues of the following generalized eigen-decomposition problem:

X(D' − W')X^T v = λ X(D − W)X^T v    (8)
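The construction above (heat-kernel weights on a symmetric k-nearest-neighbor graph, split by class agreement, then a generalized eigenproblem) can be sketched as follows. This is a minimal sketch, not the paper's implementation: the small ridge added to the within-class scatter is only there to keep the generalized eigensolver numerically stable, which is precisely the singularity issue that RLDE later addresses in a principled way.

```python
import numpy as np
from scipy.linalg import eigh

def lde(X, y, n_components=2, k=5):
    """Sketch of local discriminant embedding.

    X: (n_samples, n_features) data matrix (rows are samples).
    Returns a projection matrix V of shape (n_features, n_components)
    from the leading eigenvectors of
        X(D' - W')X^T v = lam * X(D - W)X^T v.
    """
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    t = np.sqrt(d2).mean() ** 2          # heat-kernel width (paper's choice)
    nn = np.argsort(d2, axis=1)[:, 1:k + 1]              # k nearest neighbors
    adj = np.zeros((n, n), dtype=bool)
    adj[np.arange(n)[:, None], nn] = True
    adj |= adj.T                          # symmetric neighborhood relation
    same = y[:, None] == y[None, :]
    heat = np.exp(-d2 / t)
    W = np.where(adj & same, heat, 0.0)   # within-class affinity
    Wp = np.where(adj & ~same, heat, 0.0) # between-class affinity
    Sw = X.T @ (np.diag(W.sum(1)) - W) @ X
    Sb = X.T @ (np.diag(Wp.sum(1)) - Wp) @ X
    # small ridge keeps Sw positive definite when samples are scarce
    vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(X.shape[1]))
    return vecs[:, np.argsort(vals)[::-1][:n_components]]
```

Projecting the data is then simply `X @ V`, matching the V^T x mapping in the text.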

Regularized Local Discriminant Embedding (RLDE)
The manifold structure of all the data can be obtained after simulating the manifold structure of the training data through the LDE and local Fisher discriminant analysis (LFDA) algorithms [43,44]. These algorithms can not only detect the internal structure, but can also preserve the discriminative structure of the data [45]. However, the LDE and LFDA algorithms have the following shortcomings: (1) when the number of training samples is smaller than the spectral dimension, the singular value problem occurs in the process of solving the projection vector; and (2) in attempting to preserve the local difference information, the over-fitting problem occurs [46]. Therefore, we propose the RLDE method to solve the above problems. The objective function of this method is derived from Equation (2):

max_V tr(V^T [X(D' − W')X^T + α XX^T] V) / tr(V^T [X(D − W)X^T + α I] V)    (9)

where α I is the added regular constraint, and α is a regularization parameter with a value in [0, 1]. Equation (9) is equivalent to:

max_V [ Σ_(i,j) ||V^T X_i − V^T X_j||^2 ω'_(i,j) + 2α tr(V^T XX^T V) ] / [ Σ_(i,j) ||V^T X_i − V^T X_j||^2 ω_(i,j) + 2α tr(V^T V) ]    (10)

As in LDE, the optimized objective is to maximize Σ_(i,j) ||V^T X_i − V^T X_j||^2 ω'_(i,j) and minimize Σ_(i,j) ||V^T X_i − V^T X_j||^2 ω_(i,j), where XX^T is utilized to preserve the maximal data variance. The diagonal regularization α I in the denominator improves the stability of the solution, without impacting the local intra-class neighborhood preserving ability. The item V^T X(D − W)X^T V is used to maintain the intra-class relationships. RLDE is therefore suitable for the small-sample-size HSI classification problem. The optimal RLDE projection is obtained by finding the eigenvectors corresponding to the largest nonzero eigenvalues of the following generalized eigen-decomposition problem:

[X(D' − W')X^T + α XX^T] v = λ [X(D − W)X^T + α I] v    (11)
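The regularized eigenproblem can be sketched in the same style as LDE. This sketch reflects one plausible reading of the regularized objective described above (variance term α·XX^T in the numerator, diagonal regularization α·I in the denominator); the key point it demonstrates is that the regularized denominator stays invertible even when there are fewer samples than spectral bands, so no ad hoc jitter is needed.

```python
import numpy as np
from scipy.linalg import eigh

def rlde(X, y, n_components=2, k=5, alpha=0.5):
    """Sketch of regularized local discriminant embedding.

    X: (n_samples, n_features), rows are samples, so X.T @ X plays the
    role of XX^T in the paper's column-sample notation.
    """
    n, d = X.shape
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    t = np.sqrt(d2).mean() ** 2
    nn = np.argsort(d2, axis=1)[:, 1:k + 1]
    adj = np.zeros((n, n), dtype=bool)
    adj[np.arange(n)[:, None], nn] = True
    adj |= adj.T
    same = y[:, None] == y[None, :]
    heat = np.exp(-d2 / t)
    W = np.where(adj & same, heat, 0.0)
    Wp = np.where(adj & ~same, heat, 0.0)
    Sw = X.T @ (np.diag(W.sum(1)) - W) @ X
    Sb = X.T @ (np.diag(Wp.sum(1)) - Wp) @ X
    num = Sb + alpha * (X.T @ X)     # alpha * XX^T keeps the data variance
    den = Sw + alpha * np.eye(d)     # diagonal regularization: always PD
    vals, vecs = eigh(num, den)
    return vecs[:, np.argsort(vals)[::-1][:n_components]]
```

Note that the small-sample-size case (n_samples < n_features) works directly, whereas the unregularized within-class scatter would be singular there.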

Cooperative Training Strategy Combining Local Features
In [47], the optimal classifier combination selected by the diversity measures was multinomial logistic regression (MLR), k-nearest neighbor (KNN), and extreme learning machine (ELM). In this study, the correlation coefficient, disagreement metric, and double-fault measure were implemented to select the optimal classifier combination. It was found that the combination of MLR, KNN, and random forest (RF) achieved the best performance. Hence, the base classifiers were selected as MLR, KNN, and RF in this research. The procedure of the proposed method can be summarized as follows.
(1) A mean filtering process is employed to reduce the noise in the HSI.
(2) The local feature information of the training samples L_i is extracted by the RLDE method, and the resulting set is denoted L_i.
(3) The classifier h_i is trained with L_i, to obtain the predicted classification result S_i.
(4) For each classifier h_i, the other two classifiers select the unlabeled samples on whose labeling they agree, which builds the candidate set U_i.
(5) The active learning method is used to select the most useful and informative samples L_i from the candidate sets and add them to the training set. The process is terminated if the stopping condition is met; otherwise, go to Step (2).
The final classification result is obtained by the majority voting method.

Pseudo-code Describing the RLDE Tri-Training Algorithm
Algorithm: RLDE tri-training
Input:
L: Original labeled sample set
U: Unlabeled sample set
BT: Breaking ties algorithm
MV: Majority voting algorithm
Process:
1. Apply spatial mean filtering and extract features from L and U with RLDE.
2. Train the three base classifiers (MLR, KNN, and RF) on the labeled set.
3. For each classifier, build its candidate set from the unlabeled samples on whose labels the other two classifiers agree.
4. Use BT to select the most informative candidates, label them, and add them to the labeled set.
5. Repeat Steps 2-4 until the stopping condition is met.
Output: The final classification result obtained by MV.

Experimental Results and Analysis
In the spatial mean filtering (SMF) operation, the parameters for the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) dataset were set as follows: the degree of filtering γ 0 = 0.9 and the filtering window w = 9. The parameters for the Reflective Optics System Imaging Spectrometer (ROSIS) dataset were set as γ 0 = 0.9 and w = 7. These parameters can prevent over-filtering and increase the similarity and consistency of the neighboring pixels. In the feature extraction, the parameter in RLDE was selected as α = 0.5 for the AVIRIS dataset and 0.7 for the ROSIS dataset. We selected L = 5, 10, and 15 samples per class as the initial labeled training sets. We set k = 3 for KNN, and the parameter settings of MLR and RF were set as the default values. The number of most useful and informative samples in each iteration was set as 100. All the experiments were carried out 10 times, and the average results are reported. The initial training samples also have an impact on the accuracy (see Section 4). The experiments were therefore performed with the optimal feature number for each dataset.

Data Used in the Experiments
In the experiments, two real HSIs were used to evaluate the proposed approach. The HSI used in the first experiment was collected by the AVIRIS sensor over the Indian Pines test site in Northwestern Indiana in 1992. This dataset has a spatial size of 145 × 145 pixels and is made up of 224 spectral bands in the wavelength range of 0.4-2.5 µm at 10 nm intervals, with a spatial resolution of 20 m. In total, 202 bands were used in the experiment after the noisy and water absorption bands were removed. For illustrative purposes, the image scene in pseudocolor is shown in Figure 1a. The ground-truth map available for the scene with 16 mutually exclusive ground-truth classes is shown in Figure 1b.
The HSI used in the second experiment was collected by the ROSIS sensor over the urban area of the University of Pavia, Italy. This dataset has a spatial size of 610 × 340 pixels and is made up of 115 spectral bands in the wavelength range of 0.43-0.68 µm, with a spatial resolution of 1.3 m. In total, 103 bands were used in the experiment after the noisy and water absorption bands were removed. For illustrative purposes, the image scene in pseudocolor is shown in Figure 2a. The ground-truth map available for the scene with nine mutually exclusive ground-truth classes is shown in Figure 2b.

Table 1 and Figure 3 show the classification results of the tri-training algorithm based on the RLDE method, using spatial mean filtering (SMF) and without it (non-SMF). As the unlabeled samples are continuously added, the classification accuracy increases. However, when the iterations reach seven, the classification accuracy starts to level off. In the AVIRIS experiment, with 5, 10, and 15 initial training samples per class, the overall accuracy (OA) of SMF increases by 12.19%, 11.39%, and 11.3%, respectively, compared with non-SMF. In the ROSIS experiment, the OA of SMF increases by 7.56%, 6.45%, and 6.57%, respectively. Therefore, we used SMF to process the datasets in the subsequent experiments.

Figure 4 shows the classification results of the tri-training algorithm alone and of the tri-training algorithm based on the RLDE, LDE, and LFDA methods with the AVIRIS data. The tri-training algorithm based on the LFDA method was proposed by Zhang and Jia in 2011 [48]. From Table 2 and Figures 4 and 5, we can see that the classification accuracy is not significantly related to the number of initial samples once the number of unlabeled samples reaches 900 or more. For example, the classification accuracy using the LDE feature extraction method is 92.03%, 93.09%, and 94.01% when the number of initial samples is 5, 10, and 15, respectively. This indicates that the proposed algorithm is both reliable and robust. The proposed tri-training classification algorithm based on RLDE feature extraction performs the best among all the methods with the different initial training samples. The OA is improved by 4.85%, 6.13%, and 2.42% compared with tri-training alone, LDE, and LFDA when the initial samples are 5; by 4.84%, 5.75%, and 2.78% when the initial samples are 10; and by 4.53%, 4.97%, and 2.48% when the initial samples are 15. Meanwhile, the classification accuracy based on the RLDE feature extraction method reaches 98.98%, which indicates that the proposed tri-training classification algorithm is superior to the other methods.

Figure 6 shows the classification results of the tri-training algorithm alone and of the tri-training algorithm based on the RLDE, LDE, and LFDA methods with the ROSIS data. From Table 3 and Figures 6 and 7, we can see that, as the unlabeled samples are continuously added, the classification accuracy increases. However, when the unlabeled samples reach 700, the OA becomes stable. The classification accuracy is not significantly related to the number of initial samples once the number of unlabeled samples reaches 900 or more. For example, the classification accuracy using the LDE feature extraction method is 96.16%, 96.66%, and 96.66% when the number of initial samples is 5, 10, and 15, respectively. This again indicates that the proposed algorithm is both reliable and robust. The proposed tri-training classification algorithm based on RLDE feature extraction performs the best among all the methods under the different initial training samples. The OA is improved by 10.79%, 1.73%, and 2.06% compared with tri-training alone, LDE, and LFDA when the initial samples are 5; by 10.97%, 1.73%, and 2.06% when the initial samples are 10; and by 11.36%, 1.96%, and 2.08% when the initial samples are 15. Meanwhile, the classification accuracy based on the RLDE feature extraction method reaches 98.62%.

Discussion
In this section, the hyperparameters w, γ0, and α are experimentally analyzed. In the SMF, both w and γ0 affect the final precision. Hence, parameter w was chosen from the range of {1, 3, 5, 7, 9, 11}, and parameter γ0 was chosen from the range of {0.1, 0.2, 0.3, . . . , 0.9}. In this parameter analysis, α was always set to 0.1. In the RLDE feature extraction method, α is the essential parameter, and was chosen from the range of {0, 0.1, 0.2, . . . , 1}. Parameter w was set to 3, and γ0 was set to 0.2. Fifteen samples in each class were selected as the training set, and no addition operation was conducted with the training samples. Figure 8 shows the OA versus w and γ0 for the AVIRIS and ROSIS datasets, where it can be seen that γ0 has less impact on the classification accuracy than w. The optimal value of w is 9 for the AVIRIS dataset and 7 for the ROSIS dataset. The classification accuracy tends to be stable when parameter w is within the range of 5 to 9. Figure 9 shows the OA versus α for the AVIRIS and ROSIS datasets. The optimal value of parameter α is 0.5 for the AVIRIS dataset and 0.7 for the ROSIS dataset.


The initial training sample conditions also have an impact on the accuracy, so the optimal feature selection is discussed here. In this analysis, the range of the spectral information dimension was set from 1 to 30. With 5, 10, and 15 initial training samples per class, and the different feature extraction methods, we selected the optimal feature information over all the dimensions, as shown in Table 4 and Figure 10.
For the AVIRIS data, when the number of initial training samples per class is 5, LDE obtains its maximum OA of 64.35% at a dimension of 20, while RLDE and LFDA obtain their maximum OA at dimensions of 12 and 30, respectively. When the number of initial training samples per class is 10, LDE obtains its maximum OA of 75.16% at a dimension of 26, while RLDE and LFDA obtain their maximum OA at dimensions of 10 and 30, respectively. When the number of initial training samples per class is 15, PCA obtains its maximum OA of 78.35% at a dimension of 30, while RLDE and LFDA obtain their maximum OA at dimensions of 10 and 24, respectively. Among the four different feature extraction methods, RLDE obtains the highest classification accuracy and requires the smallest feature information dimension. With 5, 10, and 15 initial training samples per class, the feature information dimensions of all the methods were set as shown in Table 2 in the experiments.
For the ROSIS data, when the number of initial training samples per class is 5, LDE obtains its maximum OA of 70.20% at a dimension of 21, while RLDE and LFDA obtain their maximum OA at dimensions of 8 and 24, respectively. When the number of initial training samples per class is 10, LDE obtains its maximum OA of 77.93% at a dimension of 24, while RLDE and LFDA obtain their maximum OA at dimensions of 11 and 38, respectively. When the number of initial training samples per class is 15, LDE obtains its maximum OA of 82.61% at a dimension of 24, while RLDE and LFDA obtain their maximum OA at dimensions of 12 and 8, respectively. Among the four different feature extraction methods, RLDE obtains the highest classification accuracy and requires the smallest feature information dimension.

Finally, we compared the proposed method with other state-of-the-art deep learning methods: a 1D-CNN, the CNN classifier proposed by Hu et al. [7], the five-layer CNN classifier proposed by Mei et al. [49], and the M3D-DCNN classifier proposed by He et al. [50]. All the methods were compared under the same experimental settings (number of training samples, patch size, etc.). The OAs achieved by the different methods with the different HSI datasets are listed in Table 5. As can be seen, the proposed method performs better than or comparably to the other four methods.

Conclusions
Hyperspectral sensors acquire hundreds of spectrally contiguous bands and provide abundant (but redundant) spectral information. In order to reduce the time consumption and improve the classification performance, it is necessary to extract the discriminant information before performing classification. In this paper, a novel semi-supervised tri-training algorithm for HSI classification has been proposed in conjunction with RLDE. The RLDE algorithm finds the optimal feature information, preserves the local information, and overcomes the singularity in the case of limited training samples. In the proposed algorithm, active learning is used to select the unlabeled samples, and ensemble learning is used to improve the classification result. In a comparison with other state-of-the-art deep learning methods, the proposed method achieved the highest classification accuracy with the least feature information.