A Novel Tri-Training Technique for Semi-Supervised Classification of Hyperspectral Images Based on Diversity Measurement

Abstract: This paper introduces a novel semi-supervised tri-training classification algorithm based on diversity measurement for hyperspectral imagery. In this algorithm, three measures of diversity, i.e., the double-fault measure, the disagreement metric and the correlation coefficient, are used to determine the optimal classifier combination.

Semi-supervised learning has been of great interest to hyperspectral remote sensing image analysis. In [24], semi-supervised probabilistic principal component analysis, semi-supervised local Fisher discriminant analysis and semi-supervised dimensionality reduction with pairwise constraints were extended to extract features in a hyperspectral image. In [25], a new classification methodology based on spatial-spectral label propagation was proposed. Dópido et al. proposed a framework for semi-supervised learning, which exploits active learning (AL) for unlabeled samples' selection [26]. In [27], a new semi-supervised algorithm combined spatial neighborhood information in determining the class labels of selected unlabeled samples. In [28], Tan et al. proposed a semi-supervised SVM with a segmentation-based ensemble algorithm, which uses spatial information extracted by a segmentation algorithm for unlabeled samples' selection.
Meanwhile, Blum and Mitchell proposed a prominent approach called co-training, which has become popular in semi-supervised learning [19]. This algorithm requires two sufficient and redundant views, but this requirement cannot be met for hyperspectral imagery. Then, Goldman and Zhou proposed a new co-training method called statistical co-training [29], which employs two different learning algorithms based on a single view. In [30], another co-training method called democratic co-training was proposed. However, the aforementioned algorithms employ a time-consuming cross-validation technique to determine how to label the selected unlabeled samples and how to produce the final hypothesis. Therefore, Zhou and Li developed tri-training in [31]. It neither requires the instance space to be described with sufficient and redundant views nor imposes any constraints on the supervised learning algorithms, so its applicability is broader than that of previous co-training-style algorithms. However, tri-training has drawbacks in three aspects: (1) selecting a complementary classifier may be difficult; (2) unlabeled samples with erroneous labels may be added to the training set during semi-supervised learning; (3) the final classification map may be contaminated by salt and pepper noise. In this paper, a novel tri-training algorithm is proposed. We use three measures of diversity, i.e., the double-fault measure, the disagreement metric and the correlation coefficient, to determine the optimal classifier combination; then, unlabeled samples are selected using an active learning (AL) method, and the consistent results of any two classifiers are combined with a spatial neighborhood information extraction strategy to predict the labels of unlabeled samples. Moreover, a multi-scale homogeneity (MSH) method is utilized to refine the classification result.
The remainder of this paper is organized as follows. Section 2 briefly introduces the standard tri-training algorithm and then describes the proposed approach. Section 3 presents experiments on three real hyperspectral datasets with a comparative study. Finally, Section 4 concludes the paper.

Tri-Training
In the standard tri-training algorithm, three classifiers are initially trained on datasets generated via bootstrap sampling from the original labeled data. Then, for any classifier, an unlabeled sample can be labeled as long as the other two classifiers agree on the labeling of this sample. The training process stops when the results of the three classifiers reach consistency. The final prediction is produced with a variant of majority voting among all of the classifiers.
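The loop described above can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation: the `tri_train` name, the round cap, and the scikit-learn-style `fit`/`predict` interface are our assumptions.

```python
import numpy as np

def tri_train(clfs, X_lab, y_lab, X_unlab, max_rounds=10, seed=0):
    """Sketch of the standard tri-training loop (Zhou & Li).

    clfs: three classifiers with fit(X, y)/predict(X) methods.
    Each classifier starts from a bootstrap sample of the labeled
    data; in each round, every pool sample labeled identically by two
    classifiers is added (with that label) to the third classifier's
    training set.  Training stops when the three classifiers'
    predictions on the pool no longer change.
    """
    rng = np.random.default_rng(seed)
    pools = []
    for clf in clfs:
        idx = rng.integers(0, len(X_lab), len(X_lab))  # bootstrap sample
        pools.append((X_lab[idx], y_lab[idx]))
        clf.fit(X_lab[idx], y_lab[idx])

    prev = None
    for _ in range(max_rounds):
        preds = [clf.predict(X_unlab) for clf in clfs]
        if prev is not None and all((p == q).all() for p, q in zip(preds, prev)):
            break  # the three classifiers' results reached consistency
        prev = preds
        for i in range(3):
            j, k = [m for m in range(3) if m != i]
            agree = preds[j] == preds[k]  # the other two classifiers agree
            if agree.any():
                Xi = np.vstack([pools[i][0], X_unlab[agree]])
                yi = np.concatenate([pools[i][1], preds[j][agree]])
                clfs[i].fit(Xi, yi)

    def predict(X):
        """Majority vote among the three classifiers."""
        votes = np.stack([clf.predict(X) for clf in clfs])
        return np.apply_along_axis(
            lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)
    return predict
```

Any three mutually different base classifiers can be plugged in; the returned closure implements the final majority vote.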

Classifier Selection
The principle of classifier selection is that the classifiers should differ from each other and their performance should be complementary; otherwise, the overall decision will not be better than each individual decision. Three measures of diversity are implemented to select three classifiers from SVM [32][33][34], multinomial logistic regression (MLR) [35,36], KNN [27,37] and the extreme learning machine (ELM) [38,39]. The three measures of diversity are the double-fault measure, the disagreement metric and the correlation coefficient [40], which are described below.
(1) Correlation coefficient (ρ): The correlation between the outputs (correct/wrong) of two classifiers can be measured as:

ρ_ij = (N^11 N^00 − N^01 N^10) / √((N^11 + N^10)(N^01 + N^00)(N^11 + N^01)(N^10 + N^00))

where N^ab_ij is the number of samples z_o of Z for which y_oi = a and y_oj = b (see Table 1). With the increase of ρ, the diversity of the classifiers becomes smaller.

(2) Disagreement metric (D): The disagreement between classifier outputs (correct/wrong) can be measured as:

D_ij = (N^01 + N^10) / (N^11 + N^10 + N^01 + N^00)

where N^ab_ij is defined as above. With the increase of D, the diversity of the classifiers becomes larger.

(3) Double-fault measure (DF): The double-fault between classifier outputs (correct/wrong) can be measured as:

DF_ij = N^00 / (N^11 + N^10 + N^01 + N^00)

where N^ab_ij is defined as above. With the increase of DF, the diversity of the classifiers becomes smaller.
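Given a labeled evaluation set, the three measures can be computed from the correct(1)/wrong(0) indicator vectors of a classifier pair. The sketch below is illustrative; the `diversity_measures` name and interface are ours.

```python
import numpy as np

def diversity_measures(c_i, c_j):
    """Pairwise diversity from correct(1)/wrong(0) indicator vectors.

    Returns (rho, D, DF): the correlation coefficient, the
    disagreement metric and the double-fault measure.
    """
    c_i, c_j = np.asarray(c_i), np.asarray(c_j)
    n11 = np.sum((c_i == 1) & (c_j == 1))  # both correct
    n10 = np.sum((c_i == 1) & (c_j == 0))  # only classifier i correct
    n01 = np.sum((c_i == 0) & (c_j == 1))  # only classifier j correct
    n00 = np.sum((c_i == 0) & (c_j == 0))  # both wrong
    n = n11 + n10 + n01 + n00
    rho = (n11 * n00 - n01 * n10) / np.sqrt(
        (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    D = (n01 + n10) / n
    DF = n00 / n
    return rho, D, DF
```

Computing these three values for every classifier pair in the pool yields the entries of Table 2.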

Unlabeled Sample Selection
In the standard tri-training algorithm, for any classifier, an unlabeled sample can be labeled when the other two classifiers agree on the labeling of this sample. However, when the training set is small, the label that two classifiers agree on may still be wrong. Therefore, for any classifier, we use a spatial neighborhood information extraction strategy with an AL algorithm to select the most useful spatial neighbors as the new training set, on the condition that the other two classifiers agree on the labeling of these samples.
Figure 1 illustrates how to select unlabeled samples, and the selection process includes two key steps, i.e., the construction of the candidate set and active learning. (1) The construction of the candidate set: for any classifier, we consider spatial neighborhood information together with the consistent results of the other two classifiers to build the candidate set. Firstly, unlabeled samples are selected based on the consistency of two classifiers' outputs, and those samples are considered reliable according to the standard tri-training algorithm. With a local similarity assumption, the neighbors of labeled training samples are identified using second-order spatial connectivity, and the candidate set is built by analyzing the spectral similarity of these spatial neighbors. Since the output of a classifier is based on spectral information, the candidate set is obtained based on both spectral and spatial information. Thus, these samples are more reliable.
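The second-order spatial connectivity used above, i.e., the 8-connected neighborhood of each labeled pixel, can be sketched as follows. This is an illustrative helper; the function name and set-based interface are our choices.

```python
def second_order_neighbors(coords, shape):
    """8-connected (second-order) neighbors of labeled pixel positions.

    coords: iterable of (row, col) positions of labeled training pixels;
    shape: (rows, cols) of the image.  Returns the set of in-bounds
    neighbor positions, excluding the labeled pixels themselves.
    """
    rows, cols = shape
    labeled = set(map(tuple, coords))
    nbrs = set()
    for r, c in labeled:
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == dc == 0:
                    continue  # skip the pixel itself
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and (rr, cc) not in labeled:
                    nbrs.add((rr, cc))
    return nbrs
```

The returned positions are then screened by the spectral-similarity check and the two-classifier agreement to form the candidate set.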
(2) Active learning: in semi-supervised learning, the main objective is to select the most useful and informative samples from the candidate set. However, some of the samples in the candidate set may not be useful for training the third classifier, because they may be too similar to the labeled samples. To prevent the introduction of such redundant information, the breaking ties (BT) [17] algorithm is adopted to select the most informative samples.
The decision criterion of BT is:

x̂_m = arg min_{x_m ∈ U} [ max_{k∈K} p(y_m = k|x_m) − max_{k∈K, k≠k+} p(y_m = k|x_m) ]

where k+ = arg max_{k∈K} p(y_m = k|x_m) is the most probable class for sample x_m, p(y_m = k|x_m) is the probability that the label of sample x_m is k and K is the number of classes.
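Under the BT criterion, the samples with the smallest gap between the two largest class probabilities are the most informative. A minimal sketch follows; the `breaking_ties` name and the batch-selection interface are our assumptions.

```python
import numpy as np

def breaking_ties(prob, n_select):
    """Breaking-ties sample selection.

    prob: (n_samples, n_classes) array of posterior probabilities.
    Selects the n_select samples with the smallest gap between the
    largest and second-largest class probabilities, i.e., the samples
    the classifier is least sure about.
    """
    p = np.sort(np.asarray(prob), axis=1)
    gap = p[:, -1] - p[:, -2]          # best minus second-best probability
    return np.argsort(gap)[:n_select]  # indices of the smallest gaps
```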

Multi-Scale Homogeneity Method
Some of the existing hyperspectral image classification algorithms produce classification results with salt and pepper noise. To solve this problem, we use the multi-scale homogeneity method. Let S be the initial classification result, α, β, γ (α < β < γ) be the scales of a homogeneous region, θ_i (i = 1, 2, 3) be the thresholds of those homogeneous regions and ρ be the number of samples that have the same label in a homogeneous region.
(1) An α × α homogeneous region is built in the initial classification result. If ρ ≥ θ_1, the samples in this region will have the same label; otherwise, the labels of the samples do not change. Let this new result be the second classification result.
(2) A β × β homogeneous region is built in the second classification result. If ρ ≥ θ_2, the samples in this region will have the same label; otherwise, the labels of the samples do not change. Let this new result be the third classification result.
(3) A γ × γ homogeneous region is built in the third classification result. If ρ ≥ θ_3, the samples in the homogeneous region will have the same label; otherwise, the labels of the samples do not change. This new result will be the final classification result.
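The three steps above are the same operation applied at increasing scales, so they can be sketched as one reusable pass. This is illustrative: the text does not specify whether the windows tile the image or slide pixel-by-pixel, so this sketch tiles non-overlapping scale × scale windows, and the function name is ours.

```python
import numpy as np

def homogeneity_pass(label_map, scale, theta):
    """One pass of the multi-scale homogeneity filter.

    Tiles the label map with scale x scale windows; if the most
    frequent label in a window occurs at least theta times, every
    pixel in that window is relabeled to it.  Calling this with
    increasing scales (alpha < beta < gamma) and thresholds
    (theta_1, theta_2, theta_3) reproduces the three-step refinement.
    """
    out = np.asarray(label_map).copy()
    rows, cols = out.shape
    for r in range(0, rows - scale + 1, scale):
        for c in range(0, cols - scale + 1, scale):
            win = out[r:r + scale, c:c + scale]
            labels, counts = np.unique(win, return_counts=True)
            if counts.max() >= theta:  # region is homogeneous enough
                out[r:r + scale, c:c + scale] = labels[counts.argmax()]
    return out
```

For example, `homogeneity_pass(homogeneity_pass(homogeneity_pass(S, a, t1), b, t2), g, t3)` applies the three scales in sequence.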

Semi-Supervised Classification Framework
Let L be the initial labeled training set, U be the unlabeled sample set, D_i (i = 1, 2, 3) be the classifiers and S_i (i = 1, 2, 3) be the classification results.
The procedure of the proposed method is summarized as follows.
(1) Train the classifier D_i with L and obtain the predicted classification result S_i;
(2) For the classifier D_i, select the unlabeled samples on whose labels the other two classifiers agree to build the first candidate set;
(3) For x_m ∈ L, label the neighbors of x_m (found using second-order spatial connectivity) based on Tobler's first law, and build the second candidate set;
(4) Compare the first and the second candidate sets, and select the samples that have the same label in both to build the third candidate set;
(5) Use the BT method to select the most useful and informative samples L′ from the third candidate set, L = L ∪ L′, U = U − L′;
(6) Train the classifier D_i with the new L and obtain the predicted classification result S_i;
(7) Terminate if the stopping condition is met; otherwise, go to Step (2);
(8) Take the S_i that has the highest classification accuracy among the three classifiers and process it with the multi-scale homogeneity method to obtain the final classification result.
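Steps (2)-(5) for a single classifier D_i can be condensed as follows. This is an illustrative sketch: the combination of the agreement mask, the neighbor mask and breaking ties follows the description above, while the `select_unlabeled` name and the array interface are ours, and the neighbor mask is assumed to be precomputed (e.g., from second-order spatial connectivity).

```python
import numpy as np

def select_unlabeled(pred_j, pred_k, neighbor_mask, prob_i, n_select):
    """Candidate-set construction plus breaking-ties selection for D_i.

    pred_j, pred_k: labels of the unlabeled pool predicted by the
    other two classifiers; neighbor_mask: True where a pool sample is
    a spatial neighbor of a labeled pixel; prob_i: D_i's posterior
    probabilities over the pool, shape (n_pool, n_classes).
    Returns (indices, labels) of the selected samples L'.
    """
    agree = (pred_j == pred_k) & neighbor_mask   # candidate set
    cand = np.flatnonzero(agree)
    p = np.sort(prob_i[cand], axis=1)
    gap = p[:, -1] - p[:, -2]                    # breaking-ties score
    chosen = cand[np.argsort(gap)[:n_select]]    # most informative samples
    return chosen, pred_j[chosen]
```

The returned samples and their agreed labels are added to L and removed from U before D_i is retrained.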

Data Used in the Experiments
In this study, three real hyperspectral images are used to evaluate the proposed approach.
(1) The first hyperspectral image was collected by the AVIRIS (Airborne Visible/Infrared Imaging Spectrometer) sensor over the Indian Pines region in Northwestern Indiana in 1992. This dataset has a spatial size of 145 × 145 pixels. It comprises 224 spectral channels in the wavelength range from 0.4 to 2.5 µm at 10-nm intervals with a spatial resolution of 20 m, and 202 channels were used in the experiment after noise and water absorption bands were removed. For illustrative purposes, the image scene in pseudocolor is shown in Figure 2a. The ground-truth map available for the scene, with 16 mutually-exclusive ground-truth classes, is shown in Figure 2b.
(2) The second hyperspectral image was collected by the ROSIS (Reflective Optics System Imaging Spectrometer) sensor over the urban area of the University of Pavia, Italy. This dataset has a spatial size of 610 × 340 pixels. It comprises 115 spectral channels in the wavelength range from 0.43 to 0.86 µm with a spatial resolution of 1.3 m, and 103 channels were used in the experiment after noise and water absorption bands were removed. For illustrative purposes, the image scene in pseudocolor is shown in Figure 3a. The ground-truth map available for the scene, with 9 mutually-exclusive ground-truth classes, is shown in Figure 3b [41].
(3) The third hyperspectral image was collected by the AVIRIS sensor over Salinas Valley, Southern California, in 1998. This dataset has a spatial size of 512 × 217 pixels. It comprises 224 spectral channels in the wavelength range from 0.4 to 2.5 µm with a spatial resolution of 3.7 m, and 204 channels were used in the experiment after noisy and water absorption bands were removed. For illustrative purposes, the image scene in pseudocolor is shown in Figure 4a. The ground-truth map available for the scene, with 16 mutually-exclusive ground-truth classes, is shown in Figure 4b.

Parameter Setting
In our experiments, the involved parameters were set as follows.
(1) Classifier parameters: k = 3 for KNN; the number of hidden neurons is 50, and the activation function is 'sigmoid' in the ELM; the parameters of MLR use the default values.
(2) Multi-scale homogeneity.
All experiments are carried out 10 times, and the averaged results are reported.
It is noteworthy that TT_AL_MSH denotes the proposed approach, and TT is the standard tri-training method. Additionally, the performance of these approaches is objectively evaluated in terms of global accuracy (GA), which includes the overall accuracy (OA), the average accuracy (AA) and the kappa coefficient (kappa). SVM and MLR have been widely used for hyperspectral image classification. ELM is a recently-developed simple and fast neural network classifier, and KNN is a traditional classifier whose kernel operation is the distance computation. The formation mechanisms of these classifiers are different. Therefore, we choose four base classifiers from a classifier pool: SVM (1), MLR (2), KNN (3) and ELM (4). In addition, the three measures are used to compute their diversity (as shown in Table 2) by using the AVIRIS Indian Pines dataset. From Table 2, the same combination, which contains MLR, KNN and ELM, is selected by the D and ρ diversity measures; the combination of SVM, KNN and ELM is selected by DF. In order to select the optimal combination, we used the TT algorithm to test the performance of the different classifier combinations. As shown in Table 3, the combination of MLR, KNN and ELM is the optimal one.
For two methods to be compared, let f_11 denote the number of samples that both methods classify correctly, f_22 the number that both misclassify, f_12 the number of samples misclassified by Method 1 but not by Method 2, and f_21 the number misclassified by Method 2 but not by Method 1 [42]. Then, McNemar's test statistic is:

z = (f_12 − f_21) / √(f_12 + f_21)

For a 5% level of significance, the corresponding |z| value is 1.96; a |z| value greater than this quantity means that the two methods have a significant performance discrepancy.
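Given the four counts, the statistic is a one-line computation; the sketch below is illustrative (the `mcnemar_z` name is ours).

```python
import math

def mcnemar_z(f12, f21):
    """McNemar's test statistic for comparing two classifiers.

    f12: samples misclassified by method 1 but not by method 2;
    f21: the reverse.  |z| > 1.96 indicates a significant performance
    difference at the 5% level.
    """
    return (f12 - f21) / math.sqrt(f12 + f21)
```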
Table 4 shows the significance level of TT_MKE (MKE is the combination of MLR, KNN and ELM) compared against TT_AL_MSH_MKE, with 5, 10 and 15 initial training samples per class. Obviously, the performance of the proposed TT_AL_MSH_MKE is statistically different from that of TT_MKE.

Experiment on the Indian Pines Dataset
Table 5 shows the OA statistical results of TT_AL_MSH_MKE, TT_AL_MSH_SKE (SKE is the combination of SVM, KNN and ELM), TT_MKE and TT_SKE. It can be clearly seen that the proposed TT_AL_MSH_MKE produces higher classification accuracy than the standard TT_MKE. With 5, 10 and 15 initial training samples per class, the OA of TT_AL_MSH_MKE increases by 17.09%, 20.14% and 17.09%, respectively, compared with TT_MKE. Figure 5 shows that the OA greatly increases with the number of unlabeled samples. When the number of unlabeled samples reaches 700, the OA becomes stable. For illustrative purposes, the classification maps of the AVIRIS data are provided in Figure 6. As observed from these maps, the proposed methods can effectively reduce the salt and pepper noise.

Experiment on the University of Pavia Dataset
Table 6 shows the OA of TT_AL_MSH_MKE and TT_MKE. The proposed TT_AL_MSH_MKE can produce higher accuracy than the standard TT_MKE. With 5, 10 and 15 initial training samples per class, the OA of TT_AL_MSH_MKE increases by 15.69%, 12.84% and 13.31%, respectively, compared with TT_MKE. Figure 7 shows that the OA greatly increases with the number of unlabeled samples, and the performance of TT_AL_MSH_MKE is clearly superior to that of TT_MKE. However, the performance of TT_MKE is not stable, because unlabeled samples that are mislabeled are introduced into the training process. The classification maps of the ROSIS Pavia University data are shown in Figure 8, where the proposed methods produce smoother maps.

Experiment on the Salinas Valley Dataset
Table 7 shows the OA of TT_AL_MSH_MKE and TT_MKE. The proposed TT_AL_MSH_MKE can produce higher accuracy than the standard TT_MKE. With 5, 10 and 15 initial training samples per class, the OA of TT_AL_MSH_MKE increases by 7.24%, 6.04% and 6.68%, respectively, compared with TT_MKE. Figure 9 shows that the OA greatly increases with the number of unlabeled samples, and the performance of TT_AL_MSH_MKE is clearly superior to that of TT_MKE. However, the performance of TT_MKE is not stable when the number of initial training samples per class is 5, because unlabeled samples that are mislabeled are introduced into the training process. The classification maps of the Salinas data are shown in Figure 10, where the proposed methods produce smoother maps.

Conclusions
In this paper, a novel semi-supervised tri-training classification algorithm was proposed for hyperspectral imagery, in which three measures of diversity, i.e., the double-fault measure, the disagreement metric and the correlation coefficient, were used to determine the optimal classifier combination. Then, unlabeled samples were selected using the AL method and the consistent results of the other two classifiers combined with spatial neighborhood information to predict the labels of unlabeled samples. Moreover, we utilized the multi-scale homogeneity method to refine the final classification result. To confirm the effectiveness of the proposed TT_AL_MSH_MKE, experiments were conducted on three real hyperspectral datasets, in comparison with the standard TT_MKE. Moreover, some methods that combine semi-supervised spectral-spatial classification with active learning were selected to validate the performance of the proposed method. Experimental results demonstrate that the OA of the proposed approach improves by more than 10% compared with TT_MKE, and the proposed method outperforms the other methods, in particular for small training sets with 10 initial samples per class or fewer. Meanwhile, the proposed method can effectively reduce the salt and pepper noise in the classification maps.

Figure 1. The process of selecting unlabeled samples.

Figure 2. (a) Pseudocolor composite of the AVIRIS Indian Pines dataset; (b) the map with 16 mutually-exclusive ground-truth classes.
Figure 3. (a) Pseudocolor composite of the ROSIS Pavia University dataset; (b) the map with 9 mutually-exclusive ground-truth classes.

Figure 4. (a) Pseudocolor composite of the AVIRIS Salinas Valley scene; (b) the test area with 16 mutually-exclusive ground-truth classes.

The parameter α is set to follow Tobler's first law, and the parameter β is set through many experiments to ascertain the optimal value.
(3) Training sets: L = 5, 10, 15. We select 5, 10 and 15 samples per class as the initial labeled training sets.
(4) Other settings: the number of the most useful and informative samples L′ selected in one iteration is 100.

Figure 5. Overall classification accuracies obtained for the AVIRIS Indian Pines dataset using two different techniques by using 5, 10 and 15 labeled samples per class (estimated labels of unlabeled samples were used in all of the experiments).
Figure 6. Classification maps of the AVIRIS Indian Pines dataset.

Figure 7. Overall classification accuracies obtained for the ROSIS Pavia University dataset using two different techniques by using 5, 10 and 15 labeled samples per class (estimated labels of unlabeled samples were used in all of the experiments).

Figure 9. Overall classification accuracies obtained for the AVIRIS Salinas Valley dataset using two different techniques by using 5, 10 and 15 labeled samples per class (estimated labels of unlabeled samples were used in all of the experiments).

Table 1. The relationship between a pair of classifiers.

Table 2. The diversity values (in terms of D, DF and ρ). The greatest diversity is marked in bold italics.

Table 3. The optimal combination selected by the diversity measures and tri-training (TT) (overall accuracy).

Table 4. The values of the z-test on the different datasets. AL, active learning.

Table 5. Overall accuracy using two different techniques for the AVIRIS Indian Pines data, with 5, 10 and 15 initial training samples per class. The best OA results of each table are marked in bold italics.

Table 6. Overall accuracy using two different techniques for the ROSIS Pavia University data, with 5, 10 and 15 initial training samples per class. The best OA results of each table are marked in bold italics.

Table 7. Overall accuracy using two different techniques for the AVIRIS Salinas Valley data, with 5, 10 and 15 initial training samples per class. The best OA results of each table are marked in bold italics.
