Hyperspectral Image Classification Based on Semi-Supervised Rotation Forest

Ensemble learning is widely used to combine a variety of weak learners into a relatively stronger learner by reducing either the bias or the variance of the individual learners. Rotation forest (RoF), which combines feature extraction and classifier ensembles, has been successfully applied to hyperspectral (HS) image classification over the last decade by promoting the diversity of the base classifiers. Generally, RoF uses principal component analysis (PCA) as the rotation tool; PCA, however, is an unsupervised feature extraction method that does not consider the discriminative information of the classes, and it can therefore be sub-optimal for classification tasks. In this paper, we propose an improved RoF algorithm in which semi-supervised local discriminant analysis is used as the feature rotation tool. The proposed algorithm, named semi-supervised rotation forest (SSRoF), aims to take advantage of both the discriminative information and the local structural information provided by the limited labeled and massive unlabeled samples, thus providing better class separability for subsequent classification. In order to promote the diversity of features, we also adjust the semi-supervised local discriminant analysis into a weighted form, which balances the contributions of the labeled and unlabeled samples. Experiments on several hyperspectral images demonstrate the effectiveness of the proposed algorithm compared with several state-of-the-art ensemble learning approaches.


Introduction
Hyperspectral (HS) image classification suffers from a variety of difficulties, such as high dimensionality, limited or unbalanced training samples, spectral variability, and mixed pixels. It is well known that increasing data dimensionality and high redundancy between features can cause problems during data analysis, for example in the context of supervised classification. A considerable amount of literature has been published on overcoming these challenges and performing hyperspectral image classification effectively [1]. Machine learning techniques such as artificial neural networks (ANNs) [2], support vector machines (SVMs) [3], multinomial logistic regression [4], active learning, and semi-supervised learning [5], as well as other methods like hyperspectral unmixing [6], object-oriented classification [7], and multiple classifier systems [8], have also been widely investigated in recent years.
The multiple classifier system (MCS), also called classifier ensemble or ensemble learning (EL) in the machine learning field, is a popular strategy for improving the classification performance of hyperspectral images by combining the predictions of multiple classifiers, thereby reducing the dependence on the performance of a single classifier [8][9][10][11]. The concept of MCS does not refer to a specific algorithm but to the idea of combining outputs from more than one classifier to enhance classification accuracy [12]. These outputs may result from different variants of the same classifier, or from different classifiers trained on the same or different training samples. Previous studies have demonstrated, both theoretically and experimentally, that one of the main reasons for the success of ensembles is the diversity among the individual learners (namely, the base classifiers) [13], because combining similar classification results would not further improve the accuracy.
MCSs have been widely applied to HS remote sensing image classification. Two approaches for constructing classifier ensembles are perceived as "classic", bagging and boosting [14,15], and numerous algorithms were successively derived from them. Bagging creates many classifiers, with each base learner trained on a new bootstrapped training data set [16]. Boosting processes the data with iterative retraining and concentrates on the difficult samples, with the goal of correctly classifying these samples in the next iteration [17,18]. Ho [19] proposed random subspace ensembles, which use random subsets of features instead of the entire feature set for each individual classifier. The rationale of the random subspace method is to break down a complex high-dimensional problem into several lower-dimensional problems, thereby alleviating the curse of dimensionality. By integrating the bagging and random subspace approaches, Breiman [20] proposed the well-known random forest (RF) algorithm [21,22]. The characteristics of RF, including reasonable computational cost, inherent support of parallelism, highly accurate predictions, and the ability to handle a very large number of input variables without overfitting, make it a popular and promising classification algorithm for remote sensing data [23][24][25]. Generally, the decision tree (DT) is used as the base classifier in ensemble learning because of its high computational efficiency, easy implementation, and sensitivity to slight changes in the data. Recently, some researchers have incorporated several prevalent machine learning algorithms into ensemble learning. Gurram and Kwon [26] proposed a sparse kernel-based support vector machine (SVM) ensemble algorithm that yields better performance than an SVM trained by cross-validation. Samat et al. [27] proposed Bagging-based and AdaBoost-based extreme learning machines to overcome the input parameter randomness of traditional extreme learning machines. For a more detailed description of EL, refer to [28,29].
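The three classic strategies described above can be illustrated in a few lines. This is a minimal sketch using scikit-learn with synthetic stand-in data (not the hyperspectral scenes used in this paper): bagging trains each tree on a bootstrap sample, the random subspace method gives each tree a random feature subset, and random forest combines bootstrap sampling with per-node feature subsets.

```python
# Minimal illustration of bagging, random subspace, and random forest
# ensembles; the data set here is synthetic stand-in data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)

# Bagging: each tree is trained on a bootstrap sample of the training set.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                        random_state=0)
# Random subspace: each tree sees a random subset of the features instead.
sub = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                        max_features=0.5, bootstrap=False,
                        bootstrap_features=False, random_state=0)
# Random forest: bootstrap samples plus per-node random feature subsets.
rf = RandomForestClassifier(n_estimators=10, random_state=0)

for name, clf in [("bagging", bag), ("random subspace", sub),
                  ("random forest", rf)]:
    print(name, clf.fit(Xtr, ytr).score(Xte, yte))
```

All three ensembles aggregate the votes of their decision trees; the diversity comes from the sampling of training instances, of features, or of both.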
In a paper by Rodriguez and Kuncheva [30], the authors proposed a new ensemble classifier called rotation forest (RoF). By applying feature extraction (i.e., principal component analysis, PCA) to random feature subspaces, RoF greatly promotes the diversity and accuracy of the classifiers. Thereafter, several improved algorithms were proposed based on the idea of RoF, for example, the anticipative hybrid extreme rotation forest [31] and the rotation random forest with kernel PCA (RoRF-KPCA) [32]. Chen et al. [33] proposed combining rotation forest with multi-scale segmentation for hyperspectral data classification, which incorporates spatial information to generate classification maps with homogeneous regions.
Numerous research studies show that RoF surpasses conventional RF due to its high diversity in training samples and features. Nevertheless, it is well documented in the literature that PCA is not particularly suitable for feature extraction (FE) in classification, because it does not include discriminative information in calculating the optimal rotation of the axes [30,34,35]. Although the authors of [30] explain that PCA is also valuable as a diversifying heuristic, better classification results can be expected if good class-discriminative directions are found. Therefore, in this paper, we present an improved ensemble learning method, which uses a semi-supervised feature extraction technique instead of PCA during the "rotation" process of the classical RoF approach. The proposed algorithm, named semi-supervised rotation forest (SSRoF), applies the semi-supervised local discriminant analysis (SLDA) FE method, which was proposed in our previous work [36], to fully take advantage of both the class separability and the local neighbor information, with the aim of finding better rotation directions. In addition, to further enhance the diversity of features, we propose to use a weighted form of SLDA, which can balance the contributions of the labeled and unlabeled samples. The main contributions of this paper are as follows: (1) an exploration of the benefit of unlabeled samples in conventional ensemble learning methods; (2) an adjustment of the previous SLDA technique into a weighted generalized eigenvalue problem; (3) the construction of an ensemble of classifiers in which the weights can be randomly selected, thereby reducing the human effort of determining the optimal parameters. The remainder of this paper is organized as follows. Section 2 describes the study data sets and elaborates the proposed semi-supervised rotation forest algorithm; for better understanding, the SLDA feature extraction method is also briefly introduced. Section 3 reports the experiments and results. Finally, the conclusions are drawn in Section 4.

Materials and Methodology
In this section, we first introduce the experimental data sets, and then elaborate on the proposed ensemble learning algorithm.

Study Data Sets
The experimental data sets include four HS images acquired by different sensors with different resolutions. Each HS image is accompanied by a co-registered ground truth image.
A subset of size 640 × 320 pixels is used, which contains 12 classes in the corresponding ground truth image. Figure 1 shows the experimental data sets.

Weighted Semi-Supervised Local Discriminant Analysis
Semi-supervised local discriminant analysis is a semi-supervised feature extraction method that has been applied to hyperspectral image classification. It combines a supervised FE method (local Fisher discriminant analysis) and an unsupervised FE method (neighborhood preserving embedding), and thus attempts to discover the local discriminative information of the data while preserving the local neighbor information [36]. Compared with other typical semi-supervised FE methods, SLDA focuses more on the exploration of local information and gives a more accurate description of the distribution of the samples. For better illustration, we first briefly review the underlying feature extraction methods.
Let x_i ∈ R^d be a d-dimensional sample vector, and X = {x_i}_{i=1}^n be the matrix of n samples. Z = T^T X (Z ∈ R^{r×n}) is the low-dimensional representation of the sample matrix, where T ∈ R^{d×r} is the transformation matrix and ^T denotes the transpose.
Many dimensionality reduction techniques developed so far involve an optimization problem of the following form [40]:

T = argmax_{T ∈ R^{d×r}} tr[(T^T S_w T)^{-1} T^T S_b T]  (1)

Generally speaking, S_b (and S_w) corresponds to the quantity that we want to increase (and decrease), for example, the between-class scatter (and the within-class scatter). Equation (1) is equal to the solution of the following generalized eigenvalue problem:

S_b ϕ = λ S_w ϕ  (2)

where {ϕ_k}_{k=1}^d are the generalized eigenvectors associated with the generalized eigenvalues λ_1 ≥ λ_2 ≥ . . . ≥ λ_d, and T = [ϕ_1, . . . , ϕ_r] is composed of the first r eigenvectors corresponding to the largest eigenvalues {λ_k}_{k=1}^r. In particular, when S_b is the total scatter matrix of all samples and S_w = I_{d×d}, where I denotes the identity matrix, Equation (1) turns into the PCA method.
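The generalized eigenvalue problem of Equation (2) can be solved numerically in a few lines. This is a short sketch with random stand-in data: choosing S_b as the total scatter and S_w = I recovers the PCA special case noted above.

```python
# Sketch of the generalized eigenvalue problem S_b phi = lambda S_w phi;
# with S_b the total scatter and S_w = I this reduces to PCA.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))            # d = 5 features, n = 100 samples
d, n = X.shape
Xc = X - X.mean(axis=1, keepdims=True)   # center the samples
S_b = Xc @ Xc.T / n                      # total scatter matrix (PCA case)
S_w = np.eye(d)                          # identity, as in the PCA special case

# eigh solves S_b phi = lambda S_w phi; eigenvalues are in ascending order.
lam, Phi = eigh(S_b, S_w)
r = 2
T = Phi[:, ::-1][:, :r]                  # the r leading eigenvectors
Z = T.T @ X                              # low-dimensional representation, r x n
print(Z.shape)                           # (2, 100)
```

For supervised methods such as LFDA, only the construction of S_b and S_w changes; the eigen-decomposition step is identical.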

Local Fisher Discriminant Analysis (LFDA)
Suppose y_i = c, c ∈ {1, 2, . . . , C} is the class label associated with the sample vector x_i, C is the number of classes, and n_c is the number of samples in class c, so that ∑_{c=1}^C n_c = n. Let S_b and S_w be the local between-class and within-class scatter matrices, respectively, defined by [41]

S_b = (1/2) ∑_{i,j=1}^n W^b_{i,j} (x_i − x_j)(x_i − x_j)^T,  S_w = (1/2) ∑_{i,j=1}^n W^w_{i,j} (x_i − x_j)(x_i − x_j)^T  (3)

where W^b and W^w are n × n weight matrices with W^b_{i,j} = A_{i,j}(1/n − 1/n_c) if y_i = y_j = c and W^b_{i,j} = 1/n if y_i ≠ y_j, and W^w_{i,j} = A_{i,j}/n_c if y_i = y_j = c and W^w_{i,j} = 0 if y_i ≠ y_j. Here A_{i,j} is the affinity value between x_i and x_j: A_{i,j} is large if the two samples are close, and vice versa (the definition of A_{i,j} can be found in [42]). With these definitions, Equation (2) turns into a local Fisher discriminant analysis problem. Note that the values for sample pairs in different classes are not weighted by the affinity. If ∀ i, j, A_{i,j} = 1, then LFDA degenerates into classical Fisher discriminant analysis (FDA, or linear discriminant analysis, LDA) [43]. Thus, LFDA can be regarded as a localized variant of FDA, which overcomes the weakness of LDA against within-class multimodality and outliers.
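The local weight matrices and scatter matrices above can be computed directly. The following is an illustrative sketch on random stand-in data; for simplicity, the affinity A uses a plain heat kernel rather than the local-scaling affinity of [42].

```python
# Illustrative computation of LFDA's local weight and scatter matrices;
# the affinity A here is a simple heat kernel (a stand-in for [42]).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))             # rows are samples: n = 40, d = 3
y = rng.integers(0, 2, size=40)          # two classes
n = len(y)

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
A = np.exp(-d2)                          # affinity: large for close pairs

W_w = np.zeros((n, n))
W_b = np.full((n, n), 1.0 / n)           # value for pairs in different classes
for c in np.unique(y):
    same = np.outer(y == c, y == c)      # mask of same-class pairs
    n_c = (y == c).sum()
    W_w[same] = (A / n_c)[same]
    W_b[same] = (A * (1.0 / n - 1.0 / n_c))[same]

diff = X[:, None, :] - X[None, :, :]     # pairwise differences x_i - x_j
S_w = 0.5 * np.einsum('ij,ijk,ijl->kl', W_w, diff, diff)
S_b = 0.5 * np.einsum('ij,ijk,ijl->kl', W_b, diff, diff)
print(S_w.shape, S_b.shape)              # (3, 3) (3, 3)
```

Setting every affinity A_{i,j} to one in this sketch reproduces the classical FDA scatter matrices, matching the degeneration noted above.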

Neighborhood Preserving Embedding (NPE)
NPE is an unsupervised feature extraction method that seeks a projection that preserves the neighboring data structure in the low-dimensional feature space [44]. It can characterize the local structural information of massive unlabeled samples. The first step of NPE is also to construct an adjacency graph, and then to compute the weight matrix Q by solving the following objective function:

min_Q ∑_{i=1}^n || x_i − ∑_j Q_{i,j} x_j ||^2,  s.t. ∑_j Q_{i,j} = 1,

where Q_{i,j} = 0 unless x_j is one of the K-nearest neighbors (KNN) of x_i. In other words, each sample is reconstructed from its K-nearest neighbors. The goal of NPE is then to preserve this neighbor relationship in the projected low-dimensional space:

min_T ∑_{i=1}^n || T^T x_i − ∑_j Q_{i,j} T^T x_j ||^2.

By imposing the constraint T^T X X^T T = I, the transformation matrix can be optimized by solving the following generalized eigenvalue problem:

X M X^T ϕ = λ X X^T ϕ  (9)

where ϕ denotes the generalized eigenvectors and M = (I − Q)^T (I − Q).
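The reconstruction-weight step of NPE can be sketched as follows: each sample is expressed as an affine combination of its K nearest neighbors (weights summing to one), as in locally linear embedding, and the resulting M = (I − Q)^T (I − Q) enters the eigenproblem of Equation (9). The data below are random stand-ins.

```python
# Sketch of the NPE reconstruction-weight step on random stand-in data:
# each sample is reconstructed from its K nearest neighbors with weights
# that sum to one; M then enters the NPE generalized eigenproblem.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))             # n = 30 samples (rows), d = 4
n, d = X.shape
K = 5
Q = np.zeros((n, n))

for i in range(n):
    dist = np.linalg.norm(X - X[i], axis=1)
    nbrs = np.argsort(dist)[1:K + 1]     # K nearest neighbors, skipping i
    G = (X[nbrs] - X[i]) @ (X[nbrs] - X[i]).T   # local Gram matrix
    G += 1e-3 * np.trace(G) * np.eye(K)  # small ridge for numerical stability
    w = np.linalg.solve(G, np.ones(K))
    Q[i, nbrs] = w / w.sum()             # normalize so the weights sum to one

M = (np.eye(n) - Q).T @ (np.eye(n) - Q)
print(np.allclose(Q.sum(axis=1), 1.0))   # True
```

Because M is positive semi-definite by construction, the projections sought by NPE correspond to the smallest generalized eigenvalues of the resulting eigenproblem.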

Weighted SLDA
It has been demonstrated that the performance of LFDA (and of other supervised dimensionality reduction methods) tends to degrade if only a small number of labeled samples are available [40], while PCA or NPE (and other unsupervised feature extraction methods) generally lose the discriminative information carried by the labels. Combining supervised and unsupervised FE methods [45] is therefore believed to compensate for each other's weaknesses. In this paper, we consider the combination of the aforementioned LFDA and NPE methods. As mentioned above, feature extraction techniques can be transformed into eigenvalue problems; thus, a possible way to combine LFDA and NPE is to merge the above generalized eigenvalue problems [40], using a trade-off parameter β ∈ [0, 1].

Calculating the S_b and S_w of LFDA directly is time-consuming; an efficient implementation can be used according to [41]. Let S_m denote the local mixture scatter matrix, S_m = S_b + S_w, with weight matrix W^m = W^b + W^w. The within-class scatter in Equation (3) can be expressed as

S_w = X (D^w − W^w) X^T,

where D^w is the n-dimensional diagonal matrix with D^w_{i,i} = ∑_{j=1}^n W^w_{i,j}. Similarly, S_m can be expressed as

S_m = X (D^m − W^m) X^T,

where D^m is the n-dimensional diagonal matrix with D^m_{i,i} = ∑_{j=1}^n W^m_{i,j}. Therefore, the generalized eigenvalue problem of LFDA, namely Equation (2), can be rewritten as

X L_b X^T ϕ = λ X L_w X^T ϕ,

where L_w = D^w − W^w and L_b = (D^m − W^m) − (D^w − W^w), from which we can see that the eigenvalue problem of LFDA has a similar form to that of NPE, i.e., Equation (9). Suppose the training sample vectors are arranged as X = [X_L, X_U], where X_L = {x_i}_{i=1}^{n_l} denotes the labeled samples, X_U = {x_i}_{i=n_l+1}^{n} denotes the unlabeled samples, and n = n_l + n_u is the total number of available samples. Padding the labeled-sample matrices L_b and L_w with zeros for the unlabeled entries yields the n × n matrices L̃_b and L̃_w. Therefore, the weighted SLDA is equal to the solution of the following generalized eigenvalue problem:

[β X L̃_b X^T + (1 − β) X X^T] ϕ = λ [β X L̃_w X^T + (1 − β) X M X^T] ϕ  (17)

where β is the trade-off parameter. In general, 0 < β < 1 inherits the characteristics of both LFDA and NPE, and thus makes full use of both the class discriminative information and the local neighbor structural information. In practice, searching for the optimal β is time-consuming and sometimes impractical if insufficient labeled samples are available for validation. Several research studies suggest that ensemble learning methods can be employed to avoid the huge effort of searching for optimal parameters [46,47]. On the other hand, different parameters also lead to diversity among features or classifiers, which benefits the generalization performance of the ensembles. Hence, we present an EL method based on the idea of RoF and the weighted SLDA algorithm.
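The β-weighted merge of the two eigenproblems can be sketched schematically. The matrices below are random positive (semi-)definite stand-ins, not quantities computed from data, and the exact form of the merge is an assumption in the style of SELF-like combinations; the point illustrated is that the scatter matrices are built once and only the generalized eigenproblem is re-solved for each β.

```python
# Schematic sketch of a beta-weighted merge of LFDA- and NPE-side matrices.
# All matrices are random stand-ins; the merge form is an assumption.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(3)
d, r = 6, 2
B = rng.normal(size=(d, d)); S_b = B @ B.T              # stand-in: between-class side
W = rng.normal(size=(d, d)); S_w = W @ W.T + np.eye(d)  # stand-in: within-class side
E = rng.normal(size=(d, d)); XMXt = E @ E.T             # stand-in: X M X^T term
F = rng.normal(size=(d, 20)); XXt = F @ F.T             # stand-in: X X^T term

transforms = {}
for beta in np.arange(0.1, 1.01, 0.1):
    S_B = beta * S_b + (1 - beta) * XXt      # weighted "maximize" side
    S_W = beta * S_w + (1 - beta) * XMXt     # weighted "minimize" side
    lam, Phi = eigh(S_B, S_W)                # generalized eigenproblem
    transforms[round(beta, 1)] = Phi[:, ::-1][:, :r]  # r leading eigenvectors

print(len(transforms))                       # 10
```

This mirrors the cost argument made for SSRoF below: the expensive scatter computations are shared across all ten values of β, so sweeping β adds only repeated eigen-decompositions.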

Proposed Semi-Supervised Rotation Forest
Rotation forest was developed from conventional random forest, building independent decision trees on different sets of extracted features. It consists of splitting the feature set into several random disjoint subsets, running PCA separately on each subset, and reassembling the extracted features [30,48]. By applying different splits of the features, diverse classifiers are obtained. The main steps of RoF are briefly presented as follows:

1. The original feature set is divided randomly into K disjoint subsets, with each subset containing M features;
2. Use the bootstrap approach to select a subset of the training samples for each feature subset (typically 75% of the total training samples);
3. Run PCA on each feature subset and store the transformation coefficients;
4. Reorder the coefficients to match the original features, and rotate the samples using the obtained coefficients (i.e., feature extraction);
5. Train a DT on the rotated training samples and classify the rotated testing samples;
6. Repeat the process L times to obtain multiple classifiers, and integrate the classification results with a majority voting rule.
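The six steps above can be sketched as a compact rotation forest. This is an illustrative implementation in a scikit-learn environment on a small stand-in data set, not the authors' exact code (for instance, the PCA loadings are applied here without per-subset mean removal).

```python
# Illustrative rotation forest following the six steps above (a sketch,
# not the authors' implementation); iris serves as stand-in data.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def rotation_forest_fit(X, y, n_trees=10, K=2, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    ensemble = []
    for _ in range(n_trees):
        subsets = np.array_split(rng.permutation(d), K)  # step 1: random split
        R = np.zeros((d, d))                             # rotation matrix
        for feats in subsets:
            boot = rng.choice(n, size=int(0.75 * n))     # step 2: bootstrap
            pca = PCA().fit(X[np.ix_(boot, feats)])      # step 3: PCA per subset
            R[np.ix_(feats, feats)] = pca.components_.T  # step 4: reorder blocks
        tree = DecisionTreeClassifier(random_state=0).fit(X @ R, y)  # step 5
        ensemble.append((R, tree))
    return ensemble

def rotation_forest_predict(ensemble, X):
    votes = np.array([tree.predict(X @ R) for R, tree in ensemble])
    # step 6: majority vote over the L trees
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

X, y = load_iris(return_X_y=True)
model = rotation_forest_fit(X, y)
print((rotation_forest_predict(model, X) == y).mean())
```

Because each tree sees a full-rank rotation of the whole feature space (rather than a subset of features), every base classifier uses all the information in the data while still differing from its peers.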
By substituting SLDA for the PCA method, we propose the following SSRoF ensemble algorithm. Apart from the different FE methods between Algorithm 1 and RoF, we use different weights (β) to balance the discriminative information and the structural information, thereby enhancing the diversity of features. Although the computation of the eigenvector matrix is repeated ten times (corresponding to the different values of β) for each feature subset, the within-class and between-class scatter matrices are invariant to the weights, so the computational cost is greatly reduced. Of course, the discrete values of β can be set with different step sizes; we recommend the values above considering both the diversity and the computation time. Steps 2-8 of Algorithm 1 are:

2. Randomly select a subset of samples from X_L and X_U, respectively (typically 75% of the samples), using the bootstrap approach;
3. Perform the weighted SLDA algorithm on the subsets of X_L and X_U to obtain the pairs of between-class and within-class scatter matrices in Equation (17);
For β = 0.1 : 0.1 : 1
4. Solve the generalized eigenvalue problem of Equation (17) for the current β to obtain the eigenvector matrix T_{j,β};
5. Construct the transformation matrix T_β = [T_{1,β}, T_{2,β}, . . . , T_{K,β}] by merging the eigenvector matrices, and rearrange the columns of T_β to match the order of the original features;
6. Build a DT sub-classifier using T_β^T X_L;
7. Perform classification for T_β^T X_T by using the sub-classifier;
End for
End for
8. Use a majority voting rule over the L × 10 sub-classifiers to compute the confidence of X_T and assign a class label to each testing sample.

Experimental Results and Discussion
In this section, we report the experiments on the four groups of hyperspectral images. First, the presented method is compared with several other EL algorithms to show its advantages. Then, we evaluate the performance of our method under different parameter settings.

Experimental Setup
In order to demonstrate the advantages of the proposed algorithm, we conducted experiments under different numbers of training samples and compared the results with several state-of-the-art ensemble learning methods, namely random forest (RF), the semi-supervised feature extraction combined RF ensemble method (SSFE-RF) [22], rotation forest (RoF) [30], and rotation random forest with kernel PCA (RoRF-KPCA) [32]. For better comparison, the SLDA method was also used as a preprocessing step combined with the original RoF method (referred to as SLDA-RoF). Finally, the LFDA and NPE methods were also used as rotation tools in the RoF scheme (RoF-LFDA and RoF-NPE).
The number of trees was set to L = 10 in all cases, and the classification and regression tree (CART) was adopted as the base classifier. The number of features in each subset was set to M = 10 for SSFE-RF, RoF, RoF-LFDA, RoF-NPE, and SSRoF. For RoRF-KPCA, Xia et al. [32] suggest that a small number of features per subset increases the classification performance; as such, we set M = 5. For RF, the number of features considered at each node was set to the square root of the number of features used. The number of extracted features was set equal to M for RoF, RoRF-KPCA, RoF-LFDA, RoF-NPE, and SSRoF. For SLDA, the number of extracted features was set to half of the original features, and the other parameters were set the same as for RoF. For RoRF-KPCA, it is quite difficult to select the optimal kernel parameters. Xia et al. [32] state that parameter tuning is needed, but that different kernel functions (linear, radial basis function, and polynomial) provide very similar results, making this choice not critical in this context. Considering the performance enhancement and the computational cost, in our experiments we use the polynomial kernel with degree two.
The performance is evaluated by the overall accuracy (OA) and the Kappa coefficient. In all cases, we conduct ten independent Monte Carlo runs with respect to the labeled training set drawn from the ground truth images, and the reported results are the averages of the 10 runs. The numbers of available samples are listed in Table 1.
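The two evaluation measures can be computed directly from a confusion matrix, as in the following minimal sketch with a toy label vector.

```python
# Minimal computation of overall accuracy (OA) and the Kappa coefficient
# from a confusion matrix, the two evaluation measures used here.
import numpy as np

def oa_kappa(y_true, y_pred):
    classes = np.unique(np.concatenate([y_true, y_pred]))
    k = len(classes)
    cm = np.zeros((k, k))
    for t, p in zip(y_true, y_pred):
        cm[np.searchsorted(classes, t), np.searchsorted(classes, p)] += 1
    n = cm.sum()
    oa = np.trace(cm) / n                           # observed agreement
    pe = (cm.sum(axis=0) @ cm.sum(axis=1)) / n**2   # chance agreement
    return oa, (oa - pe) / (1 - pe)

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])
oa, kappa = oa_kappa(y_true, y_pred)
print(round(oa, 3), round(kappa, 3))                # 0.833 0.75
```

Kappa discounts the agreement expected by chance (derived from the row and column marginals), which makes it a stricter measure than OA when the class distribution is unbalanced.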

Performance Evaluation
The comparison of the different EL algorithms is presented here. We randomly selected 1%, 2%, and 5% of the samples of each class as training samples for the first three data sets, and 5%, 10%, and 20% for the last data set. The remaining samples were used for testing. Table 2 lists the classification results of the algorithms under different numbers of samples. The upper line in each cell denotes the overall accuracy, and the lower line the Kappa value. For clarity, the best results are shown in different colors.
From the table, it can clearly be seen that all the other methods yielded much higher accuracies than the conventional RF method. SSFE-RF achieved higher accuracies than RF due to the increased number of classifiers and the semi-supervised feature extraction method; in particular, it performed remarkably well on the San Diego data set. Moreover, except for SLDA-RoF, all of the other RoF-based approaches also surpassed the RF-based methods in most cases, which demonstrates the promotion of diversity owing to the random feature extraction. RoRF-KPCA yielded results similar to RoF, although it considers the nonlinear characteristics of hyperspectral data and should construct reliable rotation matrices that generate high-precision classification results. A probable reason is the selection of sub-optimal parameters for the kernel functions. However, as mentioned above, searching for the optimal parameters remains problematic, and RoRF-KPCA is not sensitive to changes of the kernel function. A smaller value of M may also affect the classification accuracy, although a smaller M means a larger K, which leads to a higher computational complexity due to the construction of the kernel matrices. Regardless of the computation time, it can be expected that RoRF-KPCA can surpass RoF to some extent. It can also be seen that RoF-LFDA and RoF-NPE produced results similar to RoF. RoF-LFDA sometimes performed better than RoF and RoF-NPE when more samples were available, since it only uses the discriminative information of the labeled samples. In fact, no matter which single rotation method was used in RoF, the results were on the whole very close to each other. However, the SLDA-combined RoF method achieved relatively lower accuracies than the other RoF-based methods, although SLDA has been demonstrated to perform well with other conventional classifiers [36] (e.g., MLC, SVM); plain SLDA thus seems unsuitable as the rotation tool in rotation forest algorithms.
By contrast, the proposed SSRoF clearly outperformed the others in most cases in terms of both OA and Kappa values, especially on the Indian and Pavia data sets (on average, 4.35% and 1.45% higher than RoF for the Indian and Pavia data sets, respectively). Although the conventional RF and the RoF-based algorithms performed well on the last data set, the proposed algorithm still showed a slight superiority.
The main reason why the proposed SSRoF method surpasses RoF-LFDA and RoF-NPE is that SSRoF uses a weighted form to better explore the discriminative information and the structural information of the available samples, thus greatly promoting the diversity of features. In particular, aside from the number of ensembles (L) and the number of features per subset (M), the proposed approach requires fewer additional parameters, which makes it much easier to implement.

Impact of Parameters
In this sub-section, we discuss the impact of two basic parameters, i.e., the number of ensembles (L) and the number of features in each subset (M). For brevity, we only show the results on the Indian Pines and University of Pavia data sets, obtained by setting different numbers of trees, i.e., L = 2, 5, 10, 20, and 30. Likewise, the experiments were conducted under different numbers of training samples. The results are shown in Table 3. In order to give an intuitive evaluation, the OAs and Kappa values are shown in different colors.
From Table 3 we can see that, with the increase of the ensemble size, the overall accuracy and Kappa coefficient grow continuously, for instance from nearly 67% to 75% with 1% training samples for the Indian Pines data set, which demonstrates the benefit of EL. An interesting observation is that when the number of trees increases beyond 10, the classification accuracy grows more slowly and tends to converge. This makes our approach more promising, since we can use fewer ensembles to achieve a relatively stable result, thereby reducing the computational burden. To investigate the impact of the number of features in each subset, we also performed tests on the Indian Pines data set with different feature divisions. For better comparison, the same process was also applied to the RoF algorithm, and the results are shown in Figure 2, where the blue color denotes the OAs and the magenta color denotes the Kappa values; the solid lines denote the RoF method, while the dot-dash lines represent the SSRoF method. The figure indicates that when the number of features in each subset increases, i.e., the number of feature subsets (K) decreases, the classification results tend to degenerate for both RoF and SSRoF. In fact, this is consistent with the conclusions of [32], and it is why we selected a small M for the RoRF-KPCA method. Although this problem is alleviated to some extent when the training set increases (for instance, in Figure 2e, 91.48% for M = 5 and 90.94% for M = 30 (SSRoF) when 20% of the training samples were used), a small value of M is usually preferred. On the other hand, a smaller M means a larger K, which means the rotation process is executed more times, leading to a high computational cost. Apart from the above analysis, we can also see that the proposed approach is more stable than RoF as the number of features per subset increases.

Conclusions
Since existing rotation forest-based techniques fail to take into account the discriminative information of the training samples during feature extraction, this paper proposed a semi-supervised rotation forest that uses the weighted semi-supervised local discriminant analysis method to jointly utilize the class discriminative information and the local structural information provided by the labeled and unlabeled samples, respectively. The proposed algorithm aims to find projection directions that provide better class separability, thus enhancing the performance of existing rotation forest algorithms. Furthermore, the proposed algorithm needs no additional parameters compared with the classical rotation forest method, which makes it easy to implement. Experiments have shown that the proposed algorithm outperforms several typical ensemble learning methods. Our future work will aim to reduce the computational time and to incorporate other state-of-the-art machine learning algorithms.


Figure 1. Experimental hyperspectral and corresponding ground truth images.

Algorithm 1: Procedures of SSRoF.
Input: training samples X_L (labeled) and X_U (unlabeled), testing samples X_T, number of classifiers L, number of feature subsets K; initialize the ensemble to ∅. Output: class labels of X_T.
For i = 1 : L
1. Randomly split the features into K subsets;
For j = 1 : K, perform steps 2-7 as described above, and finally apply the majority voting of step 8.
(1) The first data set is the well-known scene taken in 1992 by the Airborne Visible Infrared Imaging Spectrometer (AVIRIS) sensor over the Indian Pines region in Northwestern Indiana. It has 144 × 144 pixels and 200 spectral bands, with a pixel resolution of 20 m. Nine classes, including different categories of crops, have been labeled in the ground truth image. (2) The second data set was collected over the University of Pavia, Italy, by the Reflective Optics

Table 1 .
Number of available samples in each data set.

Table 2 .
The overall accuracies (%) and Kappa coefficients of different algorithms.

Table 3 .
The classification results of SSRoF under different numbers of ensembles (L). OA: overall accuracy.