Active Semi-Supervised Random Forest for Hyperspectral Image Classification

Abstract: Random forest (RF) has obtained great success in hyperspectral image (HSI) classification. However, RF cannot leverage its full potential in the case of limited labeled samples. To address this issue, we propose a unified framework that embeds active learning (AL) and semi-supervised learning (SSL) into RF (ASSRF). Our aim is to utilize AL and SSL simultaneously to improve the performance of RF. The objective of the proposed method is to use a small number of manually labeled samples to train classifiers with relatively high classification accuracy. To achieve this goal, a new query function is designed to query the most informative samples for manual labeling, and a new pseudolabeling strategy is introduced to select some samples for pseudolabeling. Compared with other AL- and SSL-based methods, the proposed method has several advantages. First, ASSRF utilizes the spatial information to construct a query function for AL, which can select more informative samples. Second, in addition to providing more labeled samples for SSL, the proposed pseudolabeling method avoids the bias caused by AL-labeled samples. Finally, the proposed model retains the advantages of RF. To demonstrate the effectiveness of ASSRF, we conducted experiments on three real hyperspectral data sets. The experimental results show that our proposed method outperforms other state-of-the-art methods.


Introduction
Hyperspectral remote sensing can obtain a great deal of information about an object via hundreds of narrow, continuous spectral bands. Hyperspectral imaging techniques have been widely used in many applications, such as landmine detection [1], agricultural monitoring [2], land cover classification [3], and target detection [4]. Many of these applications are based on hyperspectral image (HSI) classification at the pixel level. In the past few years, various supervised classification methods, e.g., support vector machines (SVMs) [5,6], neural networks [7,8], and random forests (RFs) [9][10][11] have been successfully used for HSI classification. However, supervised methods often require many informative samples with labels to train high-performing classifiers. In other words, the quality and quantity of the training data are very important for training good classifiers [12]. However, labeling samples manually requires significant labor; hence, we need a classifier that can perform well with only a few labeled samples. Semi-supervised learning (SSL) [13] and active learning (AL) [14,15] provide promising solutions to improve generalization performance in the case of limited samples. In this paper, we consider combining AL and SSL into random forest for HSI classification to improve the classification performance. In addition, we add a spectral-spatial constraint into the query function for AL, which makes use of the spectral-spatial relationship of the candidate samples.
This paper proposes an HSI classification framework that embeds active learning and semi-supervised learning into random forest. The proposed method, termed active semi-supervised random forest (ASSRF), collaboratively utilizes AL and SSL to improve generalization performance. In active learning, a new query function, termed decision uncertainty with a spectral-spatial constraint (DUSSC), is proposed to select the most informative and diverse samples for manual labeling. It considers the uncertainty of the decision classes of the candidate samples and the degree of confusion in the neighborhood spectra of the candidate samples. In other words, the samples with the highest uncertainty in the decision class and the most confused neighborhood spectral information are selected for manual labeling. To investigate the structural information of the data, supervised clustering is adopted to divide the unlabeled samples into two parts, one for active learning and the other for pseudolabeling. To avoid the bias caused by AL-labeled samples, we assign pseudolabels to some unlabeled samples. Specifically, a random forest classifier trained on previously labeled data is used to classify the samples, and only the samples with high classification confidence are assigned pseudolabels. Experimental results on three public hyperspectral data sets verify the effectiveness of our proposed method.
The main contributions of our work are as follows: (1) A new query function considering spectral-spatial information, termed DUSSC, is proposed for active learning. (2) A supervised clustering algorithm is used to mine the structure of the data and to divide the data between active learning and pseudolabeling. (3) We assign pseudolabels to some unlabeled samples to avoid the bias caused by AL-labeled samples. (4) A unified framework embedding AL and SSL into random forest is proposed for HSI classification.
The rest of this paper is organized as follows. Section 2 introduces related work, including semi-supervised random forest, active learning, and clustering techniques. The proposed method is described in Section 3. Section 4 presents the experimental results. The discussion is reported in Section 5. Finally, in Section 6, we present the conclusions of our study and introduce several topics for future research.

Semi-Supervised Random Forest
Random forest has many advantages, including high speed, strong parallelism, noise robustness, and an inherently multi-class nature. Due to these features, it is widely used in remote sensing image analysis [9][10][11][17][18][19][20][21][22][23][24]. However, RF suffers from the same problem as other popular classification methods: it requires many labeled samples to leverage its full potential. To address this issue, Leistner et al. [46] proposed semi-supervised random forest (SSRF), which makes use of both labeled and unlabeled samples to train the classifier. Amini et al. [47] successfully used SSRF for HSI classification. Next, we will briefly describe the principle of SSRF.
Many SSL methods use the unlabeled data to regularize the supervised loss functions. SSRF also regularizes the loss for the labeled samples through a loss over the unlabeled samples, where the loss is used to maximize the margins of the labeled and unlabeled samples. Hence, we need to know how to compute the margin of a sample in the RF method.
Breiman [16] defined the classification margin of a labeled sample $(x_l, y)$ as
$$m_l(x_l, y) = p(y|x_l) - \max_{k \in Y,\, k \neq y} p(k|x_l), \qquad (1)$$
where $p(y|x_l)$ is the probability of class $y$ given the sample $x_l$, and $p(k|x_l)$ represents the probability that the forest classifies the sample $x_l$ as belonging to class $k$.
For an unlabeled sample $x_u$, since there is no known true margin, Leistner et al. [46] defined the margin for $x_u$ as
$$m_u(x_u) = \max_{k \in Y} g_k(x_u), \qquad (2)$$
where $g_k$ is the margin for the $k$-th class and
$$g_k(x_u) = \frac{1}{1 + \exp(-p(k|x_u))}. \qquad (3)$$
Based on the above definitions of the margins for labeled and unlabeled samples, the overall loss can be written as
$$L = \sum_{(x_l, y) \in D_l} \ell\big(m_l(x_l, y)\big) + \alpha \sum_{x_u \in D_u} \ell\big(m_u(x_u)\big), \qquad (4)$$
where $\alpha$ represents the contribution rate of the unlabeled samples and $\ell$ is a given loss function. Equation (4) is non-convex since it has two parts, namely the labeled and the unlabeled samples, to be optimized. Following [46], deterministic annealing (DA) is used to optimize Equation (4) by introducing a distribution $\hat{p}$ over the predicted labels of the unlabeled data and adding a controlled uncertainty into the optimization process. The new loss function with DA can be rewritten as
$$L_{DA} = \sum_{(x_l, y) \in D_l} \ell\big(m_l(x_l, y)\big) + \alpha \sum_{x_u \in D_u} \sum_{k=1}^{K} \hat{p}(k|x_u)\, \ell\big(g_k(x_u)\big) - T\, H(\hat{p}), \qquad (5)$$
where $T$ represents the temperature and $H(\hat{p}) = -\sum_{k=1}^{K} \hat{p}(k|x_u) \log \hat{p}(k|x_u)$ reflects the entropy over the predicted distribution.
To minimize Equation (5), the parameters $\hat{p}$ and $g$ are optimized alternately. We first fix the distribution $\hat{p}$ and optimize the model: for the fixed distribution, a label $\hat{y}_u$ is drawn randomly from $\hat{p}$ for each unlabeled sample, and the optimization objective for the $n$-th tree becomes
$$f_n^* = \arg\min_{f_n} \sum_{(x_l, y) \in D_l} \ell\big(m_l(x_l, y)\big) + \alpha \sum_{x_u \in D_u} \ell\big(m_l(x_u, \hat{y}_u)\big). \qquad (6)$$
In the second step, we use the trained forest to compute the optimal probability distribution, which can be obtained according to
$$\hat{p}^* = \arg\min_{\hat{p}} \sum_{x_u \in D_u} \sum_{k=1}^{K} \hat{p}(k|x_u)\, \ell\big(g_k(x_u)\big) - T\, H(\hat{p}). \qquad (7)$$
Finally, based on the procedure for solving Equation (7) in [46], the probability that each unlabeled sample $x_u$ belongs to class $k$ is
$$\hat{p}^*(k|x_u) = \frac{1}{Z(x_u)} \exp\left(-\frac{\ell\big(g_k(x_u)\big)}{T}\right), \qquad (8)$$
where $Z(x_u) = \sum_{k=1}^{K} \exp\big(-\ell(g_k(x_u))/T\big)$ is the partition function.
In the SSRF method, the labels of the unlabeled data are treated as variables to be optimized. When the temperature is high, the distribution over the unlabeled data is close to uniform; when the temperature is low, the distribution approximates a Dirac delta function. The main procedure of the DA-based SSRF method is thus to optimize the probability distribution through Equation (8). For more details about SSRF, please refer to [46].
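To make the DA procedure concrete, the following minimal sketch computes the labeled margin of Equation (1), the soft margin $g_k$ of Equation (3), and the DA distribution update of Equation (8). It assumes a forest that outputs per-class probabilities as a `proba` array; the squared loss used in `da_update` is an illustrative assumption, not a choice prescribed by the paper.

```python
import numpy as np

def labeled_margin(proba, y):
    """Breiman's margin for labeled samples: p(y|x) - max_{k != y} p(k|x).
    proba: (n_samples, n_classes) class probabilities from the forest."""
    p_true = proba[np.arange(len(y)), y]
    masked = proba.copy()
    masked[np.arange(len(y)), y] = -np.inf          # exclude the true class
    return p_true - masked.max(axis=1)

def soft_margin(proba):
    """g_k(x_u) = 1 / (1 + exp(-p(k|x_u))) for unlabeled samples."""
    return 1.0 / (1.0 + np.exp(-proba))

def da_update(proba, T, loss=lambda m: (1.0 - m) ** 2):
    """DA update of Equation (8): p_hat*(k|x_u) ∝ exp(-loss(g_k(x_u)) / T).
    The squared loss here is an assumption for illustration only."""
    g = soft_margin(proba)
    unnorm = np.exp(-loss(g) / T)
    return unnorm / unnorm.sum(axis=1, keepdims=True)  # normalize by Z(x_u)
```

As the temperature $T$ decreases toward zero, the exponent sharpens and the returned distribution concentrates on the class with the smallest loss, matching the Dirac-like behavior described above.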

Active Learning
Query functions play a very important role in AL methods because they determine which samples are selected for manual labeling, which directly affects the final performance. According to Demir et al. [48], good query functions must have two properties: (1) the most informative samples are queried and (2) the selected samples for manual labeling are highly diverse.
Many query functions have been adopted to select the most informative samples. The first strategy is uncertainty sampling [49], which tries to select samples close to the decision boundary. This strategy has been successfully used in SVM classification [48,50,51]. The second strategy is query by committee (QBC), which selects the samples with maximum disagreement in the committee of classifiers [52][53][54].
To speed up the whole learning process, batch mode AL methods have been widely used in HSI classification [42,43,48,55]. These methods aim to select a batch of samples at each iteration. Since multiple samples are selected for manual labeling, the diversity of the selected samples is critical. Most previous methods have been used for SVM classification, and they do not sufficiently consider the diversity of the selected samples.
Spatial information is important to HSIs and has been widely used in AL-based HSI classification [56][57][58][59][60][61]. For example, Shi et al. [56] proposed a spatial coherence-based batch-mode AL method, where the spatial coherence is represented by a two-level segmentation map. Demir et al. [57] designed a spatial density assessment function to localize small candidate areas for AL. The neighborhood information of the images was also considered by Xue et al. [58] to enhance the uncertainty of candidate samples. Guo et al. [60] integrated the spectral and spatial features extracted from superpixels into an AL framework. Patra et al. [61] proposed a novel query function that uses uncertainty, diversity, and cluster assumption criteria by exploiting the properties of three different types of classifiers trained on spectral-spatial features. The above methods mainly added spatial or spectral-spatial information into the AL-based classification model to analyze HSIs, but none of them considered using the spectral-spatial information to constrain the query functions in AL.

Clustering Methods in HSI Classification
Clustering is often used for unsupervised learning: it divides the data into several clusters, and the samples in the same cluster are grouped into one class. This technique can mine the structural information of the data without extra effort. Several clustering assumption-based AL methods have been used for HSI classification [6,48,62-64]. Demir et al. [48] proposed a kernel-clustering-based query function, which is used to assess the diversity of candidate samples. Patra and Bruzzone [62] proposed a cluster assumption-based method that selects the samples to be labeled from low-density regions. They also applied the cluster assumption to AL based on self-organizing map neural networks and SVM classifiers [6]. Volpi et al. [63] first used an uncertainty-based function to select samples and then grouped the selected samples with a clustering method; finally, only one representative sample in each cluster was manually labeled. Tuia et al. [64] segmented the whole image into hierarchical trees using a cluster-based hierarchical segmentation model and then manually labeled samples on the pruned trees.
The clustering assumption is equivalent to the low-density separation assumption, which holds that decision boundaries are located in low-density regions. The aim of AL is to help the algorithm partition the low-density regions well by manually labeling the informative and discriminative samples. Because AL focuses on labeling discriminative and informative samples while ignoring the samples that are easy to classify, it can easily bias the model after several iterations. We therefore use a clustering technique to mine discriminative samples as candidates for AL. Meanwhile, we assign pseudolabels to the samples with high classification confidence, which balances the bias caused by AL-labeled samples.
Supervised clustering methods, which introduce supervised information into unsupervised clustering, have achieved great success in the past decades. For example, Gaddam et al. [65] combined cascading k-means and the ID3 decision tree for anomaly detection, where the cascading k-means belongs to supervised clustering. Michel et al. [66] used a supervised clustering method to infer brain states from fMRI images. Ding et al. [67] adopted supervised clustering to mine feature-based hot spots. Supervised clustering was first used for remote sensing image classification by Wang et al. [43], where it mines representative information; their experimental results show that supervised clustering is very effective. The commonality of the above methods is that they all use supervised information during the clustering procedure.

Proposed ASSRF Method
Although SSRF can obtain labels for the unlabeled data through multiple iterations of the optimization procedure, it suffers when labeled data are limited, which affects the final performance because the iterative process depends on the initial labeled data. Our main goal is to collaboratively utilize AL and SSL to improve generalization performance.
The proposed method embeds active learning and semi-supervised learning into random forest. It uses manual labeling and pseudolabeling to obtain more labeled samples during the iterations, so the quality and quantity of the labeled samples improve over time, which allows the algorithm to optimize the unlabeled samples in the right direction. The query function is the most important part of active learning. To speed up the learning process, researchers [48,55,68] have proposed several batch-mode active learning methods, which select a batch of samples for manual labeling at each iteration. In batch-mode active learning, the diversity of the selected samples is very important. To ensure that the selected samples are sufficiently informative and diverse, we propose a new query function in this paper. To mine the structural information of the hyperspectral data, supervised clustering is adopted to identify the candidate samples for active learning. Since active learning focuses on samples near the decision boundaries, AL-labeled samples may exhibit bias after several iterations. Thus, our method assigns pseudolabels to unlabeled samples that have high classification confidence, which balances the distribution of labeled samples. In the next subsections, we introduce the proposed query function, then describe supervised clustering, and finally give the details of ASSRF classification.

Proposed Query Function for Active Learning
The goal of AL is to iteratively select the most informative samples for manual labeling so as to make the model more accurate. In this paper, we use entropy-based uncertainty sampling to select the samples to be manually labeled, where the uncertainty is measured over the probability distribution of a sample belonging to different classes. Our main idea is to increase the diversity among the samples selected by the query function. Therefore, a spectral-spatial constraint, assessed by the similarity between the candidate samples and their neighborhood samples, is added to the query function. According to this rule, candidate samples that are less similar to their neighborhood samples are likely located on spatial boundaries. The goal is to select the samples with the most uncertain decisions that also lie, as far as possible, on spatial boundaries. By adding the spectral-spatial constraint to the entropy-based uncertainty sampling rule, the whole query function can query the most uncertain and diverse samples for manual labeling.
We propose a new AL query function called decision uncertainty with a spectral-spatial constraint (DUSSC). DUSSC consists of two parts: one measures the uncertainty, and the other represents the spectral-spatial constraint. For a candidate sample $x$ used for active learning, we define the value of DUSSC as
$$DUSSC(x) = f_1(x) + \beta f_2(x). \qquad (9)$$
Let $f_1(x) = -\sum_{k=1}^{K} p(k|x) \log p(k|x)$, where $p(k|x)$ is the probability of predicting class $k$. $f_1(x)$ is an information entropy function, which represents the decision uncertainty: the greater the entropy, the more uncertain the decision label. When the probability distribution is almost uniform, $f_1(x)$ achieves a large value, which means that the decision label for sample $x$ is ambiguous. Let $f_2(x) = \frac{1}{N} \sum_{j=1}^{N} SID(x, x\_n_j)$, where $x\_n_j$ represents the $j$-th spatial neighbor of sample $x$. The spectral information divergence (SID) [69] is used to measure the spectral similarity of two samples. Meer [70] compared SID with other spectral similarity measures, including the spectral angle, the correlation coefficient, and the spectral correlation metric, and found that SID is superior to the other measures.
The SID between samples $x_i = (x_{i1}, \cdots, x_{iM})^T$ and $x_j = (x_{j1}, \cdots, x_{jM})^T$ is defined as
$$SID(x_i, x_j) = \sum_{t=1}^{M} p_t \log\frac{p_t}{q_t} + \sum_{t=1}^{M} q_t \log\frac{q_t}{p_t}, \qquad (10)$$
where $M$ represents the spectral dimensionality, $p_t = x_{it} / \sum_{m=1}^{M} x_{im}$ and $q_t = x_{jt} / \sum_{m=1}^{M} x_{jm}$, and $x_{it}$ and $x_{jt}$ represent the $t$-th elements of the vectors $x_i$ and $x_j$, respectively. The greater the value of $SID(x_i, x_j)$, the more dissimilar $x_i$ and $x_j$ are. In Equation (9), we use the average SID value between $x$ and its spatial neighborhood to represent the similarity of $x$ and its spatial neighbors. When the value of $f_2(x)$ is large, the neighborhood spectrum of $x$ is confused; in other words, the sample $x$ may fall on a spatial boundary. In contrast, a small value of $f_2(x)$ indicates that $x$ and its spatial neighbors may belong to the same region. Thus, the objective of Equation (9) is to query the samples whose decision labels are uncertain and whose spatial neighborhoods are spectrally confused. The parameter $\beta$ in Equation (9) controls the strength of the spectral-spatial constraint.
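A minimal sketch of the DUSSC score follows, under the reconstruction above (an entropy term plus a β-weighted mean SID over spatial neighbors); the normalization of spectra to probability vectors follows the standard SID definition, and the function names are ours.

```python
import numpy as np

def sid(x_i, x_j, eps=1e-12):
    """Spectral information divergence: symmetric KL divergence between
    two spectra normalized to probability vectors."""
    p = x_i / (x_i.sum() + eps)
    q = x_j / (x_j.sum() + eps)
    p, q = p + eps, q + eps                        # avoid log(0)
    return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

def dussc_score(proba_x, x, neighbors, beta=0.5, eps=1e-12):
    """DUSSC(x) = f1(x) + beta * f2(x): entropy of the predicted class
    distribution plus the mean SID to the spatial neighbors of x."""
    f1 = -np.sum(proba_x * np.log(proba_x + eps))          # decision uncertainty
    f2 = np.mean([sid(x, xn) for xn in neighbors])         # spectral-spatial term
    return f1 + beta * f2
```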
In the proposed active learning method, we select multiple samples one by one from the candidate pool at each iteration. The first sample is selected from the candidate pool using the query function DUSSC. Then, the spatial neighbors of the selected sample are removed from the pool to prevent them from being selected next, which guarantees the diversity of the selected samples in terms of their spatial relations. These steps are repeated until a batch of samples has been obtained for manual labeling.
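The one-by-one batch selection can then be sketched as follows, using the `dussc_score` sketch above; `neighbor_index(i)`, which returns the spatial-neighbor indices of pixel `i`, is a hypothetical helper.

```python
def select_batch(pool, proba, pixels, neighbor_index, h=10, beta=0.5):
    """Greedily pick h samples from the candidate pool by DUSSC score,
    removing each winner's spatial neighbors to keep the batch diverse."""
    pool = set(pool)
    batch = []
    while pool and len(batch) < h:
        best = max(pool, key=lambda i: dussc_score(
            proba[i], pixels[i],
            [pixels[j] for j in neighbor_index(i)], beta=beta))
        batch.append(best)
        # drop the selected sample and its spatial neighbors from the pool
        pool.difference_update({best}, neighbor_index(best))
    return batch
```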

Supervised K-Means Clustering
In ASSRF classification, a supervised clustering algorithm is used to mine the structure of the whole data set. Following Wang et al. [43], the k-means method is used for supervised clustering. Next, we introduce the detailed procedure of supervised k-means clustering.
The data set $D$ contains labeled data $L$ and unlabeled data $U$. First, $D$ is clustered into $k$ clusters using the k-means method, written as
$$D = \{D_1, D_2, \cdots, D_k\}, \qquad (11)$$
where $k$ is the number of class labels in $L$. Labeled samples may exist in each cluster $D_k$, so $D_k$ can be decomposed as
$$D_k = D_{k\_u} \cup D_{k\_l}, \qquad (12)$$
where $D_{k\_u}$ is the set of unlabeled samples in $D_k$ and $D_{k\_l}$ is the set of labeled samples in $D_k$. If the subset $D_{k\_l}$ is empty or all the samples in $D_{k\_l}$ share the same label, we stop clustering $D_k$. Otherwise, we cluster $D_k$ into $k'$ clusters, where $k'$ is the number of classes of the labeled samples in $D_{k\_l}$. In the end, all the clusters are pure; i.e., each cluster either contains no labeled samples or contains labeled samples of only one class. The detailed procedure for supervised clustering is described in Algorithm 1. Following Algorithm 1, we can partition the unlabeled samples into multiple clusters. Some clusters have no labeled samples, so these samples are added to the candidate pool for active learning. Although the clusters containing labeled samples of a single class could be assigned labels directly, the results of the clustering method are not always reliable. We therefore use a random forest classifier trained on the labeled samples to verify these clusters: the samples with high classification confidence are assigned pseudolabels, and the other samples are added to the candidate pool for active learning. In this way, representative samples are found for active learning, and verified samples with high classification confidence are assigned pseudolabels.

Algorithm 1 Supervised k-means clustering
Input: data set D containing labeled pixels L and unlabeled pixels U
1: Divide D into k clusters through k-means, where k is the number of classes of the labeled pixels L;
2: Repeat
3:   Count the labeled pixels in each cluster;
4:   If the labeled pixels in a cluster do not all belong to the same category, then
5:     Divide the impure cluster into k' clusters through k-means, where k' is the number of classes of the labeled pixels in this cluster;
6:   End if
7:   Generate a set of clusters;
8: Until each cluster either contains labeled pixels of only one class or contains no labeled pixels.
Output: the pure clusters.
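A compact sketch of Algorithm 1 follows, using scikit-learn's KMeans; the recursion on impure clusters mirrors steps 2-8, and returning one index array per pure cluster is our implementation choice.

```python
import numpy as np
from sklearn.cluster import KMeans

def supervised_kmeans(X, y):
    """Recursive supervised k-means (Algorithm 1). y uses -1 for unlabeled
    pixels. Returns a list of index arrays, one per pure cluster."""
    def split(idx):
        lab = y[idx]
        classes = np.unique(lab[lab != -1])
        if len(classes) <= 1:                  # pure: no labels, or one class
            return [idx]
        km = KMeans(n_clusters=len(classes), n_init=10).fit(X[idx])
        clusters = []
        for c in range(len(classes)):
            sub = idx[km.labels_ == c]
            if len(sub) == len(idx):           # degenerate split: stop early
                return [idx]
            if len(sub) > 0:
                clusters.extend(split(sub))
        return clusters
    return split(np.arange(len(X)))
```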

Details of ASSRF Classification
ASSRF is an iterative method that collaboratively utilizes AL and SSL to improve the classification performance. To improve the quality and quantity of labeled samples, we designed a new query function to select the most informative and diverse samples for manual labeling, and we assigned pseudolabels to the samples with high classification confidence. In this way, the labeled samples help SSL obtain a more accurate class probability distribution, which improves the accuracy of the model at the next iteration. ASSRF classification includes the following steps.
Initialization part (step 1): (1) Set the pseudolabeled data P and the manually labeled data M to empty, set the initial labeled data to L, and train the random forest RF on L.

Clustering part (steps 2-3):
(2) Use Algorithm 1 to divide the unlabeled data U into many pure clusters.
(3) The clusters that contain labeled samples are merged into set C 1 , and the clusters that do not contain labeled samples are merged into set C 2 .

Verification part (steps 4-5):
(4) Train a temporary random forest rf m on labeled data L.
(5) The samples in set C 1 are classified by rf m , and the g samples with the highest classification confidence are assigned pseudolabels. These g samples form the set P; let R = C 1 \ P.

AL part (steps 6-7):
(6) Let the candidate pool S = C 2 ∪ R. (7) Use the query function DUSSC to select h samples from S for manual labeling, and add them to M. The detailed procedure for ASSRF classification is described in Algorithm 2.
Algorithm 2 Active semi-supervised random forest classification
Input: a training data set D containing labeled pixels L and unlabeled pixels U; the size of the forest N; the batch size h for active learning; the batch size g of the pseudolabeled samples; an initial temperature T 0 ; and a cooling function c(T, m).
Initialization: the manually labeled data M = ∅ and the pseudolabeled data P = ∅.
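Since the body of Algorithm 2 was not recoverable from the text, the following is a hedged sketch of the main ASSRF loop as described in steps 1-7 above. `oracle_label` stands in for manual labeling, `supervised_kmeans` and `select_batch` are the sketches given earlier, and the final SSRF retraining with DA (Section 2) is only indicated by a comment.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def assrf_loop(X, init_labels, oracle_label, neighbor_index,
               n_iter=20, h=10, g=10, beta=0.5):
    """Sketch of the ASSRF loop (steps 1-7). init_labels maps pixel index
    to class; oracle_label(i) simulates manual labeling."""
    labeled = dict(init_labels)                 # L, grown by M each iteration
    pseudo = {}                                 # P, rebuilt each iteration
    for _ in range(n_iter):
        # Steps 2-3: supervised clustering, split clusters by label presence
        y_full = np.full(len(X), -1)
        for i, c in labeled.items():
            y_full[i] = c
        clusters = supervised_kmeans(X, y_full)
        c1 = [i for cl in clusters if (y_full[cl] != -1).any()
              for i in cl if y_full[i] == -1]   # unlabeled pixels near labels
        c2 = [i for cl in clusters if (y_full[cl] == -1).all() for i in cl]
        # Steps 4-5: verify C1 with a temporary forest, pseudolabel the top g
        rf_m = RandomForestClassifier(n_estimators=500).fit(
            X[list(labeled)], list(labeled.values()))
        if c1:
            proba_c1 = rf_m.predict_proba(X[c1])
            order = np.argsort(proba_c1.max(axis=1))[::-1]
            pseudo = {c1[j]: rf_m.classes_[proba_c1[j].argmax()]
                      for j in order[:g]}
            rest = [c1[j] for j in order[g:]]   # R = C1 \ P
        else:
            pseudo, rest = {}, []
        # Steps 6-7: query h samples from S = C2 ∪ R with DUSSC, label them
        pool = c2 + rest
        proba = rf_m.predict_proba(X)
        batch = select_batch(pool, proba, X, neighbor_index, h=h, beta=beta)
        labeled.update({i: oracle_label(i) for i in batch})
        # The DA-based SSRF of Section 2 would be retrained here on the
        # labeled and pseudolabeled data (omitted in this sketch).
    return labeled, pseudo
```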

Hyperspectral Image Data Sets
To evaluate the performance of the proposed methods, we used three public HSI data sets in our experiments.
(1) The Kennedy Space Center (KSC) data set was acquired by the NASA Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the KSC, Florida, on 23 March 1996. The original data has 224 spectral bands; we used only 176 bands in our experiments because the water-absorption bands and the bands with low signal-to-noise ratios were excluded. The data set contained 13 classes, was 512 pixels × 614 pixels in size, and had a spatial resolution of 18 m/pixel. There were a total of 5211 labeled pixels in the data set.
(2) The University of Pavia (PaviaU) data set was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor over the University of Pavia, Italy. After removing the noisy bands, 103 spectral bands were used in our experiments. The data set contained 9 classes, was 610 pixels × 340 pixels in size, and had a spatial resolution of 1.3 m/pixel.
(3) The Botswana (BOT) data set was acquired by the Hyperion sensor aboard the EO-1 satellite over the Okavango Delta, Botswana. The data set was 1476 pixels × 256 pixels in size and had a spatial resolution of 30 m/pixel. There were 3248 labeled pixels covering 14 classes in total.
The detailed class names and sample numbers for the three data sets are shown in Table 1. Three-band pseudocolor images of the three hyperspectral data sets and their corresponding reference maps are illustrated in Figures 1-3.

Experimental Setup
In each hyperspectral data set, we randomly divided the available samples into two parts (60% for training and 40% for testing). From the training data, we randomly selected 10 samples per class as the initial labeled samples; the remaining samples were used as unlabeled samples for active learning [43].
Several parameters need to be set in ASSRF classification. The random subspace ratio m was set to the square root of the number of spectral bands, which is the default value for RF [16]. The size of the forest N was set to 500, a reasonable value according to [9]. The parameter g, which controls the number of samples pseudolabeled at each iteration, was set to 10. Following Amini et al. [47], the parameter α in Equation (5) (the contribution of the unlabeled data in the training process) was set to 0.15, while the parameter β, which controls the strength of the spectral-spatial constraint, was set to a moderate value of 0.5 after several trials. The influence of these parameters on the classification performance is analyzed in the parameter sensitivity analysis.
The parameters of DA were set following Amini et al. [47]; i.e., the iteration epoch was set to 20, and the simple exponential cooling function in Equation (13), $c(T, m) = T_0 \exp(-m/T_c)$, was adopted to compute the parameter T in Equation (8), where $T_0 = 5$ is the initial temperature and $T_c = 5$ is a cooling constant. The parameter h controls the number of samples to be manually labeled at each iteration. Many researchers [42,43,48] have investigated this parameter, and following Wang et al. [43], we selected 10 as the batch size in our experiments.
We first compared ASSRF classification with RF and SSRF classification. In ASSRF classification, 10 samples were selected for manual labeling at each iteration, and a total of 200 samples were manually labeled after 20 iterations. To test the effectiveness of active learning, we used two types of training data for training the RF and SSRF. The first type contained only the initial labeled samples. The second type contained the initial labeled samples and 200 randomly selected labeled samples. RF 1 and SSRF 1 represent the classifiers that were trained on the first type of training data. RF 2 and SSRF 2 represent the classifiers that were trained on the second type of training data.
We then compared ASSRF classification with other state-of-the-art methods. These methods include CASSL [42], DRDbSSAL [43], MCLU-ECBD (multi-class level uncertainty enhanced clustering-based diversity) [48], MS-cSV (margin sampling by closest support vectors) [54,63], and EQB (entropy query by bagging) [54,63]. The DRDbSSAL and CASSL methods combine AL and SSL and both use the MCLU-ECBD query function. MS-cSV is a margin sampling method based on closest support vectors. EQB is an extension of the algorithm for QBC. The experimental results from Volpi et al. [63] showed that EQB is superior to MS-cSV in most cases. For all the compared algorithms, we used the default parameter settings in the corresponding papers.
Three measures, including the average accuracy (AA), the overall accuracy (OA), and the kappa coefficient (k), were used to evaluate the performance of the different methods. For each method, ten runs of experiments were executed on each data set to obtain the average results.

Comparison with RF and SSRF

Tables 2-4 show the class-specific accuracies, AAs, OAs, and kappa coefficients obtained by the different methods when applied to the KSC, PaviaU, and BOT data sets, respectively. The best results are highlighted in bold. We could obtain several findings from Tables 2-4. First, ASSRF achieved the best class-specific accuracy in most cases. Specifically, ASSRF achieved 11, 5, and 11 best class-specific accuracies on the KSC, PaviaU, and BOT data sets, respectively; the class-specific accuracies of the other classes obtained by ASSRF were only slightly lower than the best results. Second, on all the data sets, ASSRF achieved the best OAs, AAs, and kappa coefficients; SSRF obtained the second-best results; and RF obtained the worst results. Third, compared with RF 1 and SSRF 1 classification, RF 2 and SSRF 2 classification exhibited improved OA, AA, and kappa coefficient, since RF 2 and SSRF 2 use not only the initial labeled samples but also the extra 200 randomly labeled samples to train the classifiers. Fourth, SSRF classification performed better than RF classification because it benefits from a semi-supervised method that acquires the probability distribution of the training samples. Fifth, ASSRF significantly improved the classification performance because it uses a unified framework that combines AL and SSL to train the classifier.
Compared with SSRF 1 classification, SSRF 2 improved the accuracy slightly, while ASSRF improved it significantly. Although both SSRF 2 and ASSRF classification used an extra 200 labeled samples for training, the 200 labeled samples used in SSRF 2 and in ASSRF were different: those used in SSRF 2 were randomly selected from the training set before training the classifier, while those in ASSRF were selected iteratively by our proposed active learning method. The classification maps for the different methods on the KSC and PaviaU data sets are depicted in Figures 4 and 5, respectively.

Comparison with Other State-Of-The-Art Methods
In this section, we compared ASSRF with other state-of-the-art methods, including DRDbSSAL, CASSL, MCLU-ECBD, MS-cSV, and EQB, which were introduced in Section 4.2. To verify the impact of the number of AL-labeled samples on the performance of the different algorithms, we calculated the average OA on the testing sets over 10 runs as the number of labeled samples grew to 1000. Figure 6 shows the OAs obtained on the testing sample sets, and Tables 5-7 give quantitative evaluations of the different algorithms on the three hyperspectral data sets. The results in Figure 6 show that, for all compared methods on each hyperspectral data set, the average OA increased as the number of labeled samples increased. ASSRF consistently outperformed the other methods on the KSC and PaviaU data sets, regardless of how many samples were manually labeled. The average OA of ASSRF on the BOT data set was slightly lower than that of DRDbSSAL when the number of labeled samples was less than 400; however, with more labeled samples, the performance of ASSRF was slightly better than that of DRDbSSAL. The quantitative evaluations in Tables 5-7 also demonstrate that ASSRF obtained better performance than the other methods. The standard deviations of the accuracies indicate that ASSRF was more robust than CASSL, MCLU-ECBD, MS-cSV, and EQB and approximately as robust as DRDbSSAL.
All the methods in our experiments were implemented on the MATLAB 2015b platform with a 3.4 GHz Intel i7-6700 CPU and 8 GB RAM. Table 8 compares the training time of each method on the KSC and PaviaU data sets. The results indicate that MS-cSV was the most time-consuming method; MCLU-ECBD and EQB required the least time; and CASSL, DRDbSSAL, and ASSRF needed a medium computation time. In addition, the time cost of ASSRF was slightly lower than that of CASSL and DRDbSSAL. Overall, the computation time of ASSRF was acceptable.

Parameter Sensitivity Analysis
As discussed in the previous sections, the parameter α represents the contribution of the unlabeled data to the semi-supervised regularization, and the parameter β represents the strength of the spectral-spatial constraint on the decision labels. To explore the influence of these two parameters on the classification accuracy, we conducted experiments with different values of α and β; when analyzing one parameter, the other parameters were fixed. First, we varied α over the range [0.01, 0.5] with a step size of 0.01 and evaluated the overall classification accuracy. Then, we varied β over the range [0.1, 1] with a step size of 0.05. The influences of α and β on the classification accuracy for the three hyperspectral data sets are shown in Figures 7 and 8, respectively.

Figures 7 and 8 show that the accuracy of ASSRF first increased, reached a peak value, and finally decreased or plateaued. The main reason is as follows. When α is too small, the labeled data dominate SSL and the unlabeled data have little influence; when α is too large, the unlabeled data dominate, and the uncertainty becomes too large to optimize. For the parameter β, a moderate value makes the spectral-spatial constraint work best, while too-small or too-large values result in low accuracy. Thus, the value of α should be small but not too small, and the value of β should be moderate. We conclude that the recommended value for α ranges from 0.1 to 0.25 and the recommended value for β ranges from 0.35 to 0.55.
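The sensitivity analysis amounts to a simple one-at-a-time grid sweep; in the sketch below, `run_assrf_oa(alpha, beta)` is a hypothetical wrapper that trains ASSRF with the given parameter pair and returns the overall accuracy.

```python
import numpy as np

def sensitivity_sweep(run_assrf_oa):
    """Grid sweep matching Section 4.5: vary one parameter at a time.
    run_assrf_oa(alpha, beta) is a user-supplied function returning OA."""
    alphas = np.arange(0.01, 0.51, 0.01)   # contribution of unlabeled data
    betas = np.arange(0.10, 1.01, 0.05)    # spectral-spatial constraint strength
    oa_alpha = [run_assrf_oa(a, 0.5) for a in alphas]    # beta fixed at 0.5
    oa_beta = [run_assrf_oa(0.15, b) for b in betas]     # alpha fixed at 0.15
    return (alphas, np.array(oa_alpha)), (betas, np.array(oa_beta))
```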

Further Analysis of ASSRF
In this section, we analyze the contributions of two components of ASSRF to the improvement in classification performance. First, to evaluate the importance of the spectral-spatial constraint to the query function of AL, we used the query function without the spectral-spatial constraint. The query function degraded from Equation (9) is described as Equation (14):
$$f(x) = f_1(x) = -\sum_{k=1}^{K} p(k|x) \log p(k|x). \qquad (14)$$
Compared with Equation (9), the query function in Equation (14) contains only one part, i.e., the information entropy. This method is referred to as entropy-only. We conducted experiments with the entropy-only method to demonstrate the importance of the spectral-spatial constraint. Second, to verify the role of the supervised clustering algorithm in ASSRF, we conducted experiments with ASSRF without supervised clustering.
The results obtained on the KSC and PaviaU data sets are illustrated in Figure 9. We can observe that both the spectral-spatial constraint and the supervised k-means clustering algorithm played important roles in ASSRF, and that supervised k-means clustering contributed more to the accuracy improvement than the spectral-spatial constraint.

Discussion
The experiments on three real hyperspectral data sets revealed several interesting points.
• As shown in Sections 4.3 and 4.4, compared with the other methods, ASSRF achieved better classification performance. The good performance of ASSRF can be attributed to three reasons. First, supervised clustering can extract the structure of the whole data set and divide it into two parts, one for active learning and one for pseudolabeling. Second, the proposed query function DUSSC can select the most informative and diverse samples for manual labeling. Third, the unified framework combining AL and SSL can increase the learning performance by increasing the quantity and quality of the labeled samples.
• The results in Table 8 show that the computation time of ASSRF was acceptable. Several reasons explain this result. First, the training of the trees in the forest is parallel. Second, the time complexity of the k-means algorithm is linear, and the time cost of the supervised k-means method is at most m times that of k-means, where m is the number of labeled samples. Third, the DA-based SSL step has an analytical solution, which makes the SSL computation efficient.
• The parameter analysis in Section 4.5 shows that ASSRF was robust to its parameters. According to our experiments, the recommended value for α ranges from 0.1 to 0.25 and the recommended value for β ranges from 0.35 to 0.55.
• The experimental results in Section 4.6 show that both the spectral-spatial constraint and the supervised k-means clustering algorithm played important roles in ASSRF. This is mainly because ASSRF without supervised clustering does not provide enough pseudolabeled samples for the classification model; in other words, the model focuses on informative samples but ignores the samples that are easy to classify, which biases the model and affects the final accuracy. Another finding is that the spectral-spatial constraint was less important than supervised clustering.
• ASSRF is a unified framework combining AL and SSL in random forest. In general, ASSRF can be used for any data represented in vector form. However, the spectral-spatial constraint in the query function exploits the neighborhood structure of images, so for ordinary vector data, the entropy-only-based ASSRF is appropriate.

Conclusions
In this paper, we proposed an active semi-supervised random forest (ASSRF) classifier for HSI classification. ASSRF collaboratively utilizes active learning and semi-supervised learning to improve the final classification performance. To mine the structure of the whole data set, supervised clustering is used to categorize the unlabeled data. In addition, a new query function, DUSSC, is proposed to select the most informative and diverse samples for manual labeling. The proposed method was compared with random forest, semi-supervised random forest, and other state-of-the-art methods on three public hyperspectral data sets. Experiments on the KSC, PaviaU, and BOT data sets demonstrated that, compared with the state-of-the-art methods, the proposed ASSRF significantly improves the classification performance, especially on the KSC and PaviaU data sets. In addition, the computational cost of ASSRF is moderate and acceptable. Finally, the proposed method is robust to its parameters.
In future work, we will investigate the proposed unified framework with rotation forest [71]. Furthermore, we will consider introducing morphological properties into the proposed classifier for HSI classification.