Multiclass Non-Randomized Spectral–Spatial Active Learning for Hyperspectral Image Classification

Abstract: Active Learning (AL) for Hyperspectral Image Classification (HSIC) has been extensively studied. However, traditional AL methods do not consider randomness among the existing and new samples. Secondly, very limited AL research has been carried out on joint spectral–spatial information. Thirdly, a minor but still noteworthy factor is the stopping criterion. This study therefore addresses all three issues using a spatial-prior fuzziness concept coupled with a Multinomial Logistic Regression via Splitting and Augmented Lagrangian (MLR-LORSAL) classifier with dual stopping criteria. This work further compares several sample selection methods across classifiers of a diverse nature, i.e., probabilistic and non-probabilistic. The sample selection methods include Breaking Ties (BT), Mutual Information (MI) and Modified Breaking Ties (MBT). The comparative classifiers include Support Vector Machine (SVM), Extreme Learning Machine (ELM), K-Nearest Neighbour (KNN) and Ensemble Learning (EL). The experimental results on three benchmark hyperspectral datasets reveal that the proposed pipeline significantly increases the classification accuracy and generalization performance. To further validate the performance, several statistical measures are also considered, such as Precision, Recall and F1-Score.


Introduction
Hyperspectral imaging (HSI) examines the light reflected by an object across a wide range of the electromagnetic spectrum instead of just associating primary colors with a pixel [1]. Light interacting with a pixel is divided into several bands in order to render complete information about a target [2]. Therefore, HSI has gained significant importance in many applications including, but not limited to, chemical imaging [3], agriculture [4], surveillance [5], remote sensing [6], household materials [7], and environmental sciences [8]. The main challenge of HSI analysis is the high dimensionality of the data, which results from a large number of bands with significantly high resolution across the electromagnetic spectrum. Therefore, the classification of HSI data is a complex and challenging task [9].
Some widely used supervised classification approaches for HSI analysis are Multinomial Logistic Regression (MLR) [10], Support Vector Machine (SVM) [11], Maximum Likelihood [10,12], Ensemble Learning (EL) [13,14], Random Forests (RF) [15,16], Deep Learning (DL) [17][18][19], Transfer Learning [20,21], k-Nearest Neighbors (KNN) [22] and Extreme Learning Machine (ELM) [23]. The major limitation of supervised HSI classification is poor performance due to the Hughes phenomenon [24]. It occurs when the number of labeled training samples available in hyperspectral data is small relative to the number of spectral bands [25]. The acquisition of the most informative labeled training examples is often an expensive and time-intensive task, as it generally requires human experts or a ground campaign [26].
Limited availability of reliable labeled training examples motivates the use of semi-supervised learning [27]. The basic idea of such a learning mechanism is that new training examples can be obtained from the unlabeled data, without considerable time and cost, by utilizing the limited labeled examples available [28,29]. Examples of such techniques include kernel techniques [30], EL techniques [31,32], Tri-training [33,34] and Graph-based learning [35]. However, the performance of these techniques is relatively low when only a limited number of reliable labeled training examples is available for high-dimensional datasets, e.g., hyperspectral datasets.
To cope with the aforesaid issues, one commonly used semi-supervised approach is to expand the initial training set by efficiently utilizing unlabeled data. This method is known as Active Learning (AL). It significantly improves the performance of classification techniques by adding new examples to the training set for the next cycle of training, until a stopping criterion, e.g., a required classification accuracy, is met. Depending on the criteria for adding new examples to the training set, AL techniques can be categorized as stream-based or pool-based, and the selection of new examples is based on ranking scores computed from measures such as representativeness, uncertainty, variance, inconsistency and error [6].
For instance, uncertainty measures consider the unlabeled examples that lie close to the class boundary of the current classification results to be more important. Representativeness treats as more significant those unlabeled examples that can represent a new group of examples, i.e., a cluster. Inconsistency treats as more useful those unlabeled samples that have high predictive divergence among multiple classifiers [36]. However, all these methods have one major limitation, which is randomness among the samples. The new training samples are always selected with certain criteria, without considering whether the selected samples are similar to the previously selected samples, which induces redundancy among the samples. The other main issue with many AL methods is their stopping criterion, which is almost always based solely on accuracy.
This paper proposes a multi-class non-randomized AL method based on the Multinomial Logistic Regression (MLR)-LORSAL classifier [31,37] in conjunction with fuzziness [6] as the sample selection method, whilst exploiting both the spatial and the spectral information of hyperspectral data. We further compare MLR-LORSAL against various well-known classifiers such as SVM, ELM, KNN and EL. Each classifier is further evaluated against three benchmark sample selection methods in addition to fuzziness. These sample selection methods include Breaking Ties (BT) [38], Mutual Information (MI) [39] and Modified Breaking Ties (MBT) [39]. The motivation of our current work is to investigate several state-of-the-art sample selection and classification techniques with predefined dual stopping criteria and to properly generalize them for remotely sensed hyperspectral datasets.
The paper is structured as follows. Section 2 discusses the pipeline proposed in this work. Section 3 presents the experimental process and performance measurement metrics. Section 4 contains the information regarding the experimental results and datasets. Finally, Section 5 concludes the paper with possible future directions.

Methodology
This work addresses the issue of a small training set when classifying high-dimensional multi-class HSI data by introducing a fuzziness-based MLR classifier which actively selects the data based on two main aspects: first, the sample's fuzziness, and second, the non-randomized selection of samples to avoid redundancy among them. Furthermore, we compared it with four benchmark AL techniques in association with four different classifiers. We evaluated the AL techniques on a pool of diverse samples in each iteration, thus minimizing the redundancy among selected samples.

Hyperspectral Data Formulation
The symbols used in this work are listed in Table 1. Let us assume that a hyperspectral dataset can be expressed as $X = [x_1, x_2, x_3, \ldots, x_L]^T \in \mathbb{R}^{L \times (M \times N)}$, consisting of $M \times N$ samples per band, associated with $C$ classes, with $L$ bands in total. Each sample is represented as $(x_i, y_j)$, where $y_j$ is the class label of the sample $x_i$; in a nutshell, the $i$th sample belongs to the $j$th class. Furthermore, suppose that $n = 50$ initial labeled training samples (giving equal representation to each class) are chosen from $X$ to form the training set $X_T = \{(x_i, y_j)\}$, where $i = 1, 2, 3, \ldots, n$ and $j = 1, 2, 3, \ldots, C$. The remaining $m$ samples make up the validation set $X_V$. It should be noted that $n \ll m$ and $X_T \cap X_V = \emptyset$.
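The data layout described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the (M, N, L) cube layout, the convention that label 0 means "unlabeled", and the fixed random seed are all assumptions made for the example.

```python
import numpy as np

def flatten_cube(cube):
    """Reshape an (M, N, L) cube into an L x (M*N) matrix, one column per pixel."""
    m, n, bands = cube.shape
    return cube.reshape(m * n, bands).T

def balanced_initial_split(labels, n_init=50, seed=0):
    """Pick about n_init labeled pixels with near-equal class representation.

    labels: 1-D array of class ids per pixel (0 = unlabeled, an assumed convention).
    Returns (train_idx, valid_idx), two disjoint index arrays over labeled pixels.
    """
    rng = np.random.default_rng(seed)
    classes = np.unique(labels[labels > 0])
    per_class = max(1, n_init // len(classes))
    train = []
    for c in classes:
        idx = np.flatnonzero(labels == c)
        take = min(per_class, len(idx))  # small classes contribute what they have
        train.extend(rng.choice(idx, size=take, replace=False))
    train_idx = np.sort(np.array(train))
    labeled = np.flatnonzero(labels > 0)
    valid_idx = np.setdiff1d(labeled, train_idx)  # X_T and X_V stay disjoint
    return train_idx, valid_idx
```

With 16 classes and n = 50, this sketch draws three samples per class (48 in total); how the remainder is distributed is not specified in the text and is left out here.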

Multinomial Logistic Regression via Splitting and Augmented Lagrangian (MLR-LORSAL)
A non-probabilistic (NP) classifier modeled using $X_T$ and tested on $X_V$ outputs a matrix $\mu$ of $m \times C$ dimensions whose entries correspond to the NP outputs (i.e., least-squares estimates) of the classifier. Let $\mu_{ij}$ represent the membership of the $i$th sample for the $j$th class. Various methods have been proposed for transforming the least-squares estimates into posterior probabilities [36,38]. These methods are computationally expensive for two main reasons: first, in order to estimate $y_i$, they need to approximate the posterior distribution within a Bayesian framework; secondly, they restrict the training output to lie between 0 and 1. Such methods compute the approximated posterior probabilities in the form of least-squares regression, i.e., $f(x_i, \mu_j) = p(\mu_j, x_i)$ [40]. An alternative to the least-squares approach that outputs probabilities in the $[0, 1]$ range is MLR, which is computed as in Equation (1) [40,41], where $h(x) = [h_1(x), \ldots, h_L(x)]^T$, usually termed the features, is a vector of $L$ fixed functions of the input and $w = [w_1, \ldots, w_{L-1}]^T$ collects the regressors. Since the density function in Equation (1) is independent of translations of the regressors, we take $w_L = 0$. The posterior probabilities are then computed using the LORSAL algorithm, as in [37]. The components of $h(\cdot)$ are symmetric kernel functions; the Radial Basis Function (RBF) kernel has been widely used for HSI classification because it improves data separability in the transformed space [41,42].
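For orientation, the standard MLR posterior and the usual RBF kernel can be written as below. These are the textbook forms and are given here as an assumption about what the paragraph above refers to; the class index $c$, the superscript notation $w^{(c)}$, the use of $C$ for the number of classes, and the kernel width $\rho$ are notational choices made for this sketch and are not taken from the source.

```latex
p(y_i = c \mid x_i, w) =
  \frac{\exp\!\left(w^{(c)\,T} h(x_i)\right)}
       {\sum_{k=1}^{C} \exp\!\left(w^{(k)\,T} h(x_i)\right)},
\qquad
K(x_i, x_j) = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^{2}}{2\rho^{2}}\right)
```

Adding the same vector to every $w^{(k)}$ multiplies numerator and denominator by the same factor, so the posterior is unchanged. This is the translation invariance that allows one regressor to be fixed at zero.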
From the RBF output, a membership matrix $\mu_{ij}$ is obtained, which must satisfy the properties given in [6,25], where $\mu_{ij}$ represents the membership of the $i$th sample for the $j$th class [25]. If the approximated posterior probability is close to 1, it indicates the true class, whereas a probability close to 0 indicates a wrong class. AL techniques do not need exact probabilities; they only require a ranking of the approximated probabilities, which makes it easier to calculate the fuzziness [6].
The fuzziness $F(\mu_{ij})$ of the membership matrix $\mu_{ij}$ can be expressed as in Equation (3), and it should satisfy the properties defined in [6,43], where $\mu_{ij}$ and $\sigma_{ij}$ are fuzzy sets. We first associate $F(\mu)$ with the actual and predicted class labels of the validation set $X_V$. This set is then sorted in descending order of the $F(\mu)$ values. After that, $\tilde{m}$ misclassified samples having higher fuzziness values (i.e., $F(\mu) \gg 0.7$ and $\tilde{m} \ll m$) are heuristically selected. We then randomly select a reference training sample ($\alpha_j$) from each class and compute the spectral angle using a Spectral Angle Mapper, which takes the $\cos^{-1}$ of the normalized dot product between each test sample and the reference sample. Here, X preserves the information from those samples that have the maximum distance within the same class, and X denotes the index of the unlabeled sample that will be included in the pool. Note that we use a soft thresholding scheme to balance the number of classes in both the training and the selected samples. This process picks the samples in a non-randomized way to avoid redundancy not only among the pool of new samples but also with the samples that have already been added to the training set. We follow the strategy of keeping the pool of $\tilde{m}$ new samples balanced, which gives equal representation to all classes by softening the thresholds at run time. The complete pipeline is presented in Algorithm 1.
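The two quantities used in this selection step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's Equation (3): the fuzziness is taken in the common De Luca-Termini form with base-2 logarithms, averaged over classes for each sample, and the spectral angle is the standard Spectral Angle Mapper formula. All function names and the default threshold argument are hypothetical.

```python
import numpy as np

def fuzziness(mu, eps=1e-12):
    """Per-sample fuzziness of an (m, C) membership matrix (assumed De Luca-Termini form)."""
    mu = np.clip(mu, eps, 1.0 - eps)  # avoid log(0)
    terms = mu * np.log2(mu) + (1.0 - mu) * np.log2(1.0 - mu)
    return -terms.mean(axis=1)

def high_fuzziness_misclassified(mu, y_true, threshold=0.7):
    """Indices of misclassified samples with fuzziness above threshold, most fuzzy first."""
    f = fuzziness(mu)
    wrong = mu.argmax(axis=1) != y_true
    idx = np.flatnonzero(wrong & (f > threshold))
    return idx[np.argsort(-f[idx], kind="stable")]

def spectral_angle(samples, reference):
    """Angle in radians between each row of `samples` and one reference spectrum."""
    num = samples @ reference
    den = np.linalg.norm(samples, axis=1) * np.linalg.norm(reference)
    return np.arccos(np.clip(num / den, -1.0, 1.0))
```

Under this assumed form, a membership row such as (0.5, 0.5) has the maximum fuzziness of 1, while a confident row such as (1, 0) has fuzziness close to 0. Within a class, candidates with a larger spectral angle to the reference sample are the more dissimilar ones.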

Algorithm 1: Pipeline of Proposed Algorithm.
Data: $X_T$, $X_V$ (training and test set, respectively).
Pick X samples from $X_V$, add them to $X_T$, and remove them from $X_V$;
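The overall pick, add and remove cycle can be sketched in a few lines. This is a simplified illustration under several assumptions and should not be read as Algorithm 1 itself: a nearest-centroid softmax (`centroid_posteriors`) stands in for MLR-LORSAL, candidates are ranked by fuzziness only, and the misclassification filter, the spectral-angle step and the class balancing described earlier are omitted. All names and default values are hypothetical.

```python
import numpy as np

def centroid_posteriors(X_train, y_train, X_test, n_classes):
    """Hypothetical stand-in classifier: softmax over negative distances to class means."""
    d = np.full((X_test.shape[0], n_classes), np.inf)
    for c in range(n_classes):
        members = X_train[y_train == c]
        if len(members):
            d[:, c] = np.linalg.norm(X_test - members.mean(axis=0), axis=1)
    z = -d
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sample_fuzziness(mu, eps=1e-12):
    """Per-sample fuzziness (assumed De Luca-Termini form, base-2 logs)."""
    mu = np.clip(mu, eps, 1 - eps)
    return -(mu * np.log2(mu) + (1 - mu) * np.log2(1 - mu)).mean(axis=1)

def active_learning(X, y, train_idx, pool_idx, n_classes,
                    batch=50, target_acc=0.85, max_train=2500):
    """Grow the training set from the pool until a stopping rule fires.

    X: (n_samples, n_bands) spectra; y: integer labels in [0, n_classes).
    Returns (train_idx, history), where history holds (train size, accuracy) pairs.
    """
    train_idx = np.asarray(train_idx)
    pool_idx = np.asarray(pool_idx)
    history = []
    while True:
        mu = centroid_posteriors(X[train_idx], y[train_idx], X[pool_idx], n_classes)
        acc = float((mu.argmax(axis=1) == y[pool_idx]).mean())
        history.append((len(train_idx), acc))
        # Stop on the accuracy target, the training budget, or a nearly empty pool.
        if acc >= target_acc or len(train_idx) >= max_train or len(pool_idx) <= batch:
            return train_idx, history
        # Move the most fuzzy pool samples into the training set.
        picked = np.argsort(-sample_fuzziness(mu))[:batch]
        train_idx = np.concatenate([train_idx, pool_idx[picked]])
        pool_idx = np.delete(pool_idx, picked)
```

The classifier is passed nothing but indices into `X` and `y`, so swapping in a different model only requires replacing `centroid_posteriors` with a function that returns an (m, C) matrix of posteriors.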

Experimental Process
The performance of our proposed pipeline is validated on 3 benchmark HSI datasets that are publicly available and acquired by two different sensors, i.e., Reflective Optics System Imaging Spectrometer (ROSIS) and Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor. These 3 datasets are Salinas, Indian Pines and Pavia University (PU). More information on these datasets can be found in [9,44].
Furthermore, we evaluated our proposed pipeline against five diverse classifiers: Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Ensemble Learning (EL) methods, i.e., Gradient Boost (GB) and Logistic Boost (LB), and Extreme Learning Machine (ELM) [45]. We chose these classifiers for comparison because they have been broadly studied for HSI classification. The performance of our proposed pipeline and the aforementioned classifiers is evaluated using two benchmark metrics: Overall Accuracy (OA) and the kappa (κ) coefficient [25]. Furthermore, to check the statistical significance of our proposed pipeline, several statistical measures are taken into account, such as Precision, Recall and F1-score. All these metrics are computed from the confusion matrix, where TP, FP, TN and FN are the true positives, false positives, true negatives and false negatives, respectively. Moreover, to show that our proposed pipeline is commensurable, we evaluated it against three well-known sample selection methods: Mutual Information (MI), Breaking Ties (BT) and Modified Breaking Ties (MBT).

1. Mutual Information (MI): Selects the samples by maximizing the mutual information between the classifier and the class labels, and can obtain samples from the complicated region [39].
2. Breaking Ties (BT): Selects the samples that minimize the difference between the two highest class posterior probabilities, and can choose samples from the boundary regions [46]. In the multiclass scenario, BT can be utilized by finding the difference between the two most probable classes.
3. Modified Breaking Ties (MBT): Adds more diverse samples as compared to MI and BT. The MBT algorithm follows two important steps: first, it selects samples from the unlabeled pool with the same maximum a posteriori (MAP) estimation; then, it chooses the samples from the most complicated region [46].
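Of the three methods listed above, the BT rule is the simplest to illustrate. The sketch below ranks samples by the gap between their two largest posteriors; the function name and the choice to return the k smallest gaps are assumptions made for the example.

```python
import numpy as np

def breaking_ties(posteriors, k):
    """Indices of the k samples with the smallest gap between the two largest posteriors."""
    top2 = np.sort(posteriors, axis=1)[:, -2:]  # second-largest and largest per row
    margin = top2[:, 1] - top2[:, 0]
    return np.argsort(margin, kind="stable")[:k]
```

A small margin means the classifier can barely separate its two most probable classes for that sample, which is why such samples tend to lie near class boundaries.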
In all experiments, 50 samples were selected from the whole HSI data for the initial training dataset, and in each iteration of the AL process, h = 50 samples are actively added to the training dataset. The process is repeated until the desired accuracy of ≥85% is achieved for at least one classifier or the training set size reaches 2500 samples.
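The dual stopping criterion described above amounts to a two-part check, sketched below with hypothetical names. The default values are the ones quoted in the text (85% accuracy, 2500 samples).

```python
def should_stop(accuracies, train_size, target=0.85, max_train=2500):
    """True when any classifier reaches the accuracy target or the budget is used up."""
    return max(accuracies) >= target or train_size >= max_train

# Worked arithmetic: starting from 50 samples and adding 50 per iteration,
# the 2500-sample budget is reached after (2500 - 50) / 50 = 49 iterations.
iterations_to_budget = (2500 - 50) // 50
```

The size limit guarantees termination even when no classifier ever reaches the accuracy target.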

Experimental Datasets and Results
For the purpose of experimental design, we used five-fold cross-validation to evaluate the performance of our proposed pipeline. From the results, one can conclude that our proposed approach significantly enhances the classification performance with comparatively less computational time.
This section lists the experimental results obtained by several sample selection methods with various types of classifiers. These classifiers include Support Vector Machine (SVM), k-Nearest Neighbours (KNN), Ensemble Learning (EL) and Extreme Learning Machine (ELM). The important parameters of all these classifiers are carefully tuned and, to avoid bias, all the experiments are carried out under the same settings that maximize performance.
The KNN classifier is tested with the Euclidean distance function and K = [2:1:20]. The SVM classifier is tested with the RBF kernel function. The EL methods, i.e., both Logistic Boost (LB) and Gentle Boost (GB), are trained using a tree template with 100 trees and a leaf size of "number of rows in training set/10". The ELM classifier is tuned with [1:1:500] hidden neurons and a sigmoid activation function. Our main aim here is to compare the performance of the different classifiers with MLR-LORSAL.
The experiments described below show the κ evaluation for the above-discussed classifiers with a different number of training samples in each iteration for several HSI datasets. The number of selected training samples is set as shown in the following figures, i.e., initially, 50 samples are used to train the model and, in each iteration, 100 new non-random samples are selected using the proposed strategy. Based on the experimental results shown in the following figures, the MLR-LORSAL classifier performs better than the other state-of-the-art classifiers. From the following results, one can conclude that, with a different number of training samples, there is a significant improvement in classification accuracy.
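For reference, OA, κ, Precision, Recall and F1-score can all be derived from a confusion matrix, as sketched below using the standard textbook definitions. The function name and the row/column convention (rows are true classes, columns are predicted classes) are assumptions made for this example, and the matrix is expected to hold raw sample counts.

```python
import numpy as np

def classification_metrics(cm):
    """OA, kappa and per-class precision/recall/F1 from a confusion matrix of counts.

    cm[i, j] counts samples of true class i predicted as class j (an assumed layout).
    """
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp  # predicted as the class but belonging elsewhere
    fn = cm.sum(axis=1) - tp  # belonging to the class but predicted elsewhere
    oa = tp.sum() / total
    # Expected chance agreement from the row and column marginals.
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total**2
    kappa = (oa - pe) / (1.0 - pe)
    precision = tp / np.maximum(tp + fp, 1.0)  # empty classes yield 0 instead of NaN
    recall = tp / np.maximum(tp + fn, 1.0)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return oa, kappa, precision, recall, f1
```

For the two-class matrix [[40, 10], [5, 45]], the observed agreement is 0.85 and the chance agreement is 0.5, so κ = (0.85 − 0.5) / (1 − 0.5) = 0.7.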

Computational Cost
Here, we first list the computational time of the MLR-LORSAL classifier for each query function used in this work. One can observe that the computational time gradually increases as the number of training samples increases, as shown in Figure 1.

Experimental Results on Indian Pines Dataset
The Indian Pines dataset was collected over northwestern Indiana's test site, Indian Pines, by the AVIRIS sensor and comprises 145 × 145 pixels and 224 bands in the wavelength range (0.4-2.5) × 10⁻⁶ m. This dataset consists of two-thirds agricultural area and one-third forest or other naturally evergreen vegetation. A railway line, two dual-lane highways, low-density buildings and housing, and small roads are also part of this dataset. Furthermore, some crops in the early stages of their growth are also present, with less than 5% of total coverage. The ground truth comprises a total of 16 classes, but not all of them are mutually exclusive. The number of spectral bands is reduced from 224 to 200 by removing the water absorption bands.
Appl. Sci. 2020, 10, 4739
Further details about the Indian Pines dataset can be found in [44]. Table 2 and Figure 2 present the accuracies for the different classifiers as well as for the different sample selection methods. From the comparative results, one can see that fuzziness together with the MLR-LORSAL classifier works better than the other sample selection and classification methods. We also present the Indian Pines predicted geographical maps (classification maps) in Figure 3. The geographical locations of each predicted class label validate the superiority of our proposed pipeline. Figure 3 shows the complete performance assessment of the experimental results, with profound improvement. As shown in the figure, the classification maps generated by adopting the proposed pipeline are less noisy and more accurate.


Experimental Results on Salinas
The Salinas full scene was gathered over Salinas Valley, California, by the AVIRIS sensor. It comprises 512 × 217 pixels per band and a total of 224 bands with a 3.7 m spatial resolution. It consists of vineyard fields, vegetables and bare soils and contains sixteen classes. A few water absorption bands (108-112, 154-167 and 224) are removed from the dataset before analysis.
Further details about the Salinas dataset can be found in [44]. Table 3 and Figure 4 present the accuracies for the different classifiers as well as for the different sample selection methods. From the comparative results, one can see that fuzziness together with the MLR-LORSAL classifier works better than the other sample selection and classification methods. We also present the Salinas predicted geographical maps (classification maps) in Figure 5. The geographical locations of each predicted class label validate the superiority of our proposed pipeline. Figure 5 shows the complete performance assessment of the experimental results, with profound improvement. As shown in the figure, the classification maps generated by adopting the proposed pipeline are less noisy and more accurate.

Experimental Results on Pavia University
The Pavia University dataset was gathered over Pavia, in northern Italy, by the ROSIS optical sensor during a flight campaign. It consists of 610 × 610 pixels and 103 spectral bands with a spatial resolution of 1.3 m. Some samples in this dataset provide no information and are removed before analysis.
Further details about the Pavia University dataset can be found in [44]. Table 4 and Figure 6 present the accuracies for the different classifiers as well as for the different sample selection methods. From the comparative results, one can see that fuzziness together with the MLR-LORSAL classifier works better than the other sample selection and classification methods. We also present the Pavia University predicted geographical maps (classification maps) in Figure 7. The geographical locations of each predicted class label validate the superiority of our proposed pipeline. Figure 7 shows the complete performance assessment of the experimental results, with profound improvement. As shown in the figure, the classification maps generated by adopting the proposed pipeline are less noisy and more accurate.
Version July 5, 2020 submitted to Appl. Sci.
The proposed pipeline significantly improves the performance as compared to the state-of-the-art deep models with the same number of training samples [18].

The comparative methods are implemented on an online platform commonly known as Google

Results Discussion
In all the experiments shown above, the performance of the proposed pipeline is evaluated in two stages: first, we analyze several sample selection methods for the MLR-LORSAL classifier; then, we compare several classifiers for the same sample selection methods. In all these experiments, the initial training set is fixed at 50 samples, randomly selected from all the classes while giving equal representation to each class. In each iteration, 100 new samples are selected based on the sample selection method and their spatial locations, i.e., the newly selected samples should not be spatially close to the previously selected samples. This constraint significantly helps to limit the redundancy among the training samples.
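The spatial constraint described above can be illustrated with a greedy filter over ranked candidates. This is a sketch under assumptions and not the selection rule used in the paper: the Euclidean pixel distance, the `min_dist` threshold and the function name are all hypothetical choices, since the text does not state how spatial closeness is measured.

```python
import numpy as np

def spatially_diverse_pick(candidates, coords, taken_coords, k, min_dist=2.0):
    """Greedily keep ranked candidates lying at least min_dist pixels from every
    previously selected location.

    candidates: sample indices sorted by informativeness (best first).
    coords: (n, 2) array of (row, col) positions for all samples.
    taken_coords: (t, 2) positions already in the training set.
    """
    kept = []
    occupied = [tuple(p) for p in np.asarray(taken_coords, dtype=float)]
    for idx in candidates:
        p = coords[idx].astype(float)
        # Accept only if far enough from all earlier picks and training samples.
        if all(np.hypot(p[0] - q[0], p[1] - q[1]) >= min_dist for q in occupied):
            kept.append(int(idx))
            occupied.append(tuple(p))
            if len(kept) == k:
                break
    return kept
```

Because each accepted sample is added to the occupied set immediately, the filter limits redundancy both against the existing training set and within the newly selected batch.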
The experiments are repeated on datasets with more complicated and nested regions, such as the Pavia University dataset. The number of training samples selected in each iteration is shown in the above figures. Based on the results, one can conclude that the fuzziness-based sample selection method is competitive and worked slightly better than the other well-known sample selection methods. One can conclude from the results shown in Tables 2-4 and Figures 2-7 that all the sample selection methods performed well; however, BT and fuzziness-based samples boost the accuracy of the MLR-LORSAL classifier the most, followed by MI. Across several observations with a different number of training samples, there is a slight improvement in performance using MBT, whereas fuzziness and BT improve the generalization markedly. The fuzziness-based sample selection process works better because these samples are usually close to the classification boundary.
This work started evaluating the hypotheses with 50 randomly selected samples; it is well known that randomly adding samples back to the training set based on spectral information alone does not increase the accuracy as desired. Therefore, this work explicitly fuses the spatial information when considering new samples for the training set, which significantly boosts the accuracy and generalization performance of a classifier. Furthermore, this work also validates the number of samples required to train a classifier, i.e., 500-1000 samples are more than enough to train a classifier to produce an acceptable accuracy for HSI classification tasks.

Conclusions
A novel AL method for HSI datasets has been explored in this study to overcome the limitations of randomness by using spatial-spectral information and a fuzziness-based MLR-LORSAL classifier with predefined dual stopping criteria. Extensive comparisons with state-of-the-art sample selection and classification methods have been carried out. Furthermore, several statistical measures were considered to validate the claim that the fuzziness-based MLR-LORSAL classifier outperforms other state-of-the-art classifiers. In short, this work investigated different sample selection techniques and classifiers to properly generalize them to the classification of remotely sensed HSIs with multiclass problems. The experimental results on three benchmark hyperspectral datasets reveal that the proposed pipeline significantly increases the classification accuracy and generalization performance.