Dual-Weighted Kernel Extreme Learning Machine for Hyperspectral Imagery Classification



Introduction
Hyperspectral remote sensing images contain rich spatial and spectral object information, covering the ultraviolet, visible, and near- and mid-infrared regions of the electromagnetic spectrum. For this reason, the ability to recognize and classify ground objects has greatly improved. The classification of hyperspectral images has become a hot research topic in recent years, with a considerable amount of research on hyperspectral image classification having been conducted. However, despite the rich information provided by hyperspectral images, their high dimensionality and non-linear characteristics make detailed classification difficult. Moreover, as the number of available training samples is typically small, the Hughes phenomenon [1] is often encountered during the supervised classification of hyperspectral images (HSI). To overcome the high-dimensionality problem, many methods have been introduced for HSI classification and have shown good performance, such as manifold learning, the support vector machine (SVM) [2], and composite kernel-based methods [3][4][5][6][7].
Recently, many deep learning methods have been employed for hyperspectral imagery classification tasks. H. Wu [8] proposed semi-supervised deep learning for hyperspectral image classification, in which limited labeled data and abundant unlabeled data are used to train a deep neural network. B. Pan [9] introduced a dilated semantic segmentation network, in order to avoid spatial information loss during the pooling operation; the network has an end-to-end structure, thus reducing time consumption. In [10], a deep learning method combining spatial and spectral information for HSI classification was successfully designed. An unsupervised spatial-spectral feature learning strategy using a 3-dimensional convolutional auto-encoder (3D-CAE) has been proposed for hyperspectral data [11], allowing the features of the hyperspectral image to be fully represented and the classification error rates to be reduced. As extended attribute profiles usually require manual parameter settings, Marpu [30] presented a technique to automatically produce extended attribute profiles under consideration of the standard deviation, where homogeneous regions were retained by the minimum and maximum values of the standard deviation. Recently, group intelligence algorithms have also been used: H. Su [31,32] proposed an extreme learning machine optimized by the firefly algorithm, where the parameters of the ELM were optimized by the proposed method. J. Li [33] presented an empirical linear relationship between the number of training samples and hidden nodes with a linear model. To improve the individual performance of a basic classifier, F. Lv [34] proposed a stacked auto-encoder ELM (SAE-ELM) model; the features were extracted by this model, while the Q statistic was adopted to determine the final results.

Spatial features provide subtle information, which helps discriminate different classes. As an excellent edge-preserving filter, guided image filtering [35], which was proposed by He, has been widely used in the fields of noise reduction, haze removal, and so on. B. Pan [36] proposed an ensemble framework where, by integrating many individual learners, better generalization can be achieved; to establish the ensemble model, hierarchical guidance filtering was employed. Y. Guo [37] attempted to develop two fusion methods for spectral and spatial features and, in order to obtain better results, adopted guided image filtering. Z. Wang [38] proposed a discriminative guided filtering framework which integrates a classifier with guided filtering. Guided image filtering establishes a local linear model between the guidance image and the output image, and implicitly completes the filtering of the input image by minimizing the difference function between the input image and the output image [35,39]. Inspired by these studies, guided image filtering is used here to extract spatial information, in order to further improve the accuracy of hyperspectral image classification (HSIC).
While these spatial-spectral ELM-based methods performed well, their performance can be further improved, as they ignored the imbalance of samples across classes in multiclassification tasks, which causes the majority samples to weaken the influence of the minority samples on the classification performance; thus, small-sized classes should be taken into consideration. Motivated by these observations, we propose a dual-weighted kernel extreme learning machine for hyperspectral image classification. On the one hand, different scales of spatial features extend the feature space, and the combination of multiscale spatial features enriches the diversity of samples, which may bring more information to our classification task. On the other hand, in an imbalanced data environment, the separating boundary tends to be pushed toward the side of the minority class, which in fact favors the performance of the majority class. To alleviate this suppression by the majority, we assign an extra weight to each sample, strengthening the impact of the minorities and weakening the impact of the majorities to some extent. The main contributions of this paper are summarized below: A spatial-spectral dual-weighted kernel extreme learning machine framework for hyperspectral image classification is proposed. As important spatial features can help to identify similar classes, the weighted summation of spatial and spectral kernels makes hyperspectral imagery classification feasible. In addition, the minority classes should not be ignored, as the majority classes may weaken the generalization performance on minorities. For this reason, the weighted extreme learning machine is employed, in order to counteract this imbalance problem.
The rest of the paper is organized as follows: In Section 2, the related works on single layer feed-forward networks, ELM and the weighted kernel ELM, and guided filters are introduced; furthermore, the proposed dual-weighted kernel ELM is described in detail. The experimental results and analysis are provided in Section 3. The conclusions of this paper are given in Section 4.

Single Layer Feed-Forward Networks

ELM is a fast-learning algorithm for single hidden layer neural networks, which randomly initializes the input weights and biases, saving a considerable amount of computation time. Meanwhile, the random input brings diversity to the samples.
For a single hidden layer neural network, we suppose that there are $N$ arbitrary samples $(x_i, y_i)$, where $x_i = [x_{i1}, \ldots, x_{id}]^T \in \mathbb{R}^d$ and $y_i = [y_{i1}, \ldots, y_{im}]^T \in \mathbb{R}^m$. A single hidden layer neural network with $L$ hidden nodes can then be expressed as

$$\sum_{i=1}^{L} \beta_i G(a_i, b_i, x_j) = y_j, \quad j = 1, \ldots, N,$$

where $G(\cdot)$ is the activation function, $\beta_i$ is the output weight, $a_i = [a_{i1}, \ldots, a_{id}]^T$ is the input weight vector, and $b_i$ is the bias of the $i$th hidden node. G. Huang [19] proved that an SLFN with $L$ nodes can approximate an arbitrary function.
We can rewrite the above expression in matrix form:

$$H\beta = Y,$$

where $\beta = [\beta_1, \ldots, \beta_L]^T \in \mathbb{R}^{L \times m}$ and $Y = [y_1, \ldots, y_N]^T \in \mathbb{R}^{N \times m}$. The hidden layer output matrix $H$ is expressed as

$$H = \begin{bmatrix} G(a_1, b_1, x_1) & \cdots & G(a_L, b_L, x_1) \\ \vdots & \ddots & \vdots \\ G(a_1, b_1, x_N) & \cdots & G(a_L, b_L, x_N) \end{bmatrix} \in \mathbb{R}^{N \times L}.$$

The matrix $H$ is the activation of the hidden layer. The parameters $a_i$ and $b_i$ are both unknown, so training amounts to

$$\arg\min_{a_i, b_i, \beta} \|H\beta - Y\|.$$

In traditional neural networks, this problem is usually solved using a gradient descent-based iterative algorithm. During the iterations, all parameters need to be tuned, which may cause the problems of gradient diffusion, local minima, and overfitting.

ELM and Weighted Kernel ELM
As for ELM, the solution of the parameters is completely different. The parameters a_i and b_i are randomly generated and do not change during the whole procedure. The hidden layer output is determined once the input parameters are produced; based on the input parameters and the hidden layer, the output weights can be derived by a linear analytic solution. The final goal of ELM is to obtain the smallest training error with the smallest norm of the output weights, which is expressed as

$$\arg\min_{\beta} \|H\beta - Y\| \quad \text{and} \quad \arg\min_{\beta} \|\beta\|.$$
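For concreteness, a minimal NumPy sketch of this training scheme is given below: the hidden layer parameters are drawn at random and the output weights are the minimum-norm least-squares solution obtained via the Moore-Penrose pseudoinverse, as in classical ELM. The function names and the sigmoid activation are our illustrative choices, not code from the paper.

```python
import numpy as np

def elm_fit(X, Y, L, seed=0):
    """Basic ELM: random hidden layer, least-squares output weights.
    X: (N, d) inputs; Y: (N, m) one-hot targets; L: number of hidden nodes."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(-1.0, 1.0, size=(X.shape[1], L))  # random input weights a_i
    b = rng.uniform(-1.0, 1.0, size=L)                # random biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))            # hidden layer output matrix H
    beta = np.linalg.pinv(H) @ Y                      # minimum-norm solution of H beta = Y
    return A, b, beta

def elm_predict(X, A, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return (H @ beta).argmax(axis=1)                  # class with the largest output
```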
Based on optimization theory, the ELM training problem can be formulated as follows:

$$\min_{\beta, \xi} \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\sum_{i=1}^{N}\|\xi_i\|^2 \quad \text{s.t.} \quad h(x_i)\beta = y_i^T - \xi_i^T, \; i = 1, \ldots, N,$$

where $h(x) = [G(a_1, b_1, x), \ldots, G(a_L, b_L, x)]$, $\xi_i$ is the training error of the $i$th sample, and $C$ is the regularization parameter. According to Lagrange multiplier theory and the Karush-Kuhn-Tucker (KKT) optimization conditions [40], training the ELM is equivalent to solving the following dual optimization problem:

$$L_{ELM} = \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\sum_{i=1}^{N}\|\xi_i\|^2 - \sum_{i=1}^{N}\sum_{j=1}^{m}\alpha_{i,j}\big(h(x_i)\beta_j - y_{i,j} + \xi_{i,j}\big),$$

where $\beta_j$ is the $j$th column vector of the matrix $\beta$ and $\alpha_{i,j}$ is the Lagrange multiplier. From the KKT conditions (setting the derivatives with respect to $\beta$, $\xi_i$, and $\alpha_i$ to zero), the output weight $\beta$ can be expressed as

$$\beta = H^T\left(\frac{I}{C} + HH^T\right)^{-1} Y.$$

After obtaining the output weight $\beta$, the output of the ELM is obtained as

$$f(x) = h(x)\beta = h(x)H^T\left(\frac{I}{C} + HH^T\right)^{-1} Y.$$

Traditional ELM does not take the imbalance problem into account, while the weighted ELM was designed to address it [41]. Two weighting schemes were proposed.

Scheme 1:

$$W_{ii} = \frac{1}{t_k},$$

where $t_k$ is the total number of samples belonging to the $k$th class, to which the sample $x_i$ belongs. After applying weighting scheme 1, we obtain a balanced ratio between the minority and the majority.

Scheme 2:

$$W_{ii} = \begin{cases} \dfrac{0.618}{t_k}, & t_k > t_{avg} \\[4pt] \dfrac{1}{t_k}, & t_k \le t_{avg}, \end{cases}$$

where $t_{avg}$ represents the average number of samples over all classes. If the number $t_k$ is below the average, the sample keeps the full weight $1/t_k$; otherwise, the weight is reduced by the golden ratio 0.618. Similar to ELM, the optimization form of the weighted ELM can be expressed as

$$\min_{\beta, \xi} \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\sum_{i=1}^{N} W_{ii}\|\xi_i\|^2 \quad \text{s.t.} \quad h(x_i)\beta = y_i^T - \xi_i^T, \; i = 1, \ldots, N.$$

For the multiclass weighted kernel ELM [41,42], we define a diagonal matrix $W$ whose $i$th diagonal entry is associated with the training sample $x_i$. The output weight $\beta$ can then be expressed as

$$\beta = H^T\left(\frac{I}{C} + WHH^T\right)^{-1} WY.$$

Given a new sample $x$, the output function of the weighted ELM classifier is obtained from $f(x) = h(x)\beta$, that is,

$$f(x) = h(x)H^T\left(\frac{I}{C} + WHH^T\right)^{-1} WY.$$

Similar to SVM kernel methods, the kernel trick can be used here, where a kernel function can replace the inner products $h(x)H^T$ and $HH^T$.
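The two weighting schemes above are straightforward to compute. The following sketch assumes integer class labels 0..m-1 and that every class appears at least once in the training set; the function names are ours.

```python
import numpy as np

def weight_scheme1(y):
    """W_ii = 1/t_k: each sample is weighted by the inverse of its class size."""
    counts = np.bincount(y)        # t_k for every class (all classes assumed present)
    return 1.0 / counts[y]

def weight_scheme2(y):
    """Majority classes (t_k above the average) are damped by the golden ratio."""
    counts = np.bincount(y)
    per_class = np.where(counts > counts.mean(), 0.618 / counts, 1.0 / counts)
    return per_class[y]            # diagonal entries of W, one per training sample
```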
The kernel trick version of the weighted ELM is the weighted kernel ELM. The $N \times N$ kernel version can thus be written as

$$f(x) = \begin{bmatrix} K(x, x_1) \\ \vdots \\ K(x, x_N) \end{bmatrix}^T \left(\frac{I}{C} + W\Omega\right)^{-1} WY,$$

where $\Omega \in \mathbb{R}^{N \times N}$ with $\Omega_{ij} = h(x_i) \cdot h(x_j) = K(x_i, x_j)$. Therefore, the weighted kernel ELM provides a unified solution for networks with different feature mappings and, at the same time, strengthens the impact of minority class samples through the weight matrix.
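The closed form above translates directly into code: the sketch below solves $(I/C + W\Omega)^{-1}WY$ once on the training kernel and predicts by evaluating $K(x, x_j)$ against the training samples. The RBF width sigma, the regularization C, and the helper names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf(A, B, sigma):
    return np.exp(-cdist(A, B, "sqeuclidean") / (2.0 * sigma ** 2))

def wkelm_fit(X, Y, w, C, sigma):
    """Weighted kernel ELM: alpha = (I/C + W Omega)^{-1} W Y, Omega_ij = K(x_i, x_j)."""
    Omega = rbf(X, X, sigma)
    N = X.shape[0]
    return np.linalg.solve(np.eye(N) / C + w[:, None] * Omega, w[:, None] * Y)

def wkelm_predict(X_test, X_train, alpha, sigma):
    return (rbf(X_test, X_train, sigma) @ alpha).argmax(axis=1)
```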

Spatial Feature Extraction
To improve the performance of ELM for HSI classification, guided image filtering is adopted to extract spatial information. The guided image filtering method proposed by He [35] is an explicit filter that acts as an edge-preserving smoothing operator, like the bilateral filter, but with better behavior near edges. Given an input image $p$ and a guidance image $g$, the output image $q$ is a linear transform of $g$ in a window $\omega_o$ around a pixel $o$ with a size of $(2r+1) \times (2r+1)$, where $r$ is the window radius and $u$ indexes the pixels of $\omega_o$:

$$q_u = a_o g_u + b_o, \quad \forall u \in \omega_o,$$

where $a_o$ and $b_o$ are linear coefficients assumed to be constant in $\omega_o$. From this model, we can see that $\nabla q = a \nabla g$, which means that the output $q$ has a gradient similar to that of the guidance image $g$. The coefficients are found by minimizing the following cost function:

$$E(a_o, b_o) = \sum_{u \in \omega_o}\left[(a_o g_u + b_o - p_u)^2 + \varepsilon a_o^2\right],$$

where $\varepsilon$ is a regularization parameter that prevents $a_o$ from becoming too large. The values of $a_o$ and $b_o$ can be obtained by linear regression [40]:

$$a_o = \frac{\frac{1}{|\omega|}\sum_{u \in \omega_o} g_u p_u - \mu_o \bar{p}_o}{\sigma_o^2 + \varepsilon}, \qquad b_o = \bar{p}_o - a_o \mu_o,$$

where $\mu_o$ and $\sigma_o^2$ are the mean and variance of $g$ in the window $\omega_o$, $|\omega|$ is the number of pixels in $\omega_o$, and $\bar{p}_o = \frac{1}{|\omega|}\sum_{u \in \omega_o} p_u$ is the mean of $p$ in $\omega_o$. After obtaining the coefficients $a_o$ and $b_o$, the filtered value $q_u$ can be computed; averaging the estimates of all windows covering a pixel gives the final output image $q$.
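For reference, a compact implementation along the lines of He et al. [35] is sketched below, using box means over the (2r+1) x (2r+1) windows. SciPy's uniform_filter plays the role of the window average, and the last line averages the per-window estimates covering each pixel. This is our sketch, not the paper's released code.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def guided_filter(g, p, r, eps):
    """Guided image filtering: fit q = a*g + b in each window, then average.
    g: 2-D guidance image; p: 2-D input image; r: window radius; eps: regularizer."""
    mean = lambda x: uniform_filter(x, size=2 * r + 1)   # box mean over each window
    mu_g, mu_p = mean(g), mean(p)
    var_g = mean(g * g) - mu_g ** 2                      # variance of g per window
    cov_gp = mean(g * p) - mu_g * mu_p
    a = cov_gp / (var_g + eps)                           # linear regression slope a_o
    b = mu_p - a * mu_g                                  # intercept b_o
    return mean(a) * g + mean(b)                         # average overlapping windows
```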

Proposed Dual-Weighted Kernel ELM-Based Method
In this section, the proposed dual-weighted kernel extreme learning machine for hyperspectral image classification, termed DW-KELM, is described in detail. The joint spatial-spectral information is employed to investigate the performance of the dual-weighted kernel ELM for hyperspectral imagery classification. Figure 1 shows the procedure of the spatial-spectral dual-weighted kernel ELM-based HSI classification.
For the classification task, principal component analysis (PCA) is applied as a pre-processing feature extraction step. The PCs that contain 99% of the information are preserved. The guided filter is then applied to the PCs to obtain a group of spatial features.
Given a pixel $x_i$, which is a sample consisting of spectral characteristics across a continuous range of spectral bands, we denote its spectral and spatial features as $x_i^w$ and $x_i^s$, respectively. The spectral feature vector $x_i^w$ is the original $x_i$, consisting of the spectral reflectance values across all bands. The spatial feature vector $x_i^s$ is extracted by multiple guided image filtering operations. As the first PC contains most of the useful information, we use it as the guidance image; it largely preserves the edge information, while the other PCs serve as input images for guided image filtering. In this way, we obtain groups of spatial features.
Exploiting the information from both the spatial and spectral domains, a kernel method is usually used to perform spatial-spectral classification. For the kernel method, the spectral and spatial features are used to compute spectral and spatial kernels, which are then combined to form a composite kernel.
Once the spatial and spectral features $x_i^s$ and $x_i^w$ are constructed, we can compute the spatial kernel $K_s$ and the spectral kernel $K_w$, as follows:

$$K_s(x_i, x_j) = \exp\left(-\frac{\|x_i^s - x_j^s\|^2}{2\sigma_s^2}\right), \qquad K_w(x_i, x_j) = \exp\left(-\frac{\|x_i^w - x_j^w\|^2}{2\sigma_w^2}\right).$$

Here, we use the Radial Basis Function (RBF) kernel; $\sigma_s$ and $\sigma_w$ are the widths of the respective RBF kernels. The composite kernel is a weighted kernel summation:

$$K = \mu K_s + (1 - \mu) K_w,$$

where $\mu$ balances the spatial and spectral contributions. After the weighted summation composite kernel is computed, the samples are reweighted using the weight matrix $W$, in order to strengthen the impact of the minority class samples. Following this, the dual-weighted kernel ELM model solves

$$f(x) = \begin{bmatrix} K(x, x_1) \\ \vdots \\ K(x, x_N) \end{bmatrix}^T \left(\frac{I}{C} + W\Omega_K\right)^{-1} WY,$$

where $\Omega_K$ is the composite kernel matrix on the training samples and the weight matrix $W$ is the diagonal matrix over the spatial-spectral features given by weighting scheme 2 [41]:

$$W_{ii} = \begin{cases} \dfrac{1}{t_k}, & t_k \le t_{avg} \\[4pt] \dfrac{0.618}{t_k}, & t_k > t_{avg}, \end{cases}$$

where $t_k$ is the total number of samples belonging to the $k$th class. Minority samples are weighted by the inverse of their class size, $1/t_k$, while the golden ratio 0.618 reduces the weight for the majorities.
During the prediction phase, each test sample is assigned to the class with the highest output value among $f(x) = [f_1(x), \ldots, f_m(x)]$, according to the index

$$\text{label}(x) = \arg\max_{q \in \{1, \ldots, m\}} f_q(x).$$

Algorithm (Spatial-spectral dual-weighted kernel ELM for HSI classification)
Input: HSI data set; window radius r; regularization parameter ε; kernel combination coefficient µ; number of hidden nodes L.
Output: the predicted class labels of the test samples.
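Putting the pieces together, a minimal sketch of the DW-KELM training and prediction steps is given below, reusing the RBF and weighting helpers sketched earlier. Whether mu multiplies the spatial or the spectral kernel is our assumption; the paper only states that the composite kernel is a weighted summation.

```python
import numpy as np

def dw_kelm_fit(Ks, Kw, Y, w, mu, C):
    """Dual-weighted kernel ELM on precomputed (N, N) spatial/spectral kernels.
    Ks, Kw: spatial and spectral RBF kernels; Y: (N, m) one-hot labels;
    w: diagonal of the imbalance weight matrix W (scheme 2); mu: kernel weight."""
    K = mu * Ks + (1.0 - mu) * Kw              # weighted summation composite kernel
    N = K.shape[0]
    return np.linalg.solve(np.eye(N) / C + w[:, None] * K, w[:, None] * Y)

def dw_kelm_predict(Ks_test, Kw_test, alpha, mu):
    """Ks_test, Kw_test: (N_test, N_train) kernels between test and training sets."""
    K = mu * Ks_test + (1.0 - mu) * Kw_test
    return (K @ alpha).argmax(axis=1)          # label(x) = argmax_q f_q(x)
```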

Hyperspectral Image Data Sets
The performance of the proposed approach was evaluated using three widely used data sets; namely, Indian Pines, the University of Pavia, and Salinas. The three data sets are publicly available hyperspectral data sets.

Indian Pines
The Indian Pines data set was acquired with the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor in 1992. The image scene contains 145 × 145 pixels and 220 spectral bands covering a spectral range from 0.4 to 2.5 µm, of which 20 channels were discarded due to water absorption. The spatial resolution of the data is 20 m per pixel. The scene contains two-thirds agricultural land and one-third forest or other natural perennial vegetation. Some of the crops present are in early stages of growth, with less than 5% coverage. There are 16 classes and 10,249 labeled samples in the data set in total. The RGB composite image and ground-truth map from the data set are shown in Figure 2.

Pavia University
The Pavia University data set was acquired in 2001 using the Reflective Optics System Imaging Spectrometer (ROSIS) instrument over the urban area surrounding the University of Pavia, Italy. This image scene has a size of 610 × 610 pixels. As some of the samples in Pavia University contain no information, we discarded these parts; thus, the size in our experiment was 610 × 340. The spatial resolution was 1.3 m per pixel. The ROSIS-03 sensor captures 115 spectral bands ranging from 0.43 to 0.86 µm. After removing 12 noisy and water-absorption bands, 103 bands were retained. The data contain nine ground-truth classes: asphalt, meadows, gravel, trees, metal sheets, bare soil, bitumen, bricks, and shadows. There was a total of 42,776 labeled samples. The RGB composite image and ground-truth map from the data set are shown in Figure 3.

Salinas
The Salinas data set was acquired using the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over Salinas Valley, California, USA. It contains 224 bands and 512 × 217 pixels with 3.7 m spatial resolution per pixel. The data contain 16 ground-truth classes, and 12 noisy and water-absorption bands were removed in the experiment. An image of Salinas is shown in Figure 4.



Parameter Settings
The classification performance of the different algorithms was assessed on the testing set using the overall accuracy (OA), which is the number of correctly classified testing samples divided by the number of total testing samples; as well as the average accuracy (AA), which represents the average of the classification accuracies for the individual classes; and the kappa (κ) coefficient, which measures the accuracy of classification agreement. The experiments were conducted using MATLAB R2016b on a computer with a 2.8 GHz dual core and 16 GB RAM.
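The three accuracy measures can be computed from a confusion matrix as follows; a minimal sketch with our own function name, assuming integer labels 0..n_classes-1.

```python
import numpy as np

def oa_aa_kappa(y_true, y_pred, n_classes):
    """OA, AA, and the kappa coefficient from the confusion matrix of the testing set."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)                 # rows: true, cols: predicted
    oa = np.trace(cm) / cm.sum()                       # correct / total
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))         # mean of per-class accuracies
    pe = (cm.sum(axis=1) @ cm.sum(axis=0)) / cm.sum() ** 2  # chance agreement
    return oa, aa, (oa - pe) / (1.0 - pe)              # kappa corrects OA for chance
```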
In the pre-processing stage, the principal components (PCs) which contained more than 99% of the information were chosen; PC1 was used as the guidance image, the other PCs were used as input images, and the step of the window was 2.
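This pre-processing stage could be sketched as follows, assuming an (H, W, B) data cube and the guided_filter function sketched earlier; scikit-learn's PCA with a float n_components keeps the components explaining the requested variance fraction, and r and eps are placeholders for the paper's window radius and regularizer.

```python
import numpy as np
from sklearn.decomposition import PCA

def extract_spatial_features(cube, r, eps):
    """Keep PCs with 99% of the variance, guide-filter them with PC1 as guidance."""
    H, W, B = cube.shape
    pcs = PCA(n_components=0.99).fit_transform(cube.reshape(-1, B)).reshape(H, W, -1)
    guide = pcs[:, :, 0]                        # PC1 preserves most edge structure
    feats = [guided_filter(guide, pcs[:, :, k], r, eps)
             for k in range(1, pcs.shape[2])]   # the other PCs are the input images
    return np.stack(feats, axis=-1)             # one spatial feature map per PC
```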
For the kernel methods, the combination coefficient µ of the composite kernel was set to 0.95, according to our experience. For all kernel-based algorithms, the RBF kernel was used. The parameter $\sigma$ varied in the range $\{2^{-4}, 2^{-3}, \ldots, 2^{4}\}$ and $C$ ranged from $10^{0}$ to $10^{5}$. The number of hidden nodes for the Indian Pines data set was 500, while those for the University of Pavia and Salinas data sets were 1250 and 650, respectively.
In the general ELM method, the sigmoid function was used and the hidden layer parameters $\{(a_i, b_i)\}_{i=1}^{L}$ were randomly generated from the uniform distribution on $[-1, 1]$.

Accuracy of Classification and Analysis
The total number of pixels of Indian Pines available in the reference data was 10,366; however, some classes had only very few labeled samples. To evaluate the performance of the different algorithms in this challenging case, we randomly chose 10% of the labeled samples per class for training; the remaining labeled samples were used for testing. At the same time, for comparison with traditional methods, we also chose 5, 10, 15, 20, 25, and 30 samples per class as training samples, in order to evaluate the effects of the different methods.

Results on the Indian Pines Data Set
The accuracy measures of the ELM, KELM, WKELM, SS-KELM, KELM-CK, ASS-H-DELM, and HCKBoost methods are provided in Table 1. From Table 1, it can be observed that the ELM method only required a few seconds for the hyperspectral classification application. At the same time, ELM provided the worst results, especially for the classes with limited training samples. The KELM method alleviated this to some extent, but not significantly; this demonstrates that the kernel used in kernel ELM is more powerful than the randomly generated feature mapping. For the DW-KELM algorithm, when additional spatial information was available, the dual-weighted framework improved the performance of the classifier, and the accuracy increased dramatically. This conclusion can be clearly seen for classes 1, 7, and 9: these three classes contain very similar spectral information, which led to poor classification results when only a spectral classifier was used. Classes 2, 3, and 4 are corn subclasses and, thus, have very similar spectral curves; however, the spatial information helped to discriminate the subtle differences, so DW-KELM achieved good classification accuracies on corn (more than 95%) and on soybeans (more than 96%). Comparing the time cost of these methods, the ELM method consumed the least time. Three reasons explain this: only spectral information was used, the initial parameters were random, and the network has an analytic solution. With the same parameter settings, the form of the solution determines the computation time. It is very common to use spatial features as an effective supplement. From the classification results for classes 1, 7, and 9 for the SS-KELM, KELM-CK, ASS-H-DELM, and HCKBoost algorithms, we can see improvements from both the use of spatial features and the multiple-kernel side. However, despite the considerable improvement of these methods, the proposed dual-weighted kernel provided more satisfactory results, as the minority class samples were given more consideration.
Further experiments on the performance with different numbers of labeled training samples per class were conducted, using the three previously introduced data sets. The training set was formed by randomly choosing from 5 to 30 samples, with a step of 5; the remaining samples were used as testing sets. As shown in Table 2, the OA, AA, and κ values greatly improved with an increase in the number of training samples. When only spectral information was used, KELM achieved better results than ELM, especially in the condition of extremely small-sized samples. Among the spatial-spectral methods, the proposed DW-KELM showed a significant improvement over the SS-KELM, KELM-CK, ASS-H-DELM, and HCKBoost algorithms. This means that the proposed DW-KELM method is a powerful algorithm for this task, especially for enhancing the performance relating to minority class samples. When the number of training samples was 5 per class, DW-KELM improved the OA by 4.36%, AA by 4.09%, and κ by 3.50%, while with 30 samples per class it improved the OA by 3.29%, AA by 2.8%, and κ by 3.03%, when compared with HCKBoost on the Indian Pines data set. The classification maps for the Indian Pines data set are shown in Figure 5. It can be clearly seen that the classification maps of DW-KELM were more coherent in the homogeneous regions, compared with the ELM, KELM, WKELM, SS-KELM, KELM-CK, ASS-H-DELM, and HCKBoost algorithms. In addition, the spatial-spectral methods provided better results than the spectral methods, in terms of consistent classification results with less noise. In particular, in the application of the dual-weighted KELM, subtle features and minority samples were considered; this improvement typically arises for classes with similar spectral signatures.


Results on the University of Pavia Image Data Set
The classification results for the University of Pavia image are shown in Figure 6, and the accuracy measures are given in Table 3. The total number of pixels available in the reference data was 414,815. Accordingly, a training set of 10% of the samples per class was used. Regarding Table 2, the accuracy measures of the proposed ELM-based technique provided equally competitive and even better classification results when compared to the traditional approaches. The classification maps for the University of Pavia data set are shown in Figure 7; Figure 8a presents the map of ELM, using only spectral information. The accuracy measures for the classification of the University of Pavia image are shown in Table 3, whose first column lists the numbers of samples chosen in the experiment.
From Table 3, we can clearly see that, when spatial information and the dual-weighted KELM were used, the accuracy of classification increased dramatically; for instance, for bare soil, from 84.10% to 97.25% and, for bitumen, from 78.93% to 99.90%. There were two main reasons for this: First, the weight matrix strengthened the importance of minority class samples, which may otherwise be overwhelmed in the presence of many majority class samples; second, the spatial information helped to discriminate samples with similar spectral curves.
When the number of training samples increased, the OA, AA, and κ values improved, as can be clearly seen from Table 4. When only spectral information was used, ELM provided worse results than KELM. Among the joint spatial and spectral information classification methods, DW-KELM provided the best results. When the number of training samples per class was 30, DW-KELM improved the OA by 3.95%, AA by 2.99%, and κ by 2.53% on the University of Pavia image, when compared with the HCKBoost algorithm. It seems that the proposed dual-weighted KELM is not only suitable for data with an imbalanced distribution, but also for balanced data.

Results on the Salinas Image
The classification results of the different methods for the Salinas image are shown in Figure 8. Similar settings as those for the aforementioned images were used. It can be clearly seen that the classification maps of DW-KELM are more spatially coherent in the large homogeneous regions than those of the other methods; further, the results contain little noise. The increasing trends of OA, AA, and κ were also the same as for the Indian Pines and Pavia University images. Among the ELM- or KELM-based approaches, DW-KELM improved the OA by 3.72%, AA by 2.32%, and κ by 2.64%, when compared with HCKBoost, on the Salinas image.


Ablation Study
To evaluate the contribution of each component of our method, ablation experiments were also carried out, with the corresponding variants termed WKELM and SS-KELM, respectively. From Tables 1 and 2, we can clearly see that, without the multiple spatial features, the accuracy is not as high as that of the methods using spatial features. At the same time, when the extra weight is not assigned to each sample, the accuracy also drops. Especially for classes whose training samples are extremely small, for instance, classes 7 and 9 in the Indian Pines image, the weight has a large effect. The same trend can be observed for the Pavia University and Salinas images, as seen in Tables 3-5 and Table 6, respectively.

G-Mean as a Supplementary Measure for Evaluation
Overall accuracy has been widely used to evaluate the performance of classifiers. However, if the samples are imbalanced, it may not provide adequate information regarding the generalization ability of a classifier. For instance, consider a data set with 10 samples belonging to a negative class and 90 samples belonging to a positive class: if all 10 negative samples are misclassified, the overall accuracy still equals 90%, but the G-mean is equal to zero. Thus, we used the G-mean [36] as a supplementary measure to evaluate the performance of the proposed dual-weighted method:

$$G\text{-mean} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}},$$

where TP is the number of correctly classified positive samples, FN is the number of incorrectly classified positive samples, TN is the number of correctly classified negative samples, and FP is the number of incorrectly classified negative samples. From the box plot in Figure 8, the results show that the proposed DW-KELM obtained a more concentrated G-mean, especially on the Indian Pines image, due to its consideration of the importance of the minority samples. In addition, its interquartile range (IQR) was smaller than those of the other methods.
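A sketch of the measure and of the worked example above; the binary form shown here generalizes to the multiclass case by taking the geometric mean of the per-class recalls.

```python
import numpy as np

def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity TP/(TP+FN) and specificity TN/(TN+FP)."""
    return np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))

# Worked example: 90 positives all correct, all 10 negatives misclassified.
# Overall accuracy is 90/100 = 0.9, yet the G-mean collapses to zero.
print(g_mean(tp=90, fn=0, tn=0, fp=10))   # -> 0.0
```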

Conclusions
In this paper, a dual-weighted kernel extreme learning machine was proposed, in order to tackle the hyperspectral imagery classification task. It is more effective when using small-sized samples, as the cumulative errors of the minority samples were previously ignored in traditional ELM algorithms. In particular, the weight matrix W plays an important role in the proposed method: larger weights are assigned to samples from the minority class, thus emphasizing their importance. In addition, as useful supplementary features, the spatial features are fully exploited through the weighted kernel summation. This spatial information contains rich structural features, which help in distinguishing subtle differences between similar classes. The experimental results demonstrated that the proposed DW-KELM method is more accurate than the considered benchmark methods for the classification of hyperspectral imagery.

Data Availability Statement:
The data presented in this study are openly available at http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes.