Meta-XGBoost for Hyperspectral Image Classification Using Extended MSER-Guided Morphological Profiles

Abstract: To investigate the performance of extreme gradient boosting (XGBoost) in remote sensing image classification tasks, XGBoost was first introduced and comparatively investigated for the spectral-spatial classification of hyperspectral imagery using the extended maximally stable extreme-region-guided morphological profiles (EMSER_MPs) proposed in this study. To overcome the potential issues of XGBoost, meta-XGBoost was proposed as an ensemble XGBoost method with classification and regression tree (CART), dropout-introduced multiple additive regression tree (DART), elastic net regression and parallel coordinate descent-based linear regression (linear) and random forest (RaF) boosters. Moreover, to evaluate the performance of the introduced XGBoost approach with different boosters, meta-XGBoost and EMSER_MPs, well-known and widely accepted classifiers, including support vector machine (SVM), bagging, adaptive boosting (AdaBoost), multi-class AdaBoost (MultiBoost), extremely randomized decision trees (ExtraTrees), RaF, classification via random forest regression (CVRFR) and ensemble of nested dichotomies with extremely randomized decision tree (END-ERDT) methods, were considered in terms of classification accuracy and computational efficiency. The experimental results based on two benchmark hyperspectral data sets confirm the superior performance of EMSER_MPs and EMSER_MPs with mean pixel values within regions (EMSER_MPsM) compared to that of morphological profiles (MPs), morphological profiles with partial reconstruction (MPPR), extended MPs (EMPs), extended MPPR (EMPPR), maximally stable extreme-region-guided morphological profiles (MSER_MPs) and MSER_MPs with mean pixel values within regions (MSER_MPsM) features.
The proposed meta-XGBoost algorithm is capable of obtaining better results than XGBoost with the CART, DART, linear and RaF boosters, and it could be an alternative to the other considered classifiers for the classification of hyperspectral images using advanced spectral-spatial features, especially from the perspectives of generalized classification accuracy and model training efficiency.


Introduction
Hyperspectral images can provide detailed spectral information, thereby increasing the possibility of accurately discriminating materials of interest. Furthermore, high resolution (HR) and very high resolution (VHR) sensors enable the analysis of small spatial structures with unprecedented detail. However, the high dimensionality of hyperspectral images may produce the Hughes phenomenon, which is related to the curse of dimensionality in classification tasks [1]. Notably, although HR and VHR data solve the problem of "observing" structural objects and elements, they do not improve the extraction procedure [2]. Therefore, two major challenges, spectral dimensionality and the need for specific spectral-spatial classifiers, have been identified [3,4].
Driven by such challenges, intensive work has been and continues to be performed in the remote sensing community to build accurate classifiers for the classification of hyperspectral images. In particular, support vector machines (SVMs) have shown remarkable performance in terms of classification accuracy in scenarios with a limited number of labeled samples available [5,6], and high performance, generalization, prediction accuracy and operation speed characteristics have been observed for random forest (RaF), rotation forest (RoF), extreme learning machine (ELM), extremely randomized decision trees (ExtraTrees), classification via random forest regression (CVRFR) and ensemble of nested dichotomies with extremely randomized decision tree (END-ERDT) classifiers in many studies [7][8][9][10][11]; these approaches have been applied to data ranging from multispectral to hyperspectral imagery and from optical images to synthetic aperture radar (SAR) and polarimetric SAR (PolSAR) images [7,8,10,[12][13][14][15][16].
Although RaF and regularized greedy forest (RGF) [17] decision tree (DT)-based ensemble learning (EL) methods can provide state-of-the-art results based on many standard classification and ranking benchmarks, gradient-boosting decision trees (GBDT) [18] have recently gained considerable interest due to their superb performance and flexibility in incorporating different loss functions [19][20][21]. As a variant of boosting, the GBDT algorithm represents the learning problem as gradient descent based on an arbitrary differentiable loss function that measures the prediction accuracy of the model for the training set. In comparison to the various applications of GBDT in action and text classification [22], web searching [23], landslide susceptibility assessment [24], image classification [25], and insurance loss modeling and prediction [26], only a limited number of studies have been reported for GBDT with remotely sensed data. For example, the good performance of a hybrid approach involving boosting and bagging procedures called stochastic GBDT (SGBDT) [27] was verified for general land use/land cover classification problems using IKONOS, Landsat ETM+ and Probe-1 hyperspectral images for several study areas in the USA [28]. Additionally, SGBDT provided the most stable results when compared to other generalized additive models and tree-based methods for predicting the presence and basal area of 13 tree species in Utah using Landsat 7 ETM+ and ancillary information [29]. In addition, SGBDT was used to map forest fuel types through airborne laser scanning and IRS LISS-III imagery, and the superiority of SGBDT based on the classification accuracy was demonstrated in comparison to the results of classification and regression tree (CART) and RaF methods [30]. According to the recent work by Elizabeth et al. in tree canopy cover prediction, the performance of RaF and SGBDT models was remarkably similar based on a comparison of the tuning process and model performance [31].
Although the basic concept of GBDT is simple, it is nontrivial to implement the method and achieve good performance in practice. In addition, the major computational cost of training a DT-based ensemble learning (EL) method comes from finding the best split for each leaf, which requires scanning all the training data in the current subtree. Therefore, a typical DT-based EL algorithm (e.g., GBDT, RaF, RGF, ExtraTrees, CVRFR or END-ERDT) that has more than a hundred trees will be time consuming to use for datasets with millions of instances and thousands of attributes. Several parallel algorithms have been proposed to solve the scalability issue of building an EL system with multicore or distributed settings. Since determining the best split of each leaf is the main component that can be parallelized, parallel DT-based EL algorithms can be grouped into classes according to the partitioning approach, such as those that partition the data across cores. Among such scalable implementations, XGBoost is appealing for its explanation ability, tabular data processing and data feature invariance. However, in contrast with well-known shallow methods, such as SVM, RaF, ExtraTrees, bagging, adaptive boosting (AdaBoost) and multi-class AdaBoost (MultiBoost) methods, XGBoost has more critical parameters. For instance, XGBoost with CART, DART and linear boosters has 22, 5 and 5 parameters, respectively [21,36,45]. XGBoost with default parameters cannot guarantee optimal results for all cases. If a version of XGBoost could provide generalized performance and low model complexity, it would be practically appealing. In this sense, an ensemble of XGBoost methods with different boosters is proposed.
In our previous work [9], maximally stable extreme region (MSER)-guided morphological profiles (MSER_MPs and MSER_MPsM, the latter containing mean pixel values within given regions) were proposed to overcome the potential issues of MPs and MPPRs in VHR multispectral image classification tasks. In hyperspectral image classification, potential issues related to computational inefficiency and the generation of highly redundant features may also occur for MSER_MPs and MSER_MPsM. Hence, inspired by the extended morphological profile (EMP) approach [46], an extended version of MSER_MPs called extended maximally stable extreme-region-guided morphological profiles (EMSER_MPs) is proposed for the spectral-spatial classification of hyperspectral images.
The main contributions of this article are as follows: (1) XGBoost was introduced and investigated for spectral-spatial hyperspectral image classification; (2) extended maximally stable extreme-region-guided morphological profiles were proposed for spatial feature extraction from hyperspectral images; and (3) meta-XGBoost was proposed as an ensemble of different boosters with few and simple parameters. In Table 1, we provide the acronyms used in this paper with their corresponding full names.

MSER
The MSER approach is a state-of-the-art local-invariant feature detection method that denotes a set of distinguished regions defined by the extremal property of the intensity function in the regions and the outer boundaries of the regions [47]. Additionally, MSERs have highly desirable properties, such as invariance to monotonic intensity transformation, invariance to adjacency-preserving transformation, stability, multiscale detection ability, and low computational complexity [46,48].
According to the formation of MSERs, the detector incrementally steps through the intensity range of the input image to detect stable regions. Let $I(x) : D \subset \mathbb{Z}^2 \to S$ be an image, a real-valued function on a finite domain $D$ with an adjacency relation $\Lambda \subset D \times D$. In this paper, four-neighborhoods are used, and $p, q \in D$ are adjacent ($p \Lambda q$) if $\sum_{i=1}^{d} |p_i - q_i| \le 1$. A region $J$ is a contiguous subset of $D$, and $\partial J = \{q \in D \setminus J : \exists p \in J : q \Lambda p\}$ represents the outer region boundary, i.e., the set of pixels adjacent to at least one pixel of $J$ but not belonging to $J$; an extremal region $J \subset D$ is a region such that for all $p \in J$ and $q \in \partial J$, either $I(p) > I(q)$ (maximum-intensity region) or $I(p) < I(q)$ (minimum-intensity region). Finally, let $J_1, \ldots, J_{i-1}, J_i, \ldots$ be a sequence of nested extremal regions, where $J_i \subset J_{i+1}$; then, extremal region $J_{i^*}$ is maximally stable if $q(i) = |J_{i+\lambda} \setminus J_{i-\lambda}| / |J_i|$ has a local minimum at $i^*$, where $|\cdot|$ denotes cardinality and $\lambda \in SS = \{0, 1, \ldots, \max(I(x))\}$ is the step size between intensity threshold levels. $\lambda$ determines the number of increments the detector tests for stability; one can think of the $\lambda$ value as the size of a cup used to fill a bucket with water: the smaller the cup is, the larger the number of increments it takes to fill the bucket, where the bucket represents the intensity profile of the region [9].
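The stability criterion $q(i)$ can be illustrated with a toy sequence of nested regions. This is a minimal sketch using hypothetical pixel coordinates, not the full component-tree detector of [47]:

```python
from typing import List, Set, Tuple

def mser_stability(regions: List[Set[Tuple[int, int]]], delta: int) -> List[float]:
    """q(i) = |J_{i+delta} \\ J_{i-delta}| / |J_i| for nested regions J_0 < J_1 < ..."""
    q = []
    for i in range(delta, len(regions) - delta):
        grown = regions[i + delta] - regions[i - delta]  # pixels gained across 2*delta steps
        q.append(len(grown) / len(regions[i]))
    return q

# Nested regions growing with the intensity threshold: a stable region barely
# grows over several threshold steps, giving a local minimum of q.
regions = [
    {(0, 0)},
    {(0, 0), (0, 1)},
    {(0, 0), (0, 1), (1, 0)},
    {(0, 0), (0, 1), (1, 0), (1, 1)},
    {(0, 0), (0, 1), (1, 0), (1, 1), (2, 0)},
    {(x, y) for x in range(4) for y in range(4)},  # region suddenly floods outward
]
scores = mser_stability(regions, delta=1)
```

The local minimum of `scores` marks the most stable region in the sequence; the flooding step at the end produces a sharp increase in `q`.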

EMSER-MPs
Generally, MPs act on the values of pixels and consider the pixel neighborhood determined by a structural element (SE) with a predefined size and shape based on dilation and erosion operators. To adaptively set as many SEs as possible that match the sizes and shapes of all objects in an image, we adopt MSERs to identify maximally stable regions $J^* = \{J_{i^*}, \ldots, J_{M^*}\}$ $(M \le S)$ (objects) and define diverse sizes and shapes of SEs according to the aforementioned properties of MSERs [9]. Therefore, the MSER-guided opening by partial reconstruction (OBR) can be obtained by first eroding the input image using $J_{i^*} \in J^*$ as SEs and then applying the result as a marker in geodesic reconstruction in the dilation phase:

$$\mathrm{OBR}_{J^*}(f) = R^{\delta}_{f}\left(\varepsilon_{J^*}(f)\right).$$

Similarly, the MSER-guided closing by partial reconstruction (CBR) is obtained by complementing the image, obtaining the MSER-guided OBR using $J_{i^*} \in J^*$ as SEs, and complementing the result:

$$\mathrm{CBR}_{J^*}(f) = \left[\mathrm{OBR}_{J^*}\left(f^{C}\right)\right]^{C},$$

where the superscript $C$ represents the image complementing process. In mathematical morphology, the erosion of $f$ by $b$ at any location $(x, y)$ is defined as the minimum value of all the pixels in the neighborhood defined by $b$ ($J_{i^*} \in J^*$ in our case); by contrast, dilation returns the maximum value of the image in the window outlined by $b$. Thus, we can obtain new formulations of the erosion and dilation operators:

$$\varepsilon_{J^*}(f)(x, y) = \min_{(s, t) \in J_{i^*}} f(x + s, y + t), \qquad \delta_{J^*}(f)(x, y) = \max_{(s, t) \in J_{i^*}} f(x + s, y + t).$$

Finally, if the structure elements $J_{i^*} \in J^*$ specified by the MSERs are used to obtain the OBR and CBR profiles, the MSER_MPs of an image $f$ can be defined as [9]:

$$\mathrm{MSER\_MPs}(f) = \left\{\mathrm{CBR}_{J^*}(f),\ f,\ \mathrm{OBR}_{J^*}(f)\right\}, \qquad \mathrm{MSER\_MPsM}(f) = \left\{\mathrm{MSER\_MPs}(f),\ f^{J^*(\mathrm{MSER})}_{\mathrm{mean}}\right\},$$

where $f^{J^*(\mathrm{MSER})}_{\mathrm{mean}}$ represents the composed feature obtained by taking the mean pixel values within the MSER regions. Although the use of MPs could help in creating an image feature set that provides abundant discriminative information, redundancy is still evident in the feature set, especially for hyperspectral images. Therefore, feature extraction can be used to find the most important features first, and the morphological operators can then be applied.
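The region-guided erosion and dilation described above (minimum and maximum over the neighborhood defined by the SE) can be sketched as follows. The SE here is a small, hypothetical set of offsets standing in for an MSER region, and the geodesic reconstruction step is omitted for brevity:

```python
def erode(img, se, x, y):
    # Erosion at (x, y): minimum over the neighborhood defined by the SE
    # (a set of (dx, dy) offsets, here standing in for an MSER-derived region).
    h, w = len(img), len(img[0])
    vals = [img[x + dx][y + dy] for dx, dy in se
            if 0 <= x + dx < h and 0 <= y + dy < w]
    return min(vals)

def dilate(img, se, x, y):
    # Dilation at (x, y): maximum over the same neighborhood.
    h, w = len(img), len(img[0])
    vals = [img[x + dx][y + dy] for dx, dy in se
            if 0 <= x + dx < h and 0 <= y + dy < w]
    return max(vals)

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
se = {(0, 0), (0, 1), (1, 0)}  # hypothetical region-shaped SE
eroded  = [[erode(img, se, x, y)  for y in range(3)] for x in range(3)]
dilated = [[dilate(img, se, x, y) for y in range(3)] for x in range(3)]
```

In a full OBR/CBR pipeline, the eroded image would serve as the marker for geodesic reconstruction by dilation under the original image.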
After principal component analysis (PCA) is performed on the original feature set, EMSER_MPs and EMSER_MPs with mean pixel values within regions (EMSER_MPsM) can be obtained by applying the basic principles of MSER_MPs and MSER_MPsM, as described above, to the first few (usually three) principal components:

$$\mathrm{EMSER\_MPs} = \left\{\mathrm{MSER\_MPs}(\mathrm{PC}_1), \ldots, \mathrm{MSER\_MPs}(\mathrm{PC}_p)\right\}, \qquad \mathrm{EMSER\_MPsM} = \left\{\mathrm{MSER\_MPsM}(\mathrm{PC}_1), \ldots, \mathrm{MSER\_MPsM}(\mathrm{PC}_p)\right\}.$$

Conventional XGBoost
XGBoost stands for extreme gradient boosting, which is a supervised EL algorithm that implements a generalized gradient boosting method with a regularization term to yield accurate models in multicore and distributed settings for classification, regression and ranking tasks [21,36,45,49]. For a given data set composed of $n$ instances and $m$ features, $X = \{x_i\}_{i=1}^{n}$, $x_i \in \mathbb{R}^m$, with labels $y = \{y_i\}_{i=1}^{n}$, $y_i \in \{\omega_j\}_{j \in (1, 2, \ldots, C)}$, where $\omega_j$ represents the $j$th class from $C$ total classes, an ensemble of DTs that uses $K$ additive functions to predict the output can be formed as:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}, \quad \text{where } \mathcal{F} = \left\{f(x) = w_{q(x)}\right\}.$$

Here, $q$ and $T$ represent the structure and the number of leaves in the tree, respectively, and each tree $f_k$ corresponds to an independent structure $q$ and leaf weights $w$. For a given instance, we use the decision rules in the tree structure given by $q$ to classify it into leaves and calculate the final prediction by summing the scores in the corresponding leaves given by $w$. Then, the following regularized objective can be used to learn the set of functions in the ensemble model:

$$\mathcal{L} = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k), \quad \Omega(f) = \gamma T + \frac{1}{2} \lambda \|w\|^2, \quad (8)$$

where $l$ is a differentiable convex loss function that measures the difference between the prediction $\hat{y}_i$ and the target $y_i$, and the second term $\Omega(f)$ describes the complexity of tree $f_k$, in which $\gamma T$ and $\frac{1}{2}\lambda\|w\|^2$ penalize each additional tree leaf and extreme weights, respectively. Unfortunately, Equation (8) includes functions as parameters and cannot be practically optimized using traditional optimization methods in Euclidean space. However, due to the additive training manner of the model, we can state the objective function for the current iteration $t$ in terms of the prediction at the previous iteration $t-1$ adjusted by the newest tree $f_t$:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t). \quad (9)$$

By taking the Taylor expansion of Equation (9) to the first- and second-order gradients of the loss function, we can obtain the following simplified objective function:

$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t), \quad (10)$$

where $g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ and $h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$.
A DT predicts constant values within a leaf; thus, tree $f_t(x)$ can be represented by $w_{q(x)}$, where $w$ is the score vector over the leaves and $q(x)$ maps instance $x$ to a leaf. By expanding the second term in Equation (10), a sum over the tree leaves can be obtained, and the objective becomes:

$$\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T} \left[\left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2\right] + \gamma T, \quad (11)$$

where $I_j = \{i \mid q(x_i) = j\}$ is the set of instances at leaf $j$. For a fixed tree structure, the objective function can be minimized by setting $\partial \tilde{\mathcal{L}}^{(t)} / \partial w_j = G_j + (H_j + \lambda) w_j = 0$, where $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$, and the best weight of leaf $j$ can be obtained by:

$$w_j^{*} = -\frac{G_j}{H_j + \lambda}. \quad (12)$$

By substituting this formula into Equation (11), the objective function for finding the best tree structure then becomes:

$$\tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T. \quad (13)$$

This formula is used in practice for evaluating the split candidates in XGBoost. To find the best split, the exact greedy algorithm and the global (process all the candidate splits during the initial phase, and use the same splitting protocol to find splits at all leaves) and local (re-propose candidates after each split) variant approximation algorithms are run over all the possible splits of all the features [21,36]. Although the local variant approximate algorithm requires fewer candidates than the global algorithm, the results of the global approach can be as accurate as those of the local method given enough candidates. For a distributed tree learning system, although most of the existing approximations use direct calculations of gradient statistics or quantile strategies, XGBoost efficiently supports the exact greedy algorithm for a single-machine setting and both local and global variant approximation methods for all settings [21,45,49].
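The best-weight and split-scoring formulas above can be checked numerically. A minimal sketch for a squared-error loss (so that $g_i = \hat{y}_i - y_i$ and $h_i = 1$; the data values are hypothetical, and this is not the library's implementation):

```python
def best_leaf_weight(grads, hess, lam):
    # w* = -G / (H + lambda) for one leaf, from the second-order objective.
    G, H = sum(grads), sum(hess)
    return -G / (H + lam)

def split_gain(gl, hl, gr, hr, lam, gamma):
    # Gain of splitting a node into left/right children:
    # 1/2 [GL^2/(HL+lam) + GR^2/(HR+lam) - (GL+GR)^2/(HL+HR+lam)] - gamma
    def score(g, h):
        return g * g / (h + lam)
    return 0.5 * (score(gl, hl) + score(gr, hr) - score(gl + gr, hl + hr)) - gamma

# Squared-error loss: g_i = yhat_i - y_i, h_i = 1.
y      = [1.0, 1.0, 0.0, 0.0]
y_prev = [0.5, 0.5, 0.5, 0.5]
grads  = [p - t for p, t in zip(y_prev, y)]
hess   = [1.0] * len(y)

w = best_leaf_weight(grads, hess, lam=1.0)
gain = split_gain(sum(grads[:2]), 2.0, sum(grads[2:]), 2.0, lam=1.0, gamma=0.0)
```

For this balanced toy node the gradients cancel ($G = 0$), so the unsplit leaf weight is 0, while splitting the two classes apart yields a positive gain.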
In XGBoost with a DART booster, suppose $k$ trees are dropped during the $m$th training round. Let $D = \sum_{k \in K} F_k$ be the leaf scores of the dropped trees and $F_m = \eta \tilde{F}_m$ be the leaf scores of a new tree; then, the objective function in Equation (9) can be reformulated as:

$$\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(m-1)} - D_i + \tilde{F}_m(x_i)\right) + \Omega(\tilde{F}_m), \quad (14)$$

where $D$ and $F_m$ introduce overshooting and need to be normalized in practice; XGBoost supports tree- and forest-based normalization techniques. For XGBoost with a linear booster, the objective function is defined as:

$$\mathcal{L} = \sum_{i=1}^{n} l(\hat{y}_i, y_i) + \lambda \|\omega\|^2 + \lambda_b b^2 + \alpha \|\omega\|_1, \quad (15)$$

where $\hat{y} = \omega^{\top} x + b$, $\omega = (\omega_1, \omega_2, \ldots, \omega_d)$ is a linear model, $d$ is the dimension of the features, $\lambda$ is the $\ell_2$ regularization term on $\omega$, $\lambda_b$ is the $\ell_2$ regularization term on the offset coefficient $b$, and $\alpha$ is the $\ell_1$ regularization term on $\omega$.
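As a rough numerical illustration of the linear booster's elastic-net-style objective, the sketch below evaluates Equation-(15)-like terms with a squared-error loss standing in for the unspecified convex loss $l$; all weights and data are hypothetical:

```python
def gblinear_objective(preds, targets, w, b, lam, lam_b, alpha):
    # Data term (squared error assumed here) plus the three penalties:
    # lam * ||w||^2  (L2 on weights), lam_b * b^2 (L2 on offset),
    # alpha * ||w||_1 (L1 on weights).
    loss = sum((p - t) ** 2 for p, t in zip(preds, targets))
    l2 = lam * sum(wi * wi for wi in w) + lam_b * b * b
    l1 = alpha * sum(abs(wi) for wi in w)
    return loss + l2 + l1

w, b = [1.0, -2.0], 0.5
X = [[1.0, 0.0], [0.0, 1.0]]
targets = [1.5, -1.5]
preds = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in X]  # yhat = w.x + b
obj = gblinear_objective(preds, targets, w, b, lam=0.1, lam_b=0.1, alpha=0.01)
```

Here the model fits the two points exactly, so the objective consists purely of the regularization terms.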

Meta-XGBoost
According to the definition and literature studies, XGBoost is a stronger learner than CART, C4.5 and linear regression and may also be stronger than RaF, RGF, GBDT, LightGBDT and SGBDT learners in some classification and regression tasks [21,[31][32][33][34][35][36]42,43]. However, according to the limitations described in the Introduction, XGBoost is still not capable of providing generalized performance for all cases. Hence, there is still a need for a modified version of XGBoost with a small computational cost and generalized performance that is able to work efficiently for linear and nonlinear samples without experiencing overfitting problems. This objective might be achieved by building an ensemble system using the CART, DART, linear and RaF boosters. The framework of majority voting (MV) can be employed:

$$\varepsilon^{*} = \sum_{i=k}^{n} \binom{n}{i} \varepsilon^{i} (1 - \varepsilon)^{n-i}, \quad (16)$$

where $\varepsilon^{*}$ represents the classification error rate of an ensemble of $n$ classifiers with individual classification error $\varepsilon$, $k = \lfloor n/2 \rfloor + 1$, and $\lfloor \cdot \rfloor$ denotes the floor operation. Theoretically, $\varepsilon^{*}$ will monotonically decrease to 0 as $n \to \infty$, provided that $\varepsilon < 0.5$. However, the number of classifiers is only four in our case, and simply considering the MV ensemble may limit or even degrade algorithm performance by leading to the scenario of $\varepsilon^{*} \geq \varepsilon$.
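The MV error expression above can be checked numerically with a short sketch (independent, equally accurate voters are assumed, as in the binomial formula):

```python
from math import comb, floor

def mv_error(eps: float, n: int) -> float:
    # eps* = sum_{i=k}^{n} C(n, i) eps^i (1 - eps)^(n - i), k = floor(n/2) + 1:
    # the probability that more than half of n independent classifiers err.
    k = floor(n / 2) + 1
    return sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(k, n + 1))
```

For an individual error below 0.5 the ensemble error shrinks as voters are added, while for an individual error above 0.5 majority voting makes things worse, which is exactly the degradation risk noted for small ensembles.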
In this scenario, a metaboost ensemble might yield the best solution; in this approach, MV occurs first, and the best strategy among all strategies is then selected:

$$\varepsilon_{meta} = \min\left\{\varepsilon_{xgb-mv},\ \varepsilon_{xgb-c},\ \varepsilon_{xgb-d},\ \varepsilon_{xgb-l},\ \varepsilon_{xgb-raf}\right\}, \quad (17)$$

where $\varepsilon_{xgb-mv}$ represents the classification error using MV and $\varepsilon_{xgb-c}$, $\varepsilon_{xgb-d}$, $\varepsilon_{xgb-l}$ and $\varepsilon_{xgb-raf}$ represent the classification errors of XGBoost with the CART, DART, linear and RaF boosters, respectively. Now, we can obtain a decision function for the meta-XGBoost classifier as:

$$h_{meta}(x) = h_{s^{*}}(x), \quad s^{*} = \arg\min_{s \in \{xgb-mv,\ xgb-c,\ xgb-d,\ xgb-l,\ xgb-raf\}} \varepsilon_{s}, \quad (18)$$

where $h_{xgb-mv}(x)$, $h_{xgb-c}(x)$, $h_{xgb-d}(x)$, $h_{xgb-l}(x)$ and $h_{xgb-raf}(x)$ are the decision functions for MV and for XGBoost with the CART, DART, linear and RaF boosters, respectively.
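The meta selection step reduces to an argmin over per-strategy errors. A minimal sketch with hypothetical validation-error values (the real errors come from the trained boosters):

```python
def meta_select(val_errors: dict) -> str:
    # Pick the strategy (MV or a single booster) with the lowest error.
    return min(val_errors, key=val_errors.get)

# Hypothetical classification errors for the five candidate strategies.
errors = {"mv": 0.042, "cart": 0.051, "dart": 0.047, "linear": 0.088, "raf": 0.045}
best = meta_select(errors)
```

The prediction of meta-XGBoost is then the decision function of the selected strategy; when MV itself degrades (its error exceeds the best single booster), the selection falls back to that booster automatically.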

ROSIS Pavia University Data Set
ROSIS Pavia University hyperspectral images were acquired with the ROSIS optical sensor, which provides 115 bands with spectral coverage ranging from 0.43 to 0.86 µm. The geometric resolution is 1.3 m. The image shown in Figure 1a was captured over the Engineering School, University of Pavia, Pavia, Italy. The image has 610 × 340 pixels with 103 spectral channels (a few original bands were very noisy and were discarded immediately after data acquisition). The validation data refer to nine land cover classes and are shown in Figure 1, with details about the number of samples given in Table 2.

GRSS-DFC2013 Houston Data Set

The second hyperspectral image was acquired at a spatial resolution of 2.5 m by the NSF-funded Center for Airborne Laser Mapping (NCALM) over the University of Houston campus and the neighboring urban area on 23 June 2012. The image has 349 × 1905 pixels with 144 spectral bands in the spectral range between 380 and 1050 nm. The 15 classes of interest selected by the Data Fusion Technical Committee (DFTC) of the IEEE Geoscience and Remote Sensing Society (GRSS) are shown in Figure 2 and reported in Table 3 with the corresponding numbers of samples for both the training and validation sets [50].

Experimental Setup
To evaluate the performance of XGBoost and the meta-XGBoost method proposed in this work, SVM, bagging, RaF, AdaBoost, MultiBoost, CVRFR, ExtraTrees and END-ERDT algorithms were adopted [6,[9][10][11]44,51,52]. The critical parameters (e.g., minimum and maximum leaf sizes, maximum tree depth, tree pruning and smoothing rates, and tree size) of the DT-based EL classifiers were set by referring to the corresponding suggestions from the literature. The free parameters for the radial basis function (RBF) kernel-based SVM were tuned by 10-by-10 grid search optimization, with a search range of 0-1000 for gamma and 1-1000 for the cost factor.
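The 10-by-10 grid search over the RBF-SVM parameters can be sketched as follows. Linear spacing over the stated ranges is assumed here; the paper does not specify whether the grid is linear or logarithmic:

```python
def grid(lo: float, hi: float, n: int = 10):
    # n evenly spaced candidate values covering [lo, hi] inclusive.
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

gammas = grid(0.0, 1000.0)   # search range for the RBF gamma
costs  = grid(1.0, 1000.0)   # search range for the cost factor C
pairs  = [(g, c) for g in gammas for c in costs]  # the 10-by-10 grid
```

Each (gamma, C) pair would then be evaluated (e.g., by cross-validation) and the best-scoring pair retained.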
To analyze the performance of the proposed spatial feature extractors, MPs, MPPRs, MSER_MPs, MSER_MPsM, EMPs and EMPPRs were applied to the two benchmark hyperspectral data sets presented previously [9,46,53,54]. To generate MP and MPPR features from each data set, we applied a disk-shaped SE with n = 10 openings and closings based on conventional and partial reconstruction methods; the value of n was varied from one to ten with a step size of one. Thus, we obtain a total of 2163 = 103 + 103 × 10 × 2 and 3024 = 144 + 144 × 10 × 2 dimensional data sets using the original spectral bands and a total of 70 = 10 + 3 × 10 × 2 and 67 = 7 + 3 × 10 × 2 dimensional data sets using the PCA-transformed features for ROSIS and GRSS-DFC2013, respectively. For a fair comparison, we varied the threshold from 100 to 1000 with a step of 100 for selecting the objects in the MSER_MP and EMSER_MP feature extraction phases. Notably, MSER_MPsM and EMSER_MPsM, which contain extra mean pixel values within objects, will yield 3193 = 103 + 103 × 10 × 3, 4464 = 144 + 144 × 10 × 3, 100 = 10 + 3 × 10 × 3 and 97 = 7 + 3 × 10 × 3 dimensional data sets using the original spectral bands and PCA-transformed features for ROSIS and GRSS-DFC2013, respectively.
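The feature dimensionalities quoted above all follow one formula (base features plus profiled bands × scales × stacks, where 2 stacks cover the opening/closing profiles and 3 stacks add the mean-pixel-value feature), which can be verified directly:

```python
def profile_dim(base: int, profiled: int, scales: int, stacks: int) -> int:
    # base: spectral (or PCA) features kept as-is; profiled: bands that get
    # morphological profiles; scales: number of SE sizes n; stacks: 2 for
    # opening+closing, 3 when the mean-pixel-value feature is added.
    return base + profiled * scales * stacks

rosis_raw   = profile_dim(103, 103, 10, 2)  # MPs/MPPRs on all 103 bands
dfc_raw     = profile_dim(144, 144, 10, 2)
rosis_pca   = profile_dim(10, 3, 10, 2)     # profiles on the first 3 PCs
dfc_pca     = profile_dim(7, 3, 10, 2)
```

The same function reproduces the MSER_MPsM/EMSER_MPsM sizes with `stacks=3`.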
All the experiments were performed using R 3.5.0 software on a Windows 10 64-bit system run on an Intel Core™ i7-7820X CPU at 3.60 GHz with 64 GB of RAM. The overall accuracy (OA), kappa statistic (k) and CPU run time for training were used to evaluate the classification performance of all the considered methods.

Parameter Configuration in XGBoost
XGBoost with the CART booster has more than 20 parameters, as mentioned above, but the learning rate (η, [0,1]), minimum split loss (γ, [0,∞]), maximum tree depth ([0,∞]) and subsample ratio of training samples ([0,1]) are the most important parameters [21,36,45]. Figure 3 illustrates the OA values versus combinations of these four parameters based on PCA10 features from the ROSIS University dataset. Notably, each OA surface graph was obtained by varying a combination of two parameters, with the other two parameters set to the defaults suggested on the official website of XGBoost (https://xgboost.readthedocs.io/en/latest/parameter.html). First, according to the plots in the first row, the optimum range for the learning rate used to prevent overfitting is between 0.2 and 0.4. A small step size shrinks the feature weights and makes the boosting process relatively conservative, whereas a large step size will not prevent overfitting. For the minimum split loss, a small value is always the best option, as shown in the plots in Figure 3a,d,e. In contrast, a large tree depth is always optimal for DTs to construct the best possible model. However, a large tree depth will make the model more complex and more likely to experience overfitting than a small depth. Thus, an optimum range of between 5 and 10 for the maximum tree depth is recommended. In our next experiments, the maximum depth of trees was set to eight as the default, both to efficiently train a low-complexity model and to avoid potential overfitting. Additionally, according to the results for the sampling ratio parameter, there is no obvious impact on the classification accuracy. However, to prevent possible overfitting, especially in the limited-sample training scenario, and to maintain booster diversity in different boosting iterations, the optimum range for the sampling ratio should be between 0.5 and 1.
Figure 4 presents the OA and CPU time consumption in seconds versus the dropout rate ([0,1]; the fraction of previous trees dropped during the dropout period) and the probability of skipping the dropout procedure during a boosting iteration ([0,1]) for the DART booster, where different tree sampling algorithms (uniform and weighted) and normalization schemes (tree based: new trees have the same weight as each dropped tree; forest based: new trees have the same weight as the sum of the dropped trees) are combined in four ways. Notably, both the dropout rate and the probability of skipping have significant impacts not only on the classification accuracy but also on the computational efficiency. Specifically, the probability of skipping does not have a significant influence on the classification accuracy under uniform sampling with either tree or forest normalization (see Figure 4a,b) but has a significant influence under weighted sampling with either normalization (see Figure 4c,d). By comparing uniform versus weighted sampling and tree versus forest normalization, it can be concluded that uniform sampling with tree or forest normalization is the best solution for building a model with optimal performance. For the dropout rate, a small value is always better than a large value, which is in accordance with the findings of previous works [21,36,46]. Additionally, a small dropout rate with a large probability of skipping is optimal for efficient model training. Therefore, a dropout rate of 0.1 (if the rate is set to 0, DART degrades to MART), a probability of skipping of 0.5, and uniform sampling with tree normalization were used as defaults for XGBoost with the DART booster in subsequent experiments.
Figure 5 presents the influence of the critical parameters, including the ℓ1 and ℓ2 regularization terms α and λ, based on the shuffle and cyclic methods of feature selection and ordering for the linear booster. According to the plots in Figure 5, the ℓ2 regularization term λ influences both the classification accuracy and the training efficiency, and the optimum value is 0. There is no obvious influence on the classification accuracy or model training efficiency for the ℓ1 regularization term α with the cyclic method of feature selection and ordering.
Figure 6 illustrates the OA and CPU time consumption versus the number of trees and the number of boosting iterations for XGBoost with the RaF booster based on raw features from the ROSIS University and DFC2013 datasets. Note that if early stopping is not adopted (which is the case in our experiments), the final model will consist of the number of trees in the RaF multiplied by the number of boosting iterations.

As shown in Figure 6, the number of boosting iterations has a greater influence on the classification accuracy than the size of the RaF, and the computational complexity grows with the product of the RaF size and the number of boosting iterations. Moreover, going beyond approximately 100 boosting iterations does not improve the classification accuracy and is computationally expensive. Hence, the numbers of trees and boosting iterations were set to 10 and 100, respectively, for XGBoost with the RaF booster in subsequent experiments.
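For readers reproducing this setup, the tree-count relationship above can be illustrated with a hypothetical XGBoost parameter sketch. The parameter values below are illustrative only (they match the tree counts discussed here, not our full experimental configuration), although the parameter names themselves are real XGBoost options:

```python
# Hypothetical parameter sketch for the RaF booster in XGBoost.
# Without early stopping, each boosting iteration grows a whole
# forest of `num_parallel_tree` trees, so the final model holds
# num_parallel_tree * num_boost_round trees (per class).
params = {
    "booster": "gbtree",
    "num_parallel_tree": 10,   # size of the random forest per iteration
    "subsample": 0.8,          # row subsampling, as in a random forest
    "colsample_bynode": 0.8,   # feature subsampling at each split
    "learning_rate": 1.0,
}
num_boost_round = 100          # boosting iterations

total_trees = params["num_parallel_tree"] * num_boost_round
print(total_trees)  # 1000 trees in the final ensemble
```

With the settings chosen above (10 trees, 100 iterations), the final ensemble therefore contains 1000 trees, which is why larger values quickly become computationally expensive.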

Classification Accuracy
In Figures 7 and 8, we present the OA results of the considered classifiers with increasing ensemble size for the various features from the ROSIS University and DFC2013 hyperspectral data sets, respectively. Each point on the x-axis represents the number of trees in the conventional RaF and the number of boosting iterations for meta-XGBoost and for XGBoost with the CART, DART, linear and RaF boosters. The left y-axis represents the overall classification accuracies of meta-XGBoost and of XGBoost with the CART, DART and RaF boosters, and the right y-axis represents the overall classification accuracies of the conventional RaF combined with XGBoost with a linear booster (Figure 7) or of XGBoost with a linear booster alone (Figure 8). Notably, XGBoost with the linear booster and the conventional RaF classifier displayed large variations in classification accuracy, in contrast with the results of meta-XGBoost and of XGBoost with the CART, DART and RaF boosters. With only a single y-axis, it would be difficult to visually distinguish the meta-XGBoost and XGBoost methods with the CART, DART and RaF boosters, which display only small variations in classification accuracy. Figure 9 presents the OA and CPU time consumption in seconds for the considered classifiers and all the features from the ROSIS University and DFC2013 hyperspectral data sets.
From the results shown in Figures 7-9, differences in the classification accuracy of XGBoost with the CART, DART, linear and RaF boosters are clear across the different datasets and across features from the same dataset, as expected. For instance, XGBoost with the linear booster exhibited the highest OA values for MPPR features extracted from the raw bands of the ROSIS University data at small numbers of boosting iterations (see Figure 7c) but yielded the worst OA values for PC10 features, with the number of boosting iterations having neither a positive nor a negative influence (see Figure 7f). When comparing the results of XGBoost with the CART, DART and RaF boosters, XGBoost with the RaF booster is more stable than XGBoost with the CART and DART boosters in most cases and generally displays better performance, as illustrated by the blue lines with circle markers.
This result is reasonable because, both theoretically and practically, the RaF ensemble classifier is stronger than a single CART or DART classifier. Other studies have found that the DART booster is superior to the CART booster and that XGBoost is superior to the conventional RaF classifier; these findings were not consistently observed in our experiments.
In contrast, XGBoost with the DART booster displays larger variations in OA values than XGBoost with the CART booster, as illustrated by the blue lines with upward-pointing triangles in Figures 7a-j and 8a,f-h. Notably, although the dropout technique introduced in the DART booster can overcome the overfitting issue of CART, it may also introduce instability, especially for scenarios with large numbers of boosting iterations. According to the theory of EL, if diversity exists among classifiers that each perform better than random guessing, improvements can always be achieved with an ensemble system. Comparing the OA results of the proposed meta-XGBoost method with those of XGBoost with the CART, DART, linear and RaF boosters, higher and more stable results can be observed for meta-XGBoost in almost all cases, as shown by the blue dotted lines with cross markers in Figures 7 and 8. A comparison of the OA bars in Figure 9a,b suggests that better results can be obtained with meta-XGBoost than with the SVM, AdaBoost, MultiBoost, RaF, ExtraTrees, END-ERDT and CVRFR classifiers on both experimental datasets; this finding is supported by the results in Tables 4 and 5. Specifically, this finding is evident in cases that use advanced spectral-spatial features, including the MP, MPPR, MSER_MP and MSER_MPsM methods and their extended versions. Accordingly, the XGBoost classifier can be boosted further by using an ensemble of the four boosters.
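The fusion step of such an ensemble can be sketched as a plurality vote over the per-booster class predictions. The following minimal Python illustration assumes the four boosters' label predictions are already available; the variable names are placeholders, and this sketch illustrates the ensemble idea rather than the exact fusion rule of meta-XGBoost:

```python
from collections import Counter

def majority_vote(per_booster_predictions):
    """Fuse class labels predicted by several boosters for the same
    pixels by plurality vote (ties broken by first label encountered)."""
    n_samples = len(per_booster_predictions[0])
    fused = []
    for i in range(n_samples):
        votes = [preds[i] for preds in per_booster_predictions]
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused

# Toy labels from hypothetical CART, DART, linear and RaF boosters
cart   = ["road", "tree", "roof", "road"]
dart   = ["road", "tree", "road", "road"]
linear = ["tree", "tree", "roof", "road"]
raf    = ["road", "grass", "roof", "road"]

print(majority_vote([cart, dart, linear, raf]))
# ['road', 'tree', 'roof', 'road']
```

Because the four boosters make partially uncorrelated errors, the fused prediction can be correct even when an individual booster (e.g., the DART booster on the third pixel above) is wrong, which is the intuition behind the EL improvement noted above.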

Computational Efficiency
Computational efficiency is a key factor when evaluating classifier performance. In accordance with the plots in Figure 9a,b, which show the classification accuracies, the plots in Figure 9c,d show the CPU time in seconds for the training phase of the considered classifiers using all the considered features. Because the free parameters of all the ensemble classifiers were set as constants before model training, the CPU time consumed by the 10-by-10 grid search optimization procedure for the SVM was excluded for a fair comparison. The numbers of trees in the bagging, RaF, ExtraTrees, CVRFR and END-ERDT methods and the numbers of boosting iterations in AdaBoost, MultiBoost, meta-XGBoost and XGBoost with the CART, DART and linear boosters were set to 100 by default; additionally, the number of parallel trees in XGBoost with the RaF booster was set to 10.
When considering the influence of data dimensionality, high dimensionality always increases the model training time, especially for the SVM, bagging, AdaBoost and MultiBoost methods. In contrast, based on the bar plots for the classifiers with all the considered features, ExtraTrees trains fastest on low-dimensional data. The bar plots for the PC10 features and the spatial features extracted from the first three principal components also reflect this trend. This result is in accordance with the findings of our previous works [9,11]. As a highly efficient and scalable algorithm, XGBoost with the CART, DART and linear boosters is much more efficient (at least five times faster) than the SVM, AdaBoost and MultiBoost methods but less efficient than the conventional RaF and ExtraTrees classifiers, especially for datasets with high dimensionality. Combined with the results from the previous subsection, meta-XGBoost can be an alternative to state-of-the-art classifiers, including RBF kernel-based SVMs, AdaBoost and MultiBoost, in terms of generalized classification accuracy and model training efficiency, especially for high-dimensional data such as hyperspectral imagery.
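As an aside for reproduction, training-phase CPU time (as opposed to wall-clock time) can be recorded in Python with `time.process_time`. The training routine below is a stand-in placeholder that merely consumes CPU, not one of the considered classifiers:

```python
import time

def train_stand_in_model(n_iterations=200_000):
    """Placeholder for a classifier's training routine; burns CPU only."""
    acc = 0.0
    for i in range(1, n_iterations):
        acc += 1.0 / i
    return acc

start = time.process_time()
train_stand_in_model()
cpu_seconds = time.process_time() - start  # CPU seconds spent "training"
print(f"training CPU time: {cpu_seconds:.4f} s")
```

Using CPU time rather than wall-clock time makes the comparison less sensitive to background load on the test machine, although it credits multi-threaded learners with the time spent on all cores.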

Performance of EMSER_MPs
In our previous work, the superiority of MSER_MPs over the conventional MP and MPPR methods was verified for VHR remote sensing images of urban areas based on both visual interpretation and classification [9]. Here, we analyze the performance of EMSER_MPs from the perspective of classification accuracy. According to the results shown in Figures 7-9, the OA values of MSER_MPs and MSER_MPsM are higher than those of the raw, MP and MPPR features. This result was also observed for EMSER_MPs and EMSER_MPsM in all experiments with the considered datasets. For instance, classifiers including meta-XGBoost, XGBoost with the RaF booster and RaF yielded classification accuracies higher than 95% on average with EMSER_MPs and EMSER_MPsM features from the ROSIS University data; in comparison, the maximum classification accuracy based on EMPs and EMPPR features was 93.69% for meta-XGBoost and XGBoost with the DART booster and 90.31% for meta-XGBoost and XGBoost with the RaF booster (see the results in Figures 7 and 9a and Table 4). Similarly, the classification performance of EMSER_MPs and EMSER_MPsM was superior to that of EMPs and EMPPR in the experiments with the DFC2013 data set, as can be observed in Figures 8 and 9b and Table 4. For example, the highest classification accuracy (OA = 85.07%) was reached by the SVM using EMP features, but the second-best value (OA = 84.92%) was obtained with the proposed meta-XGBoost method using EMSER_MP features. Comparing the results from all other classifiers using EMSER_MP and EMSER_MPsM features, as shown by the magenta and orange bars in Figure 10 and the last two columns of Table 4, the classification accuracies reached with these two feature sets are generally higher than those of the other features in most cases.
Thus, we can conclude that the proposed EMSER_MP approach can be an alternative to state-of-the-art spatial feature extractors, including MPs, EMPs, MPPR, EMPPR and MSER_MPs, for hyperspectral image classification.

Classification Maps
Finally, in Figure 10, we present the classification maps with OA values for selected classifiers, including the SVM; XGBoost with the CART, DART and linear boosters; and meta-XGBoost, using various features from the ROSIS University data set. In addition, Figure 11 shows the classification maps with OA values for the SVM and the proposed meta-XGBoost method using all the considered features from the second experimental data set. Due to space limitations and for clear visualization, only the results of selected methods are shown in Figures 10 and 11; Tables 4 and 5 report the overall classification accuracies and kappa coefficient values for all classifiers using all the considered features.

Conclusions
According to the literature, XGBoost has shown remarkable performance in some classification, regression and ranking tasks. However, the use of XGBoost has not been extensively investigated in the remote sensing image classification context with spectral and spectral-spatial features. Additionally, several issues remain for XGBoost with different boosters in practical applications: potential overfitting, reduced training efficiency, decreased predictive performance, unstable early stopping, and a limited ability to solve nonlinearly separable problems. In this regard, a novel version of XGBoost, meta-XGBoost, was proposed to overcome these issues.
According to the results, the following conclusions can be drawn. First, the proposed EMSER_MP features outperform the MP, MPPR, MSER_MP, EMP and EMPPR features. Furthermore, some previous findings suggested that XGBoost with a DART booster is superior to XGBoost with a CART booster and that XGBoost is superior to conventional RaF methods in the spectral-spatial classification of hyperspectral images; however, the classification accuracy of the SVM, AdaBoost, MultiBoost, ExtraTrees and END-ERDT classifiers was better than that of XGBoost with the CART, DART, linear and RaF boosters in some cases. Additionally, XGBoost with the RaF booster yielded a higher classification accuracy than XGBoost with the CART booster, but the best results were consistently obtained by meta-XGBoost, especially when advanced features were used. Finally, based on both the generalized classification accuracy and the computational efficiency of model training, the proposed meta-XGBoost classifier could be an alternative to state-of-the-art classifiers, including RBF kernel-based SVM, AdaBoost and MultiBoost classifiers, especially for high-dimensional and nonlinearly separable data such as the hyperspectral imagery used in spectral-spatial classification.