Assessing Different Feature Sets’ Effects on Land Cover Classification in Complex Surface-Mined Landscapes by ZiYuan-3 Satellite Imagery

Land cover classification (LCC) in complex surface-mined landscapes has become very important for understanding the influence of mining activities on the regional geo-environment. Three characteristics of complex surface-mined areas limit LCC: significant three-dimensional terrain, strong temporal-spatial variability of surface cover, and spectral-spatial homogeneity. Thus, determining effective feature sets as input data is very important for improving both the level of detail of classification schemes and the classification accuracy. In this study, data such as various feature sets derived from ZiYuan-3 stereo satellite imagery, a feature subset resulting from a feature selection (FS) procedure, training data polygons, and test sample sets were first obtained; then, the effects of the feature sets on classification accuracy were assessed based on different feature set combination schemes, an FS procedure, and the random forest algorithm. The following conclusions were drawn. (1) The importance of the feature sets could be divided into three grades: the vegetation index (VI), principal component bands (PCs), mean filters (Mean), standard deviation filters (StDev), texture measures (Textures), and topographic variables (TVs) were important; the Gaussian low-pass filters (GLP) were merely positive; and none were useless. The descending order of their importance was TVs, StDev, Textures, Mean, PCs, VI, and GLP. (2) TVs and StDev both significantly outperformed VI, PCs, GLP, and Mean; Mean outperformed GLP; no other pairs of feature sets differed significantly. In general, the study assessed the effects of different feature sets on LCC in complex surface-mined landscapes.


Introduction
Land cover datasets are basic components for global change studies and various applications [1,2]. Currently, researchers are mainly focusing on land cover classification (LCC) at fine scales [3][4][5] in complex landscapes such as agricultural [6][7][8][9], surface-mined [10][11][12][13][14], and Mediterranean [15] landscapes by using high spatial resolution satellite imagery. In general, surface-mined areas also contain other landscapes, such as agricultural land, forests, and cities. Thus, together they can be considered a complex surface-mined landscape for LCC. LCC in surface-mined landscapes (LCCSML) can help with the planning and management of mines.
Classification technology based on machine learning algorithms and high spatial resolution imagery has achieved more accurate results for urban environments, precision agriculture, transportation, forestry surveys, and so on. However, LCCSML differs from other fields in three specific characteristics: significant three-dimensional terrain, strong temporal-spatial variability of surface cover, and spectral-spatial homogeneity. These characteristics make it more difficult to obtain highly accurate results for LCCSML [5,14]. As a result, besides a powerful classification algorithm, one of the key solutions is to derive beneficial feature sets from suitable satellite sensors. The importance of single features has been examined in our former study [14]. However, the importance of different feature sets for LCCSML has not been investigated. Some studies attempted to find the most effective features for classification by assessing the importance of single features. For example, some studies utilized a feature selection (FS) procedure as in [14], e.g., landslide identification [16][17][18], LCC in arid regions [19], and object-based image analysis LCC [20]. In addition, others have used different feature combinations, including or excluding specific features for classification, to assess the effects of a single feature, e.g., red-edge band for land-use classification [21]; classifying insect defoliation levels [22]; classification of paddy rice crops [23]; LCC in arid region [19]; and normalized difference vegetation index (NDVI) for classification of tea and hazelnut plantation areas [24].
However, determining effective feature sets is more beneficial than determining effective single features. As a result, some studies also used the feature combination method to evaluate the importance of feature sets. For example, Fassnacht et al. [25] aimed to find out which spectral regions were consistently effective for classifying tree species. Akar and Güngör [24] evaluated the contribution of the gray level co-occurrence matrix and Gabor filter texture sets for detecting tea and hazelnut plantation areas. Aguilar et al. [26] grouped different object feature sets such as spectral information, elevation data, band index data and ratios, textures, and shape geometry into 10 strategies for greenhouse extraction and assessed their importance. Wright and Gallant [27] investigated the addition of image texture and digital elevation model-derived terrain variables to Landsat Thematic Mapper variables for wetland discrimination.
Similarly, for agricultural and surface-mined landscapes, Hurni et al. [7] assessed the inclusion of texture measures for the delineation of shifting cultivation landscapes. Okubo et al. [8] explored the effectiveness of gray level co-occurrence matrix texture measures for land-use/cover classification in a complex agricultural landscape. Maxwell and Warner [11] investigated the use of multi-temporal terrain data for differentiating mine-reclaimed grasslands from non-mining grasslands. Maxwell et al. [12] assessed RapidEye image- and light detection and ranging (LiDAR)-derived variable sets for geographic object-based image analysis classification of mining and mine reclamation. Maxwell et al. [13] examined the incorporation of LiDAR-derived data for mapping mining and mine reclamation areas by comparison with results derived from RapidEye imagery bands alone. However, those studies only examined whether the feature sets were effective. There is little research that grades and ranks the importance of feature sets, which might be more beneficial than that of single features for LCCSML. Only a few studies have analyzed the relative importance of different feature sets, e.g., the comparison of co-occurrence-, Gabor-, and Markov random fields-based textures for sea-ice classification [28]. Similarly, there is little research that grades the relative importance.
As shown in [14], the random forest (RF) algorithm is easy to implement and can significantly outperform support vector machine and artificial neural network algorithms for LCCSML. Furthermore, the RF algorithm is known to be less sensitive to the proposed feature set than other algorithms, such as support vector machine [14,18]. Thus, using RF to rank and grade the importance of feature sets is more reliable than using other algorithms.
The objective of this study is to reveal how different feature sets affect the accuracy of LCCSML and to rank the importance of the feature sets. First, based on our former study [14], the feature sets derived from ZiYuan-3 stereo satellite imagery (ZY-3), the feature subset resulting from an FS procedure, the training data polygons, and the test sample sets were directly obtained. Then, three types of feature set combination schemes were evaluated by combining FS and the RF algorithm.

Test Site and Data Set
In this study, a test site with an area of 109.4 km² located in Wuhan City, China (114°12′33.59″E-114°23′6.89″E and 30°15′38.85″N-30°18′57.48″N) was selected for the analysis (Figure 1) [14]. Surface mining is a prominent feature of the test site. Mine disturbance here has a history of nearly 60 years, and most of the mines are still active, especially the Wulongquan mine. The test site also covers a variety of agricultural activities such as crop cultivation (e.g., rice, cotton, corn, rapeseed, and wheat), greenhouse farming, forestry, and aquaculture [14]. The test site is located in the subtropical humid monsoon climate zone, and the annual average temperature is 15.9-17.9 °C. Rainfall is concentrated in the rainy season of early summer, with an annual average of about 1347.7 mm. Several national road networks pass through the test site (Figure 1). The locations of the 28 field survey samples are shown in Figure 1; they are the same as in our previous research [14].
A ZY-3 stereo satellite image acquired on 20 June 2012 was used in the analysis. ZY-3 is equipped with four cameras, namely, one 2.1 m nadir-looking panchromatic camera, two 3.6 m front- and backward-looking panchromatic cameras, and one 5.8 m nadir-looking multispectral camera. The 3.6 m resolution front- and backward-looking panchromatic data were used to extract relative digital terrain model (DTM) data with 10 m resolution using ENVI (The Environment for Visualizing Images) 5.0 software. Then, 2.1 m resolution panchromatic-multispectral fused data were generated [14].

Employed Feature Sets
Our former study [14] used a total of 106 pixel-based features for LCCSML and assessed the importance of single features. Although hundreds of feature sets have been developed in other studies, this study further examined the importance of the feature sets formed by those 106 features. The features could be divided into eight types (Table 1): (1) four spectral bands (SBs); (2) one vegetation index (VI): NDVI; (3) two principal component bands (PCs); (4) 12 Gaussian low-pass (GLP) filter features; (5) 12 mean (Mean) filter features; (6) 12 standard deviation (StDev) filter features; (7) 60 texture measures (Textures); and (8) three topographic variables (TVs). A detailed description can be found in our former study [14].

In the former study [14], a feature subset with 34 features for LCCSML was obtained from the 106 features by an FS method; this feature subset is used in this study (Table 1). Only the following seven types of features were included in the feature subset: SBs, VI, PCs, GLP, Mean, StDev, and TVs.

Referenced Data
In this study, the LCC schemes developed in [14] were used. The first-level scheme includes the following seven land cover classes: crop land, forest land, water, road, urban and rural residential land, bare land, and surface-mined land. As explained in [14], 20 second-level land cover classes were further defined to improve the classification accuracy of the first-level land covers. The components of these land covers were as follows: (1) crop land: paddy field, vegetation and fruit greenhouse, dry land, and fallow land; (2) forest land: woodland, shrub forest, forest under stress, and nursery and orchard; (3) water: pond and stream, and mine pit lake; (4) road: black road, white road, and gray road; (5) urban and rural residential land: white roof building, red roof building, and blue roof building; (6) bare land: exposed rock/soil; and (7) surface-mined land: opencast stope, mineral processing land, and dumping site.
As in [14], all procedures except the accuracy assessment were based on the second-level land cover classes: training set construction, implementation of feature selection, and classifier training and prediction. The training set obtained in [14] from reference training data polygons by a stratified random sampling method was used in this study. The training set contains 40,000 pixels (Figure 2), with 2000 samples for each of the second-level land covers. The classification results were finally grouped into the seven first-level land classes for the accuracy assessments. Accordingly, the test set with 700 pixels (100 in each of the first-level classes) that was acquired in [14] was used in this study (Figure 2). Specifically, the test set was selected by a stratified random sampling method from the classification result after the training data polygons had been erased. As a result, the test set was independent of the training data polygons.
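The two-level scheme described above can be sketched as a simple lookup. This is only an illustration: the class names are copied from the text, while the dictionary layout and the helper name `group_to_first_level` are assumptions introduced here, not part of the original workflow.

```python
# Second-level classes grouped under the seven first-level classes,
# as listed in the text. The data structure itself is illustrative.
FIRST_LEVEL = {
    "crop land": ["paddy field", "vegetation and fruit greenhouse",
                  "dry land", "fallow land"],
    "forest land": ["woodland", "shrub forest", "forest under stress",
                    "nursery and orchard"],
    "water": ["pond and stream", "mine pit lake"],
    "road": ["black road", "white road", "gray road"],
    "urban and rural residential land": ["white roof building",
                                         "red roof building",
                                         "blue roof building"],
    "bare land": ["exposed rock/soil"],
    "surface-mined land": ["opencast stope", "mineral processing land",
                           "dumping site"],
}

# Reverse lookup: second-level label -> first-level label.
TO_FIRST_LEVEL = {
    second: first
    for first, seconds in FIRST_LEVEL.items()
    for second in seconds
}


def group_to_first_level(second_level_labels):
    """Map predicted second-level labels to first-level labels (hypothetical helper)."""
    return [TO_FIRST_LEVEL[label] for label in second_level_labels]


# Sample counts stated in the text: 2000 training pixels per second-level
# class and 100 test pixels per first-level class.
training_set_size = 2000 * len(TO_FIRST_LEVEL)
test_set_size = 100 * len(FIRST_LEVEL)
```

With 20 second-level classes and seven first-level classes, the two sizes come out to the 40,000 training pixels and 700 test pixels reported above.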

Feature Set Combinations and Classification Procedure
Based on the employed feature sets and the training and test sets, three feature combinations were constructed by two feature set combination schemes with FS (named 1, 2, and 3, as listed in Figure 3) to examine the influence of feature sets on LCCSML. A flow chart of the procedure is shown in Figure 3. The combinations were as follows: (1) Combination 1: adding different types of feature sets to SBs; (2) Combination 2: excluding different types of feature sets from the feature subset (the feature subset involves only some features of the six feature sets (Table 1); the abbreviations of the six feature sets are used here for convenience of expression, and the later analysis will consider this issue), with "−" representing the exclusion of certain feature sets from the feature subset, used similarly hereinafter; and (3) Combination 3: performing FS on SBs + some types of feature sets that have numerous features, i.e., SBs + GLP + FS, SBs + Mean + FS, SBs + StDev + FS, and SBs + Textures + FS. Combinations 1 and 2 were used to determine the importance of feature sets by analyzing the accuracy improvements and losses resulting from their addition or exclusion. Combination 3 was used to further analyze the effects of those four feature sets on LCCSML when using FS. As in [14], the varSelRF package [29] in the R platform [30] was utilized with two parameters, namely, 2000 trees for the first forest to produce a feature rank and 500 trees for the subsequent forests to iteratively eliminate the least important features. Finally, the feature combination with the lowest out-of-bag error was considered the optimal result of FS.
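As a rough illustration, the three combinations can be enumerated from the feature set sizes given in the section on employed feature sets. The variable names below are assumptions made for this sketch, and an ASCII hyphen stands in for the "−" sign used in the text.

```python
# Number of features per feature set, as stated in the text (106 in total).
SET_SIZES = {
    "SBs": 4, "VI": 1, "PCs": 2, "GLP": 12,
    "Mean": 12, "StDev": 12, "Textures": 60, "TVs": 3,
}
assert sum(SET_SIZES.values()) == 106

# Feature sets added to the spectral bands in Combination 1.
ADDED_SETS = [name for name in SET_SIZES if name != "SBs"]
# Feature sets present in the selected feature subset besides SBs
# (Textures is not part of the subset according to the text).
SUBSET_SETS = ["VI", "PCs", "GLP", "Mean", "StDev", "TVs"]
# Feature sets with numerous features, which go through FS in Combination 3.
LARGE_SETS = ["GLP", "Mean", "StDev", "Textures"]

combination_1 = [f"SBs + {s}" for s in ADDED_SETS]
combination_2 = [f"feature subset - {s}" for s in SUBSET_SETS]
combination_3 = [f"SBs + {s} + FS" for s in LARGE_SETS]

# Input dimensionality of each Combination 1 model before any FS.
input_sizes_1 = {
    f"SBs + {s}": SET_SIZES["SBs"] + SET_SIZES[s] for s in ADDED_SETS
}
```

The sizes make the motivation for Combination 3 visible: SBs + TVs has 7 input features, whereas SBs + Textures has 64, so a direct comparison of the two mixes the effect of the feature set with the effect of its size.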
Considering that a feature subset often depends on the data, this study also used the 20 training sets obtained in [14] to produce a final robust feature subset. The random feature selection allows the RF to select the most important features while preserving diversity among the trees. In particular, based on the 20 preliminary feature subsets, this study first ranked the features by selected times, mean ranks, and standard deviations of ranks as in [14], and then explored different thresholds of selected times to determine the final feature subset.
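The aggregation over repeated FS runs can be sketched as below. This is a minimal illustration, not the procedure of [14]: the sort order (selected times first, then mean rank, then rank standard deviation), the use of the population standard deviation, the function names, and the toy feature names are all assumptions.

```python
from statistics import mean, pstdev


def summarize_runs(runs):
    """Summarize repeated FS runs.

    runs: list of ranked feature lists, one per run (best feature first).
    Returns rows (feature, selected_times, mean_rank, rank_std) sorted by
    selected times (descending), then mean rank and rank std (ascending).
    """
    ranks = {}
    for ranked in runs:
        for position, feature in enumerate(ranked, start=1):
            ranks.setdefault(feature, []).append(position)
    rows = [(f, len(r), mean(r), pstdev(r)) for f, r in ranks.items()]
    return sorted(rows, key=lambda row: (-row[1], row[2], row[3]))


def final_subset(runs, min_selected_times):
    """Keep features selected in at least `min_selected_times` runs."""
    return [f for f, times, _, _ in summarize_runs(runs)
            if times >= min_selected_times]


# Hypothetical toy input: four runs instead of the 20 used in the study.
runs = [
    ["DTM", "NIR", "Red"],
    ["DTM", "Red", "NIR", "Blue"],
    ["NIR", "DTM"],
    ["DTM", "NIR", "Green"],
]
robust = final_subset(runs, min_selected_times=4)   # selected in every run
relaxed = final_subset(runs, min_selected_times=2)  # selected in half of the runs
```

Raising the threshold shrinks the subset to the features that are stable across training sets, which is the trade-off explored when different thresholds of selected times are compared.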

Classification Model Construction and Accuracy Assessment
The RF algorithm [31] is a non-parametric ensemble method based on decision trees. Readers may refer to the formulas in [31] and Figure 1 in [32] for further details. RF has been reported to have promising classification capacity in some remote sensing applications [32][33][34][35]. Considering the randomization principle of the RF algorithm, 10 random training sets were used. The training and parameter optimization of the RF algorithm were conducted in the R platform [30] by using the randomForest package [36] and the e1071 package [37]. The default value of 500 trees was used for the parameter ntree, and the parameter mtry was optimized as suggested by [32,38]. Specifically, during RF model construction, the function "best.tune" of the e1071 package called the function "randomForest" of the randomForest package with a 10-fold cross-validation method. The mtry value that resulted in the highest average overall accuracy during cross-validation was taken as the optimized value, with mtry ranging from 1 to the number of features.
Accuracy assessment was conducted by using the test set collected in [14] (Section 2.3). The average F1-measure [39] and overall accuracy were calculated for each classification. The F1-measure is the harmonic mean of precision and recall; an F1 score reaches its best value at 1 (perfect precision and recall) and its worst at 0. The percentage deviations [21] of the F1-measure and overall accuracy were utilized to assess the differences between two results derived from different feature combinations with FS. Additionally, the McNemar test [40] was used to determine whether the difference between two results was statistically significant; for each model, the result whose OA was closest to the average OA was used in the test.
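The measures named above can be sketched in a few lines. The code below is a hedged illustration rather than the authors' implementation: the function names are hypothetical, the toy labels are invented, and the McNemar statistic is written without a continuity correction, which is an assumption because the text does not state which form was used.

```python
def overall_accuracy(y_true, y_pred):
    """Fraction of test samples whose predicted class matches the reference."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_per_class(y_true, y_pred):
    """Per-class F1: harmonic mean of precision and recall."""
    scores = {}
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[c] = 2 * precision * recall / total if total else 0.0
    return scores


def mcnemar_chi_square(y_true, pred_a, pred_b):
    """McNemar chi-square from the discordant pairs (no continuity correction; assumption)."""
    only_a = sum(a == t and b != t for t, a, b in zip(y_true, pred_a, pred_b))
    only_b = sum(a != t and b == t for t, a, b in zip(y_true, pred_a, pred_b))
    if only_a + only_b == 0:
        return 0.0
    return (only_a - only_b) ** 2 / (only_a + only_b)


# Invented toy labels for two classifications of the same test samples.
y_true = ["water", "water", "road", "road", "bare", "bare"]
pred_a = ["water", "water", "road", "bare", "bare", "bare"]
pred_b = ["water", "road", "road", "bare", "road", "bare"]

oa_a = overall_accuracy(y_true, pred_a)
f1_a = f1_per_class(y_true, pred_a)
chi2 = mcnemar_chi_square(y_true, pred_a, pred_b)
significant = chi2 > 3.84  # critical value for one degree of freedom, p < 0.05
```

Only the samples on which the two classifications disagree in correctness enter the McNemar statistic, so two models with similar overall accuracy can still differ significantly if their errors fall on different samples.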

Results
Experiment 1, using feature Combination 1, aimed to analyze the effects of adding different types of feature sets for LCCSML in order to assess the importance of the feature sets (Section 3.2). The purpose of Experiment 2, using feature Combination 2, was to analyze the effects of excluding different types of feature sets for LCCSML (Section 3.3). Because the feature sets differed in size (VI, PCs, and TVs had only 1-3 features, whereas the others had 12 features (GLP, Mean, StDev) or 60 features (Textures)), it is possible that some feature sets resulted in higher accuracy improvements only because of the larger number of features within the set. As a result, Experiment 3, based on feature Combination 3, further assessed the importance of feature sets by using an FS method to obtain commensurate feature set sizes (Section 3.4). The results of the FS procedures are shown in Section 3.1.

Feature Selection Results for SBs Following the Addition of Some Types of Feature Sets
For LCCSML using SBs and some types of feature sets, i.e., SBs + GLP, SBs + Mean, SBs + StDev, and SBs + Textures, FS was performed, and the preliminary results are shown in Tables 3-6. Features that were never selected in the feature selection process are not listed in these tables. The former study [14] used a selected time threshold of 16 out of 20 runs to pick out the final feature subsets, i.e., features selected in at least 80% of the runs. In view of the results in Tables 3-6, the features with a selected time of 20 were picked out to form the final feature subsets (bold in the tables). As a result, there were 6, 6, 8, and 6 features in the final feature subsets, respectively. It could be concluded that the sizes of SBs + GLP + FS, SBs + Mean + FS, SBs + StDev + FS, and SBs + Textures + FS were commensurate with those of SBs, SBs + VI, SBs + PCs, and SBs + TVs. In general, using the results of feature Combination 3 to assess the effects of those feature sets could provide more reliable information (for details, see Section 3.4).
Table 3. Feature selection results for the spectral bands with the Gaussian low-pass filters (GLP) added. _b/g/r/n_3/5/7: the filter features derived from the blue, green, red, and near-infrared bands using kernel sizes of 3 × 3, 5 × 5, and 7 × 7 pixels. Band: the spectral band. The bold features were selected as members of the final feature subset.
The average F1-measure and overall accuracy obtained by using SBs and different types of feature sets are shown in Table 7, in which the best results are displayed in bold. The average overall accuracies ranged from 54.8 ± 1.3% (SBs + PCs) to 64.8 ± 1.4% (SBs + TVs). The very low accuracies could be attributed to the difficulty of LCCSML and the insufficiency of effective information provided by only two types of features.

Table 7. The F1-measure and overall accuracy (OA) (%) for land cover classification in complex surface-mined landscapes using spectral bands and different types of feature sets: 1: spectral bands; 2: vegetation index; 3: PC bands; 4: the Gaussian low-pass filters; 5: the mean filters; 6: the standard deviation filters; 7: texture measures; and 8: topographic variables. The bold values represent the best F1-measures and OA values. The accuracies with standard deviations were averaged over 10 runs.
Compared to SBs, the percentage deviations of the average overall accuracies after adding the different types of feature sets were −0.3%, −1.4%, 3.8%, 9.8%, 14.0%, 10.8%, and 16.5% (Table 7). Table 7 shows that the addition of VI and PCs decreased the overall accuracies. However, the addition of GLP, Mean, StDev, Textures, and TVs contributed to the classifications, and their descending order of importance was TVs, StDev, Textures, Mean, and GLP. Because VI and PCs were linearly computed from spectral bands, their addition introduced correlated, even redundant, information, which might have led to accuracy losses. Although many studies indicated that the addition of VI contributed to the classification [19,41], some others supported the conclusion of this study. For example, Adelabu et al. [22] reported that, when using all the bands of RapidEye imagery, adding NDVI decreased the classification accuracy. Conversely, the other feature sets were nonlinear features derived from spectral bands, and these data may have provided some heterogeneous information that improved the classifications.
With regard to the F1-measure of each class, the models with only SBs were better than those with SBs + VI and SBs + PCs, with the exception of water and urban and rural residential land. Moreover, the models with other feature sets added did not follow the trends of the overall accuracies consistently. For example, with respect to crop land, the descending order of F1-measures was the models with the additions of TVs, StDev, Mean, GLP, and Textures. There were also some exceptions in which the F1-measures for a class decreased: the models with SBs and StDev for water, and the models with SBs and Textures for water. Overall, almost all of the models achieved F1-measures over 80% for water and surface-mined land. However, road and urban and rural residential land achieved F1-measures of only 20-50%, and those of the other types were 40-70%.
Moreover, it could be concluded that some feature sets were specific to some classes and that the feature sets were complementary. For example, the use of Textures led to the best F1-measure values for urban and rural residential land, whereas the use of TVs resulted in the best F1-measure values for crop land, forest land, and water. This might be because Textures could better discriminate urban and rural residential land from road by better characterizing their shape textures, even if their spectral profiles are similar. The use of the DTM allowed the elevation of crop land, forest land, and water to be better characterized.

McNemar Test
The McNemar test was executed on each pair of classifications derived from SBs with different types of feature sets added. The chi-square values are shown in Table 8, in which the shaded values larger than 3.84 indicate statistically significant differences at the 95% confidence level (p < 0.05). The results revealed that: (1) although the addition of VI and PCs had negative effects, they were not statistically significant, i.e., the chi-square values were 0.00 and 0.43, respectively, when compared with the models with only SBs; (2) four of the other five feature sets showed significant positive effects, i.e., the addition of Mean (8.70), StDev (12.55), Textures (7.67), and TVs (20.90) compared to the models with only SBs; (3) although there were significant differences between the models that added GLP, Mean, StDev, Textures, and TVs and those that added VI and PCs, these differences meant little because of the negative effects of VI and PCs; (4) Mean, StDev, and TVs significantly outperformed GLP, with chi-square values of 4.10, 6.90, and 14.04, respectively; and (5) there were no significant differences among the models that added Mean, StDev, Textures, and TVs.
Table 8. Chi-square values of the McNemar test for land cover classification in complex surface-mined landscapes using spectral bands and different types of feature sets: 1: spectral bands; 2: vegetation index; 3: PC bands; 4: the Gaussian low-pass filters; 5: the mean filters; 6: the standard deviation filters; 7: texture measures; and 8: topographic variables. The shaded values larger than 3.84 indicate statistically significant differences at the 95% confidence level (p < 0.05).
The average F1-measure and overall accuracy obtained by excluding different types of feature sets from the feature subset are shown in Table 9. The overall accuracies ranged from 77.6% (feature subset) to 66.2 ± 1.0% (feature subset − StDev).
Table 9. The F1-measure and overall accuracy (OA) (%) for land cover classification in surface-mined landscapes that involved the exclusion of different types of feature sets from the feature subset (9): 1: spectral bands; 2: vegetation index; 3: PC bands; 4: the Gaussian low-pass filters; 5: the mean filters; 6: the standard deviation filters; and 8: topographic variables. The sign "−" represents the exclusion of certain feature sets from the feature subset. The bold values represent the best F1-measures and OA values. The accuracies with standard deviations were averaged over 10 runs.
Compared to the feature subset, the percentage deviations of the overall accuracies after excluding the different types of feature sets were −4.6%, −4.8%, −3.0%, −7.5%, −14.6%, and −11.7% (Table 9). This means that excluding VI, PCs, GLP, Mean, StDev, and TVs decreased the classification accuracies, and their descending order of importance was StDev, TVs, Mean, PCs, VI, and GLP.
In regard to the F1-measure of each class, the models that excluded different types of feature sets from the feature subset were almost always worse than the model with the feature subset, with some exceptions: the model with feature subset − VI for bare land and surface-mined land, and the model with feature subset − GLP for urban and rural residential land, bare land, and surface-mined land. Overall, all of the models achieved F1-measures over 80% for water and surface-mined land. However, road and urban and rural residential land achieved F1-measures of only approximately 40-70% because they were spectrally similar, and those of the other types were approximately 60-80%.
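The percentage deviation used in these comparisons can be sketched as a relative change with respect to a reference accuracy. The formula below is an assumption about the definition in [21], and the function name is hypothetical; the two input values are the rounded overall accuracies quoted in the text for the feature subset and for feature subset − StDev.

```python
def percentage_deviation(value, reference):
    """Relative change of `value` with respect to `reference`, in percent (assumed definition)."""
    return (value - reference) / reference * 100.0


# Rounded overall accuracies quoted in the text.
oa_feature_subset = 77.6
oa_without_stdev = 66.2

deviation = percentage_deviation(oa_without_stdev, oa_feature_subset)
```

With these rounded inputs the result is roughly −14.7%, close to the −14.6% reported for StDev; the small gap presumably reflects rounding of the published averages.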

McNemar Test
McNemar test was implemented on each pair of classifications derived from the models with feature subset and feature subset excluding different types of feature sets.The chi-square values are shown in Table 10, in which the shaded ones larger than 3.84 indicated statistically significant differences at the confidence level of 95% (p < 0.05).The results revealed that: ( 1  1 : spectral bands; 2 : vegetation index; 3 : PC bands; 4 : the Gaussian low-pass filters; 5 : the mean filters; 6 : the standard deviation filters; and 8 : topographic variables.The sign "−" represents exclusion of certain feature sets from the feature subset.The shaded ones that larger than 3.84 indicated statistically significant differences at the confidence level of 95% (p < 0.05).The average F1-measure and overall accuracy calculated by the results derived from the models that added some types of feature sets to SBs with FS are shown in Table 11, in which the best results are displayed in bold.The overall accuracies ranged from 58.1 ± 1.1% (SBs + GLP + FS) to 64.5 ± 0.9% (SBs + StDev + FS).
With regard to the F1-measure of each class, SBs + GLP, SBs + Mean, SBs + StDev, and SBs + Textures were compared with SBs + GLP + FS, SBs + Mean + FS, SBs + StDev + FS, and SBs + Textures + FS, respectively.The models with FS were better than those with SBs + GLP and SBs + StDev, though some exceptions existed: the models with SBs + GLP + FS for water, road, bare land, and surface-mined land; the models with SBs + StDev + FS for water.The models with FS were worse than that with SBs + Mean and SBs + Textures, though exceptions existed: the models with SBs + Mean + FS for forest land and bare land; the models with SBs + Textures + FS for crop land, forest land, water, and surface-mined land.Overall, the four models with FS almost achieved over 80% F1-measures for water and surface-mined land.Similarly, road and urban and rural residential land achieved only 20-50% F1-measures, and those of the other types were 50-70%.
Table 11.The F1-measure and overall accuracy (OA) (%) for land cover classification in complex surface-mined landscapes using SBs and some types of feature sets with feature selection (FS): 1 : spectral bands; 4 : the Gaussian low-pass filters; 5 : the mean filters; 6 : the standard deviation filters; and 7 : texture measures.The bold values represent the best F1-measures and OA values.The accuracies with standard deviation were averaged on 10 runs.Besides, each land cover class also showed different percentage deviations in response to the addition of FS.The percentage deviation approximately ranged from −4% to 4%, with only three exceptions: the model with SBs + Mean + FS for urban and rural residential land (−16.7%); the model with SBs + Textures + FS for urban and rural residential land (−13.3%) and bare land (−6.9%).
Comparing the average OA values of SBs, SBs + GLP + FS, SBs + Mean + FS, SBs + StDev + FS, SBs + Textures + FS, and SBs + TVs, which had commensurate feature set sizes, it could be concluded that the descending order of importance was TVs, StDev, Textures, Mean, and GLP. This conclusion was more reliable and the same as that drawn in Section 3.2.

McNemar Test
A McNemar test was performed on each pair of classifications derived from SBs with the addition of some types of feature sets with FS. The chi-square values after FS are shown in Table 12. The results showed that FS did not significantly improve or reduce the classification accuracies. In addition, the chi-square values for the above-mentioned models with commensurate feature set sizes are shown in Table 13, in which the shaded values larger than 3.84 also indicate statistically significant differences at the 95% confidence level (p < 0.05). The results revealed that: (1) SBs + Mean + FS, SBs + StDev + FS, and SBs + Textures + FS significantly outperformed SBs; (2) SBs + StDev + FS outperformed SBs + GLP + FS and SBs + Mean + FS; and (3) SBs + TVs outperformed SBs + GLP + FS and SBs + Mean + FS.
For assessing the importance of feature sets, the following three grades were defined (Table 14): (1) important, i.e., the feature sets could exert statistically significant effects on LCCSML; (2) positive, i.e., the feature sets could provide effective information for LCCSML but did not result in significant effects; and (3) useless, i.e., the feature sets had little effect on LCCSML. Specifically, when based on SBs, whether the addition of different types of feature sets achieved significant improvements or resulted in no effects should be examined. Similarly, when based on the feature subset, whether the exclusion of different types of feature sets caused significant decreases or resulted in no effects should be investigated.
For the relative importance between different feature sets, the following two types were defined (Table 15): (1) significantly outperformed, i.e., one feature set statistically significantly outperformed another feature set; and (2) with no difference, i.e., one feature set resulted in a higher accuracy improvement than another feature set but with no statistically significant difference. Specifically, whether SBs + one feature set significantly outperformed SBs + another feature set and whether feature subset − another feature set significantly outperformed feature subset − one feature set should be examined.
As shown in the feature subset in which the importance of single features was indicated [14], VI and PCs could provide effective information for the LCCSML. Specifically, VI had very high importance, second only to that of the DTM. The first PC band had moderate importance, inferior to some features from TVs, VI, Mean, and GLP, and the importance of the second PC band was very low. However, the results obtained in Section 3.2 showed that the addition of VI and PCs slightly decreased the classification accuracy compared to the use of only SBs because it introduced relevant and even redundant information. Thus, the results in Section 3.2 did not reflect the effects of adding VI and PCs to SBs, and the conclusion drawn there was not used to determine the importance grades of VI and PCs. In contrast, the results in Section 3.3 showed that the exclusion of VI and PCs resulted in significant accuracy loss. As a result, it could be concluded that VI and PCs were important (Figure 4).
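The three importance grades defined above (Table 14) can be read as a simple decision rule, sketched below in Python under stated assumptions. The function names and the sign convention are hypothetical: `accuracy_gain` is taken as the accuracy gained by adding a feature set to SBs, or the accuracy lost by excluding it from the feature subset. Treating "no gain" as "useless" and combining several experiments by taking the highest grade are simplifying assumptions of this sketch, not rules stated in this study.

```python
from typing import Sequence

CRITICAL_CHI2 = 3.84  # chi-square threshold at the 95% confidence level (p < 0.05)
GRADE_ORDER = ["useless", "positive", "important"]  # ascending order of importance


def importance_grade(accuracy_gain: float, chi2: float,
                     critical: float = CRITICAL_CHI2) -> str:
    """Grade one experiment for one feature set.

    accuracy_gain: accuracy change attributable to the feature set
                   (assumed sign convention: positive means it helped).
    chi2:          McNemar chi-square value of the paired comparison.
    """
    if accuracy_gain > 0 and chi2 > critical:
        return "important"  # statistically significant effect
    if accuracy_gain > 0:
        return "positive"   # helpful, but not statistically significant
    return "useless"        # assumption: no gain is treated as little effect


def overall_grade(grades: Sequence[str]) -> str:
    """Combine grades from several experiments (assumption: keep the highest)."""
    return max(grades, key=GRADE_ORDER.index)


if __name__ == "__main__":
    print(importance_grade(1.5, 6.90))
    print(importance_grade(0.4, 0.83))
    print(overall_grade(["useless", "important"]))
```

In practice, the judgment in this study also weighs which experiment is more trustworthy for a given feature set, which a fixed rule like this cannot capture.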

Importance of GLP, Mean, and StDev
For GLP, Mean, and StDev, some of their features were selected as members of the feature subset, which indicated that those feature sets could provide effective information for the LCCSML. The results obtained in Section 3.2 showed that the addition of GLP, Mean, and StDev contributed to the classifications compared to the use of only SBs, and the latter two feature sets resulted in significant improvements. Similarly, the results in Section 3.3 revealed that the exclusion of GLP, Mean, and StDev from the feature subset decreased the classification accuracies, and the latter two feature sets resulted in significant reductions. In addition, Section 3.4 indicated that SBs + Mean + FS and SBs + StDev + FS significantly outperformed SBs, and there was only a slight difference between SBs + GLP + FS and SBs. As a result, it could be concluded that Mean and StDev were important and GLP was positive (Figure 4) for the LCCSML.

Importance of Textures and TVs
For Textures and TVs, only some features from TVs were selected in the feature subset (i.e., TVs could provide effective information for the LCCSML and the effectiveness of Textures was not clear [14]). However, the results presented in Section 3.2 indicated that the addition of Textures and TVs significantly contributed to the classifications compared to the use of only SBs. Similarly, the results in Section 3.3 showed that the exclusion of TVs from the feature subset resulted in significant accuracy loss. In particular, SBs + Textures + FS significantly outperformed SBs (Section 3.4). In other words, it could be concluded that Textures and TVs were important for the LCCSML (Figure 4).

Relative Importance between Different Feature Sets

Relative Importance of VI and PCs
For the relative importance of VI and PCs, the former study [14] concluded, based on the importance of single features, that the features from VI outperformed those from PCs. In Section 3.2, the addition of VI and PCs to SBs decreased the classification accuracy as a result of the introduction of relevant, even redundant, information. Therefore, that section could not provide effective information for judging the relative importance of VI and PCs. The results presented in Section 3.3 indicated that PCs was more effective than VI but with no significant difference, i.e., the chi-square value between feature subset − VI and feature subset − PCs was 0.83. In other words, it could be concluded that there was no difference between VI and PCs (Figure 4).
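The pairwise comparison used above can be sketched in Python as follows. The sketch computes the McNemar chi-square statistic from the test samples on which exactly one of two classifications is correct and compares it with the 3.84 threshold used in this study. The function names are hypothetical, and the continuity correction is exposed as an option because this section does not state whether it was applied.

```python
from typing import Hashable, Sequence

CRITICAL_CHI2 = 3.84  # threshold at the 95% confidence level (p < 0.05), 1 degree of freedom


def mcnemar_chi2(y_true: Sequence[Hashable],
                 pred_a: Sequence[Hashable],
                 pred_b: Sequence[Hashable],
                 continuity: bool = True) -> float:
    """McNemar chi-square for two classifications of the same test samples."""
    if not (len(y_true) == len(pred_a) == len(pred_b)):
        raise ValueError("all label sequences must have the same length")
    # b: samples correct in A only; c: samples correct in B only
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    if b + c == 0:
        return 0.0  # the two classifications never disagree on correctness
    diff = abs(b - c)
    if continuity:
        diff = max(diff - 1, 0)  # continuity correction (an assumption of this sketch)
    return diff ** 2 / (b + c)


def is_significant(chi2: float, critical: float = CRITICAL_CHI2) -> bool:
    """True if the difference is significant at the chosen threshold."""
    return chi2 > critical


if __name__ == "__main__":
    truth = [1] * 12
    model_a = [1] * 10 + [0] * 2   # correct on the first 10 samples only
    model_b = [0] * 10 + [1] * 2   # correct on the last 2 samples only
    chi2 = mcnemar_chi2(truth, model_a, model_b)
    print(round(chi2, 2), is_significant(chi2))
    print(is_significant(0.83))    # the VI versus PCs value reported above
```

With the reported value of 0.83, the check returns no significant difference, which matches the conclusion drawn for VI and PCs.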

Relative Importance of GLP, Mean, and StDev
For the relative importance of GLP, Mean, and StDev, the former study [14] concluded, based on the importance of single features, that the features from Mean outperformed those from GLP, and the features from StDev achieved lower importance. However, the results presented in Section 3.2 revealed that StDev had the greatest importance, followed by Mean and GLP (i.e., the overall accuracy of SBs + StDev > that of SBs + Mean > that of SBs + GLP). In particular, Mean and StDev significantly outperformed GLP (with chi-square values of 4.10 and 6.90, respectively), and there was no difference between StDev and Mean (with a chi-square value of 1.31). Similarly, the results presented in Section 3.3 indicated that StDev showed the greatest importance, followed by Mean and GLP (i.e., the overall accuracy of feature subset − StDev < that of feature subset − Mean < that of feature subset − GLP). However, each pair of them was associated with significant differences. Section 3.4 indicated that StDev outperformed GLP and Mean, and there was no difference between Mean and GLP.
The conclusions drawn above therefore appeared inconsistent. The importance rank derived from single feature importance (i.e., features from Mean > features from GLP > features from StDev) differed from the rank derived from the experiments with additions of feature sets to SBs (Section 3.2), exclusions of feature sets from the feature subset (Section 3.3), and additions of some feature sets to SBs with FS (Section 3.4) (StDev > Mean > GLP). This inconsistency could be attributed to the fact that the importance of single features did not represent the importance of feature sets. For GLP, Mean, and StDev, only some of their features were selected as members of the feature subset according to importance metrics, and the others were deemed unimportant. Moreover, the importance ranks of the selected features from those three feature sets were interleaved. For example, some selected features from Mean were less important than some of those from GLP and StDev. The conclusion drawn from single feature importance was just a general judgment. Therefore, it could be concluded that the descending order of importance was as follows: StDev, Mean, and GLP.
As for the statistical significance between them, the results derived in Sections 3.2-3.4 were inconsistent on whether there was a significant difference between StDev and Mean, and between Mean and GLP. Considering that SBs and the feature subset were the basic feature sets used for comparisons in Sections 3.2-3.4, that the latter was more effective than the former for the LCCSML, and especially that StDev in Section 3.3 involved just four features (Table 1), the conclusions from Section 3.3 were adopted when there were small conflicts. In other words, it could be concluded that StDev significantly outperformed Mean and GLP, and Mean also significantly outperformed GLP (Figure 4).
significant differences were detected in Section 3.3. Accordingly, TVs significantly outperformed VI and PCs (Figure 4). Because Textures slightly outperformed Mean and GLP, it could be inferred that Textures slightly outperformed VI and PCs.
These conclusions can provide useful information for LCC in various landscapes, for example, agricultural settings, the Mediterranean, the urban fringe, and upland forest. In the future, the conclusions will be systematically examined for fine-scale LCCSML using object-based image analysis and by integrating more features and feature sets, such as hydrology and landscape position information derived from topographic variables [42,43], filter feature sets with larger kernels, and texture sets derived with different methods.

Conclusions
LCCSML is challenging as a result of significant three-dimensional terrain, strong temporal-spatial variability of surface cover, and spectral-spatial homogeneity. One of the key solutions is to derive beneficial feature sets. The importance of single features was examined in our former study [14]. However, how to determine effective feature sets as the input dataset had not been investigated. The present study aimed to reveal how different feature sets affect the accuracy of the LCCSML and to assess the importance of feature sets. The feature sets derived from ZY-3 stereo satellite imagery, a feature subset, training data polygons, and test sample sets were first obtained; then, three feature set combination schemes were evaluated by combining FS and the RF algorithm. In general, the study assessed different feature sets' effects on LCC in complex surface-mined landscapes. The following conclusions were drawn.
(1) The importance of feature sets was graded. VI, PCs, Mean, StDev, Textures, and TVs were important, i.e., their addition significantly contributed to the accuracies of LCCSML, while GLP was positive, i.e., adding it was effective but did not achieve statistically significant improvement.
(2) The importance of feature sets was ranked and their relative importance was graded. The descending order of the importance of feature sets was TVs, StDev, Textures, Mean, PCs, VI, and GLP. TVs and StDev both significantly outperformed VI, PCs, GLP, and Mean; Mean outperformed GLP; and all other pairs of feature sets had no difference.

Figure 1. Location of the study area and field survey samples, and ZiYuan-3 fused true color image (R, Red; G, Green; B, Blue) [14]. Jing-Zhu expressway: connecting Beijing and Zhuhai; G107: national highway 107 of China; Hu-Rong expressway: connecting Shanghai and Chengdu; Jing-Guang railway: connecting Beijing and Guangzhou; Wu-Xian inter-city railway: connecting Wuhan city and Xianning of Hubei Province, China; Wu-Guang high-speed railway: connecting Wuhan and Guangzhou.

Figure 2. Location of training and test samples and the red band of ZiYuan-3 fused image.

Figure 3. Flow chart of the feature set combinations and classification procedure.

Figure 4. The importance grade and relative importance of feature sets.

Table 2. Feature combinations used in this study: 1: spectral bands; 2: vegetation index; 3: PC bands; 4: the Gaussian low-pass filters; 5: the mean filters; 6: the standard deviation filters; 7: texture measures; 8: topographic variables; and 9: feature subset. FS: feature selection procedure; -: the number was indefinite. The column titled "No." represents the number of different variables. The sign "−" represents exclusion of certain feature sets from the feature subset.

Table 4. Feature selection results for spectral bands adding the mean filters (Mean). _b/g/r/n_3/5/7: the filter features derived from the blue, green, red, and near-infrared bands using the kernel sizes 3 × 3, 5 × 5, and 7 × 7 pixels. The bold features were selected as members of the final feature subset.

Table 5. Feature selection results for spectral bands adding the standard deviation filters (StDev). Band: the spectral band. _b/g/r/n_5/7: the filter features derived from the blue, green, red, and near-infrared bands using the kernel sizes 5 × 5 and 7 × 7 pixels. The bold features were selected as members of the final feature subset.

Table 6. Feature selection results for spectral bands adding the texture measures. Band: the spectral band. _b/g/r/n_7: the texture features derived from the blue, green, red, and near-infrared bands using the kernel size 7 × 7 pixels. Con: contrast texture. Hom: homogeneity texture. The bold features were selected as members of the final feature subset.

Table 8. The chi-square values of McNemar tests

Analysis of the Exclusion of Different Types of Feature Sets from Feature Subset for LCCSML

3.3.1. Overall Accuracy, F1-Measure, and Percentage Deviation

Table 10. The chi-square values of McNemar tests for land cover classification in surface-mined landscapes that involved the exclusion of different types of feature sets from the feature subset (9).
SBs + Mean + FS, SBs + StDev + FS, and SBs + Textures + FS significantly outperformed SBs; (2) SBs + StDev + FS outperformed SBs + GLP + FS and SBs + Mean + FS; and (3) SBs + TVs outperformed SBs + GLP + FS and SBs + Mean + FS.

Table 12. The chi-square values of McNemar tests for land cover classification in complex surface-mined landscapes using spectral bands and some types of feature sets with feature selection: 1: spectral bands; 4: the Gaussian low-pass filters; 5: the mean filters; 6: the standard deviation filters; and 7: texture measures.

Table 13. The chi-square values of McNemar tests for land cover classification in complex surface-mined landscapes using the superior models that involved the addition of some types of feature sets to spectral bands combined with feature selection: 1: spectral bands; 4: the Gaussian low-pass filters; 5: the mean filters; 6: the standard deviation filters; and 7: texture measures. FS80: feature selection with a threshold of 16. FS60: feature selection with a threshold of 12. The shaded values larger than 3.84 indicate statistically significant differences at the 95% confidence level (p < 0.05).

Table 14. Descriptions of the three defined importance grades of feature sets in this study.

Table 15. Descriptions of the two defined relative importance grades between different feature sets in this study.