Optimum Feature and Classifier Selection for Accurate Urban Land Use/Cover Mapping from Very High Resolution Satellite Imagery

: Feature selection to reduce redundancies for efficient classification is necessary but usually time consuming and challenging. This paper proposed a comprehensive analysis for optimum feature selection and the most efficient classifier for accurate urban area mapping. To this end, 136 multiscale textural features alongside a panchromatic band were initially extracted from WorldView-2, GeoEye-3, and QuickBird satellite images. The wrapper-based and filter-based feature selection were implemented to optimally select the best ten percent of the primary features from the initial feature set. Then, machine leaning algorithms such as artificial neural network (ANN), support vector machine (SVM), and random forest (RF) classifiers were utilized to evaluate the efficiency of these selected features and select the most efficient classifier. The achieved optimum feature set was validated using two other images of WorldView-3 and Pleiades. The experiments revealed that RF, particle swarm optimization (PSO), and neighborhood component analysis (NCA) resulted in the most efficient classifier and wrapper-based and filter-based methods, respec-tively. While ANN and SVM’s process time depended on the number of input features, RF was significantly resistant to the criterion. Dissimilarity, contrast, and correlation features played the greatest contributing role in the classification performance among the textural features used in this study. These trials showed that the feature number could be reduced optimally to 14 from 137; these optimally selected features, alongside the RF classifier, can produce an F1-measure of about 0.90 for different images from five very high resolution satellite sensors for various urban geographical landscapes. These results successfully achieve our goal of assisting users by eliminating the task of optimal feature selection and classifier, thereby increasing the efficiency of urban land use/cover classification from very high resolution images. This optimal feature selection can also significantly reduce the high computational load of the feature-engineering phase in the machine and deep learning approaches.

As a solution to the issues mentioned above, researchers have paid lots of attention to feature selection (FS) as an approach to selecting the most relevant features for predefined classes [21,22]. The FS methods are categorized into three categories, including filter, wrapper, and hybrid algorithms. Filter approaches utilize statistical properties independent of the classification performance to eliminate less-significant input in feature sets [23][24][25]. They are computationally inexpensive and fast [26,27]. On the other hand, wrapper methods consider the relation between learning algorithms and training data [28,29]. This is why they outperform filter models in terms of accuracy [30][31][32]. However, they are slower and computationally more expensive [27,30]. The hybrid method uses two algorithms by various evaluation standards in the different search stages [26]. Naeini, et al. [33] investigated several wrapper-based algorithms, such as particle swarm optimization (PSO), genetic algorithm (GA), artificial bee colony (ABC), and honey-bee mating (HBM), to select the best spectral and textural parameters for very high spatial resolution data classification. Then, the selected features were input to various classifiers resulting in better performance for the PSO-selected feature dataset. Hamedianfar, et al. [34] carried out feature selection through consolidation of PSO and artificial neural network (ANN) applied to WorldView-2 data for land use/cover mapping with extreme gradient boosting (XGBoost). They reached an overall accuracy of above 89%. Concerning the filter models surveys, Wu, et al. [35] conducted comparative analyses on four filter-based FS methods, including maximal-minimal associated index (MMAI), maximum relevance minimum redundancy (MRMR), relief-F, and correlation-based FS (CFS). They applied these methods to hyperspectral band selection and showed that MMAI led to the best performance. In another comparative analysis, Malan, et al. [36] showed that neighborhood component analysis (NCA) outperformed GA, principal component analysis (PCA), and relief-F in terms of the kappa coefficient for signal analysis. Ren, et al. [37] implemented an improved version of the relief-F method called partitioned relief-F and compared it with PCA and the original relief-F via classification outputs. The experiments showed a slight increase of up to 4% for the improved method compared to the classical ones. Despite the merits of FS methods, they are intensive because no self-evident and consistent guidelines have been suggested for this process yet.
The non-parametric machine learning methods such as artificial neural networks (ANNs), support vector machine (SVM), and random forest (RF) have significantly captured remote sensing experts' attention in the classification of heterogeneous surfaces and big data analysis throughout the recent decades [38][39][40]. These algorithms were developed for land use/cover mapping [41][42][43] and were compared based on their classification performance by Rogan, et al. [44] and Camargo, et al. [45]. Concerning these surveys, Xie, et al. [39] exploited vegetation indices and textural and topographical data from bi-temporal ZiYuan-3 multispectral and stereo images to detect tree, forest, and land use classes. They utilized six classifiers comprising maximum-likelihood classifier (MLC), k-nearest neighbor (KNN), decision tree (DT), ANN, SVM, and RF, of which SVM's overall accuracy was greater than that of the rest with 89.2%. Vohra, et al. [46] mapped urban areas using textural and morphological features and vegetation indices. These features were extracted from long-wave bands in hyperspectral and multispectral information belonging to very high resolution (VHR) images and classified by ANN and SVM algorithms, in which SVM outperformed. Sentinel-1 urban GLCM features were used by [47]. They evaluated MLC and SVM and showed that the latter surpassed the former with a kappa of 0.72 compared to 0.61. Whereas ANN might be followed by over-fitting, SVM and RF comply with rules solving these problems in expansive feature spaces [39,48,49]. In general, SVM and RF are considered the best land use/cover mapping methods compared to other machine learning procedures [32,45,[50][51][52]. However, the performance of a classification technique is determined by the sensor properties and data attributes akin to spatial and temporal features, software and hardware capabilities, etc. [53]. Conclusively, the need for an efficient and constant classification algorithm used for the VHR image dataset has remained unfulfilled.
An optimized and consistent feature set alongside a more efficient classifier that can be employed for a wide range of VHR images from various satellites in urban phenomena classification is necessary for the classification operation. Accordingly, providing a generalizable set consisting of the most efficient features and classifier for accurate urban landcover mapping is the main objective of this research, which has not been applied in previous studies.
Selecting appropriate input features from VHR imagery and the most efficient classifier is usually time consuming and challenging. To overcome the abovementioned issues, this study aims


To evaluate various texture feature selection algorithms and classification procedures;  To provide a full-scale and optimum feature set and classifier for more efficient and accurate urban land use/cover mapping;  To help users provide the optimum feature set, significantly reducing the time and effort required for feature selection in the classification process.
To realize the objectives, we 1. Assessed VHR multispectral and panchromatic image data for extracting various urban land use/covers; 2. Extracted and collected multiscale textural features from VHR image data; 3. Implemented the wrapper-based and filter-based feature selection approaches; 4. Evaluated each feature set with classification performance to obtain the most efficient one; 5. Demonstrated the generalized characteristics of selected features for the efficient classification of new images. 6. Investigated individual features' role in the classification performance. Figure 1 presents an overview of the proposed optimum feature selection and land use/cover classification framework. Accordingly, 136 multiscale textural features were extracted from each test image. Then, these features and the panchromatic band were utilized in an AI-based selection process. Firstly, PSO and GA were employed to optimize the feature set. Subsequently, filter-based FS methods optimally reduced the selected features by ten percent of the initial features. Furthermore, each dataset's performance was evaluated using SVM, RF, and ANN results. Finally, the productivity of the optimal feature dataset was assessed by new images compared to those in the preliminary experiments.

Image Data
One of the objectives of this research was to test a generalized optimum feature selection process that can be employed for various image data with different spectral, spatial, and urban landscape characteristics. Therefore, this research used VHR images capturing a wide range of urban landscapes (shown in Figure 2) from five satellite sensors, each with a panchromatic band of sub-meter spatial resolution. The characteristics of the images used in this work are listed in Table A1. Although the acquisition locations and coverages of these images are different, there are similarities in the urban context. In addition, there are typical urban land covers and land uses, such as buildings, roads, parking lots, trees, short vegetation, highways, sidewalk, and railways in the images. These similarities help attain a rigorous and generalizable dataset for testing and validation. Within the dataset, three images, i.e., the images of Tehran (A), Hobart (E), and Denver (C), were randomly employed for feature selection processing to obtain the optimum feature set. The other two images, i.e., those of Rio (B) and Melbourne (D), were used for validation and demonstrating the generalized and universal characteristics of the selected features. Another objective of this study was to achieve a generalizable feature set for urban land cover mapping. To this end, several test images with different spectral and geometric characteristics, such as off-nadir angles and multispectral and panchromatic bands, and spatial resolution, were selected. For example, the predominance of large and small buildings with concrete roofs in Tehran, commercial and residential buildings with wood and metal roofs in Hobart and Denver, the combination of these objects in Melbourne and Rio, sports facilities in Hobart and Tehran, the railroad in Melbourne and other dominant phenomena such as different types of vegetation and asphalt structures could be really interesting in evaluating the effectiveness of the proposed approach.
The VHR images have a sub-meter spatial resolution (0.31, 0.5, and 0.6 m, respectively) in the panchromatic band; there are sufficient pure pixels for each urban land class listed in Tables A2-A6. The training and test samples were generated for various land classes, including road, highway, parking, low-rise building, high-rise building, sidewalk, lawn, tree, and shrub (Tables A2-A6). The number of pixels used for training affects the final classification accuracy. It has been recommended [2,54] that about 5% and 10% of the total pixels should be used as the training samples. Accordingly, an average of 8% of the samples was used for training, and the rest (92%) was used as test data.

Class Separability Analysis
In order to investigate the contribution of each multispectral band to the extraction of the classes, firstly, the multispectral bands were incorporated into panchromatic data by Gram-Schmidt pan sharpening. Afterward, for testing the class separabilities, Jeffries-Matusita analysis was applied to all pairs of the land use/cover classes sampled from the images of Denver, Tehran, and Hobart. The analysis indicated that 60% of the class pairs in Denver, 81% in Tehran, and 53% in Hobart had weak discrimination owing to a Jeffries-Matusita distance of less than 1.7, which is indicative of weak separability [55], i.e., the classes in a pair with short Jeffries-Matusita distance have similar spectral properties. The pair classes that indicate short Jeffries-Matusita distance are not distinguishable when relying only on spectral information in the classification process. Consequently, supplementary information, namely textural features, is needed to extract the intended classes accurately.

Feature Extraction
First-order statistics are directly derived from the digital number levels, and secondorder ones are calculated based on GLCM record occurrences of pixel pairs in varied directions [4,[11][12][13][14]29]. For the first-order features, we employed mean and variance, and for the second-order features, we used angular second moment, entropy, contrast, correlation, dissimilarity, and homogeneity [4,56,57].

Feature Selection (FS)
FS approaches are conducted to remove the redundant and irrelevant features in the input dataset, which cause more computational complexity and longer processing time in classification.

Filter-Based Feature Selection
The filtering primarily calculates the feature properties based on three criteria, including information about dependency, consistency, and distance, independent of the data mining procedures [58][59][60]. Three tools described below were used in the filter-based feature selection.
Maximum relevance minimum redundancy (MRMR) can be considered as an estimation of the majority dependence made by conditional entropy between the common diffusion of the eligible features and the classification output [35].
Relief-F [61] is designed based on instance-based learning to distinguish those statistically identical features via assessing the role of a feature in distinguishing between instances near each other considering the Euclidean distance.
Neighborhood component analysis (NCA) calculates the weight of each feature and then ranks them to extract the premier subset by maximizing a target function that assesses the average leave-one-out classification performance over the training input [36,62].

Wrapper-Based Feature Selection
The wrapper approach employs learning algorithms with iteration mechanisms and the accuracy of the classification error rate as a standard of feature impact, which includes the two elements described below.
Genetic algorithm (GA) as a population-based FS scheme commences with a primary population of chromosomes and assesses the fitness function based on the overall accuracy of the K-NN classifier with an iteration mechanism [63][64][65].
Particle swarm optimization (PSO) is based on the movement of birds as particles [66,67]. The particles' velocity is modified in an n-dimensional space until the stopping criteria are satisfied [68,69]. The best solution is achieved by each particle searching throughout its flight in the defined space and adjusting its motion based on its own flight experience and that of the group [68,69].

Classification Algorithms
Three different machine learning classifiers were used in this work to test the efficiency of optimal textural features and achieve the most efficient classifier.
Artificial neural network (ANN) is a feed-forward formation black-box model, trained by MLP, based on a back-propagation algorithm [38,70]. The underlying properties of ANN are the learning rate, training momentum, learning RMSE (root mean square error), stop criteria, and the number of learning iterations [39]. Support vector machine (SVM) discriminates input samples related to varied classes by tracking the optimum margin hyperplanes in a feature set. One of the algorithm's most indispensable and prominent parameters is the kernel function that helps apply the hyperplanes to materialize the minimum classification error appropriately [71,72].
Random forest (RF) represents a collection of tree-based algorithms [73]. The final classification result is determined by a voting process of all the trees. There are two requisite parameters for tuning the RF mechanism, including the number of trees, usually represented by 'n-tree', and the number of predictors at each decision tree node split, which can be elaborated by 'm-try'. Typically, the out-of-bag (OOB) error, applied for inner cross-validation to assess the classification performance of RFs, declines with the increase of n-tree [49,74]. The plot of OOB error versus n-tree is usually drawn, representing how many trees are adequate in the mature forest [49]. Concerning m-try, while a low m-try offers an enfeebled forecast capability for every tree, this indicates a slight correlation amongst varied trees, depleting the generalization error. The m-try equals one-third of the square root of the input features [49,75].

Multiscale Textural Feature Extraction
Identifying appropriate textural features is challenging since an efficient texture feature is a set of parameters such as texture measure, window size, spectral band, direction, and cell shift [12]. The influence of a textural feature in the classification increases with increasing spatial resolution or window size [76,77]. However, the solitary optimal window size for the exploitation of texture would not be adequate for images of urban scenes covering land use/cover classes with similar spectral behaviors. Therefore the multiscale texture analysis is appropriate for urban scenes [2,13]. In this study, five different window sizes (e.g., 5 × 5, 9 × 9, 17 × 17, 31 × 31, and 51 × 51); three directions of horizontal (0°), diagonal (45°), and vertical (90°); and four cell shifts of 3, 7, 15, and 30 pixels were tuned to implement the technique. All these configurations resulted in 136 features. It should be noted that the various directions and cell shifts are only restricted to the second-order textural parameters. For more information about how these cell shifts, directions, and window sizes were selected, see [13].
In order to adequately name each feature, the 'ZX-A-B' code was defined and used, where Z, X, A, and B are feature type, window size, cell shift, and direction or angle of the measuring filter, respectively. Table A7 presents all these 137 features used in this work, where the feature type ASM is for angular second moment, Cont is for contrast, Cor is for correlation, Dis is for dissimilarity, Ent is for entropy, Homo is for homogeneity, Var is for variance, Mean is for mean, and Pan is for panchromatic. For example, Cont9-7-45 represents a feature calculated using a contrast filter with a 9 × 9 window size and a 7-pixel cell shift in a diagonal direction.

Classification of Extracted Features
All the extracted features were input to the considered machine learning classifiers. Then the wrapper-based and filter-based optimization algorithms were applied to obtain the desired feature set. Scaled conjugate gradient (SCG) and the tangent sigmoid transfer, adaptable to GPU, were used as a network training method and transfer function. For the SVM classifier, we utilized radial basis function (RBF) kernel and grid search with 10-fold cross-validation to set the parameters of C and γ. Since RF is computationally productive and resistant to overfitting, n-tree can be chosen as large as possible [74,78]. In this study, the OOB error against the number of trees was used with 120 n-tree to reach the best approximation of n-tree for each image (Figure 3). This figure shows a slight difference between the out-of-bag classification error with 120 trees and the other cases with different tree numbers. Consequently, fewer trees were chosen for RF classification to decrease computational time and burden. The 137 features were input to three classifiers, i.e., ANN, SVM, and RF, for three images (Tehran, Hobart, and Denver). The outputs were analyzed based on F1-measure [79][80][81][82][83][84][85] and processing time ( Figure 4); the results indicated that SVM and ANN had the best and weakest F1-measure values. On the other hand, the slowest was SVM, while RF experienced the fastest performance. In addition, RF illustrated the most efficient consequences regarding classification accuracy and time since its F1-measure was approximately near that of SVM and it was also the most rapid one.

SVM RF ANN
Though high-dimensional feature input may offer more information, the irrelevant, wasteful, and redundant ones decrease the processing speed and classification accuracy. Hence, the optimal feature dataset should be provided by eliminating the repetitious and irrational information to solve this problem. In the first step, GA and PSO were implemented as the wrapper-based feature selection methods to pursue the aim. Given that the wrapper-based process is highly time intensive, we employed parallel programming in the MATLAB environment to accelerate the performance. The GA and PSO parameter values shown in Tables A8 and A9 were used in this study.
The number of features selected via the methods can be seen in Table 1. It should be noted that the time elapsed for the GA process in the three images, on average, was 4 h in comparison to 3 h for the PSO process. Subsequently, these selected features were employed in the classifiers used in this study, and their results were investigated based on the classification time and accuracy ( Figure 5). These results show that similar performance, in terms of accuracy and efficiency, can be achieved using only the optimally selected features. For all images, better classification F1-measure and faster processing time were achieved when using the input of selected features than using the original 137 input dataset. In addition, classification F1measure values with PSO-selected features as inputs were higher than those with GAselected features, but classifications with PSO-selected features required a longer processing time in all the trials.

SVM RF ANN
In the second step, three filter-based feature selection approaches, namely NCA, relief-F, and MRMR, were conducted with wrapper-based selected features to select the best 10% (approximately 14 features) from the whole dataset (i.e., 137 features). The results are presented in Table A10. Then, the 14 optimally selected features were used to classify the three images. Table 2 presents the results from classification, including F1-measure values and overall processing time. It should be noted that the overall processing time was the summation of classification and filter-based processing time. According to Table 2, the 14 optimal features provided by PSO_NCA and classified by SVM were the best combination for obtaining the highest classification accuracy in all intended images. While SVM was the most accurate classifier, RF was the most rapid. In addition, the combination PSO_NCA_RF (the qualified features produced by PSO and then minimized by NCA with classifying in RF) offered the most efficient overall time and F1-measure value in all three images. PSO, NCA, and SVM outperformed their counterparts for three images in terms of accuracy. Nonetheless, MRMR and RF indicated the best performance compared to their peer methods for these images in terms of processing speed. Meanwhile, as stated above, PSO was generally faster than GA.
In order to compare the classification accuracy and time per the input dataset, Figure  6 demonstrates the best output of each classifier and feature set for individual images. In all experiments, the F1-measure value increased with a decrease in input features from 137 to wrapper-based selected features. Moreover, this led to variations such as a slight decline or even a rise (in Denver for RF classifier) in wrapper-based results using the optimal feature datasets. The classification time, in all trials, was reduced with the optimum inputs. Although SVM and ANN's process time depended on the input dataset, RF had a significantly minor sensitivity to the change in input. Overall, the most productive combination of dataset and classifier was the set of 14 optimal features and RF.

Analysis of Generalization
In order to investigate the generalization of the PSO_NCA feature set in other VHR satellite images classification, two images from Pléiades and WorldView-3 sensors over Melbourne and Rio de Janeiro were classified by the same classifiers (Table 3). Similar to the previous trials, while SVM provided the highest accuracy and process time, ANN led to the lowest F1-measure value and average processing time. On the other hand, RF, indicated as the fastest classifier with a satisfying F1-measure value of above 0.9, was regarded as the most productive method. The PSO_NCA feature set represented acceptable performance for the new images, the same as the former three data, indicating the high generalization capability.  Denver CLASSIFICATION TIME

Hobart Tehran
While several studies [30][31][32] indicated that the wrapper-based methods outperformed filter-based algorithms, this study utilized both to achieve the dataset with proper size and good performance. Our results showed that the feature dataset selected by wrapper-based algorithms was still extensive and subsequently accompanied by high computational time. Therefore, the filter-based methods were employed to minimize the dataset achieved by wrapper-based algorithms by up to ten percent of the primary dataset. The results indicated that the wrapper features selected outperformed the original set (137 features) and the set of 14 features in classification accuracy. However, 14 features provided the fastest classification process and high classification accuracy in all experimented images.
On the other hand, while SVM and ANN experienced considerable fluctuations in performance with the change of feature input size, RF revealed slight declines in computational burden and classification performance. According to [2,13], the textural features utilized in this study can be effective in the shadowed area and for vertical objects' (buildings and trees) extraction without the use of elevation features such as the digital elevation model (DEM) and the shadow effects. Therefore, the proposed methodologies can be used efficiently to extract buildings and trees with an average user accuracy of around 94% and 89% achieved with the PSO_NCA_RF dataset.

Feature Assessment
To investigate the role of the extracted features (Table A7) in the classification, five first-order feature datasets (the last column of Table A7) and 21 second-order feature datasets (columns 1-6 of Table A7), as well as the panchromatic band for Tehran, Hobart, and Denver's images, were individually input to SVM. Then, the average user accuracies of three major classes, i.e., vegetation (tree, shrub, and lawn), asphalt (road, highway, and parking), and building (high-rise, low-rise, and commercial building), were calculated ( Figure 7). The results presented that dissimilarity, contrast, and correlation played the most crucial role in the more-accurate extraction of the major classes in all experiments. The results were consistent with the final 14 optimal feature cases since these textural parameters constitute at least half of the selected features for GA_Relief-F, PSO_Relief-F, GA_NCA, and PSO_NCA datasets (Table A10). However, dissimilarity, contrast, and correlation had fewer shares in the cases of the feature sets selected by GA_MRMR and PSO_MRMR (Table A10), which induced weaker classification performance (Table 2). On the other hand, the feature sets of GA_Relief-F, PSO_Relief-F, and GA_NCA (Table A10) comprised features with small window sizes (5 × 5 and 9 × 9). As the bigger the window size of textural features is, the better the classification performance [2,13,76,77] is; the PSO_NCA dataset (Table A10) involving only the biggest window sizes (31 × 31 and 51 × 51) was accompanied by the most efficient classification performance (Table 2). Furthermore, the panchromatic band presented an acceptable performance as a single input compared to first-and second-order textural features in classification as a feature dataset (Figure 7). Accordingly, the panchromatic image had the potential to be involved in PSO_NCA as the most efficient feature dataset (Table A10). The results showed that PSO outperformed GA in classification accuracy and computational time, similar to the case in previous works [33,[86][87][88][89][90]. On the other hand, NCA was superior to its counterparts, the same as in [11,85], while MRMR provided the fastest process. Compared to a previous study [13] with identical study areas, input dataset (137 feature set), and classifier (ANN), the number of training samples was decreased by 41%, 43%, and 36% for Tehran, Hobart, and Denver images, respectively. This study dataset calculated the kappa value to compare the change extent, presenting a decrease in kappa by 0.07 for Tehran, 0.06 for Hobart, and 0.14 for Denver. This fact indicates the high sensitivity of ANN to training samples compared to SVM and RF, which is consistent with the results of previous works by [44,49,91,92].
Moreover, RF and ANN were the most and the least efficient methods among AIbased classifiers, as reported in the literature [39,43,72,[93][94][95]. RF had the most rapid classification process. Additionally, although SVM and ANN's process time depended on the number of input features, RF had a significantly minor sensitivity to the number of input features. Accordingly, RF was the most robust classifier, consistent with [95]. Several studies based on GLCM textural features reported entropy, angular second moment, and contrast as the most valuable features in classification [2,13,[96][97][98][99][100]. However, in this study, dissimilarity, contrast, and correlation showed the most contributing role in the classification performance based on feature evaluation in the classification and comparing its results with the optimal features selected by optimization approaches.
The results indicated the validity of the optimum feature input by presenting the F1measure value of larger than 0.9 for the classification process. Furthermore, data processing time and burden were considerably reduced in the classification process, which is one of the merits of the proposed method compared to the previous works [2,13,72,101]. On the other hand, providing a generalizable dataset for classification can be considered the most significant achievement of this research, which has not been applied in previous studies.
While the approach in this study has significantly decreased the computational time and burden and represented a robust and fast method to classify different landscapes in VHR imagery for urban areas, the extraction of training samples has remained challenging and time consuming. The sampling issue can be regarded as the drawback of this study, which is necessary for machine learning supervised classification algorithms. Nevertheless, the downside can be addressed by convolutional neural networks (CNNs) independent of training samples' exploitation.

Conclusions
Feature selection to reduce redundancies for efficient classification is necessary but usually time consuming and challenging. This study offered proper guidelines for selecting textual features for more accurate land use/cover classification. Accordingly, 136 multiscale textural features were generated from each test image. These features, plus the panchromatic band, were used in an AI-based selection process. Firstly, the feature set was optimized by PSO and GA as per the wrapper-based feature selection algorithms. The selected features were then optimally reduced by ten percent of the initial features extracted by filter-based FS methods, including NCA, relief-F, and MRMR.
Moreover, the performance of each feature set was investigated with results using SVM, RF, and ANN classifiers. Finally, the efficiency of the optimum feature set was analyzed by using new images and compared to those in the preliminary trials. The experiments showed that RF, PSO, and NCA were superior to their counterparts in terms of productivity. In the classification performance, dissimilarity, contrast, and correlation features outperformed other GLCM textural features and panchromatic bands. Contrary to SVM and ANN, RF indicated a minor sensitivity to the number of input features in term of implementation time. To conclude, the set of 14 optimal features (including 13 textural features and the panchromatic band) and RF classifier performed as the most optimum combination for classifying VHR images of urban areas. In the future, the role of this feature dataset in the performance of deep learning approaches can be further investigated.  Data Availability Statement: The data, except for Tehran image, used in this paper were provided from MAXAR (formerly Digital Globe) (www.maxar.com) few years ago. Basir Remote Sensing Institute (www.basir-rsi.ir) also supplied the Tehran data.

Acknowledgments:
The authors would like to thank MAXAR (formerly Digital Globe) Company and Basir Remote Sensing Institute for providing the VHR satellite images.

Conflicts of Interest:
The authors declare no conflict of interest.