Optimal Combination of Classification Algorithms and Feature Ranking Methods for Object-Based Classification of Submeter Resolution Z/I-Imaging DMC Imagery

Object-based image analysis allows several different features to be calculated for the resulting objects. However, a large number of features means longer computing times and might even result in a loss of classification accuracy. In this study, we use four feature ranking methods (maximum correlation, average correlation, Jeffries–Matusita distance and mean decrease in the Gini index) and five classification algorithms (linear discriminant analysis, naive Bayes, weighted k-nearest neighbors, support vector machines and random forest). The objective is to discover the optimal algorithm and feature subset to maximize accuracy when classifying a set of 1,076,937 objects, produced by the prior segmentation of a 0.45-m resolution multispectral image, with 356 features calculated on each object. The study area is both large (9070 ha) and diverse, which makes the results easier to generalize. The mean decrease in the Gini index was found to be the feature ranking method that provided the highest accuracy for all of the classification algorithms. In addition, support vector machines and random forest obtained the highest accuracy in the classification, both using their default parameters. This is a useful result that could be taken into account when processing high-resolution images of large and diverse areas to obtain a land cover classification. Remote Sens. 2015, 7 4652


Introduction
While the traditional pixel-based approach for remote sensing image classification is based on the statistical analysis of multispectral features of the pixels in an image, object-based image analysis (OBIA) allows the use of a wide range of additional information. The OBIA approach involves two steps: segmentation and classification. After segmentation, a very large number of features can be calculated for the resulting objects. The main advantages of OBIA, compared with pixel-based approaches, are the larger number of available features and the fact that the features convey more information when they are calculated on real objects than when sampled on a square grid. The availability of high spatial resolution satellites for civil use, for example QuickBird [1], and the release in 2000 of eCognition, the first commercial OBIA software [2,3], are the two advances behind the expansion of OBIA.
eCognition [4] was initially developed by Definiens AG to overcome limitations in traditional approaches to the analysis of high spatial resolution remote sensing images and was the first commercial software that attempted to overcome such limitations. This software allows the extraction of meaningful objects from the image, from which semantic information can then be derived. eCognition adopted an object-based and multi-scale approach to the analysis of digital images and has become the most widely-used software in remote sensing for extracting thematic information from very high spatial resolution images. Its release has given universal access to tools that previously were only available in specialized research labs [2,3].
Until the mid-1990s, satellite imagery classification was mainly based on conventional statistical techniques, such as maximum likelihood or minimum distance, usually on a pixel basis. In recent years, due to the advances made in computing technology, alternative machine-learning algorithms have been proposed, particularly the use of artificial neural networks, weighted k-nearest neighbors (wk-NN), decision trees, support vector machines (SVM) and methods derived from the theory of fuzzy logic [5]. OBIA has not been unaffected by this trend, and several studies have used these algorithms to classify the objects produced by OBIA segmentation algorithms [6–16].
Whatever the algorithm used, having a very large number of features poses two problems. First, the larger the number of features used in classification, the longer the computing time needed. Second, using a very large number of explanatory features, especially when some of them are redundant, noisy or uninformative, might result in a less accurate classification [17,18]. This is the so-called curse of dimensionality or the Hughes effect [19], which is an important issue in optimization and machine learning. Its main consequence is the need to greatly increase the number of training objects necessary to maintain the sampling density in the space of features as the number of dimensions, i.e., the number of explanatory features, increases. This issue is increasing in importance due to the emergence of both OBIA and hyperspectral sensors. However, not all classification algorithms are sensitive to this effect.
If no classification method is to be a priori discarded, it is better to use the lowest possible number of explanatory features.
However, it may be difficult to select the features that are most relevant for classifying the objects. Thus, feature selection has become an important research topic in OBIA, and Lu and Weng [18] have identified the development of approaches to feature selection as one of the critical steps in remote sensing.
Despite its importance, it is not very common for OBIA-related papers to explicitly mention feature selection or the criteria and measurements used for it. Most papers in which it is mentioned are based on the Jeffries-Matusita distance [21–23], while others [10,15,24] use the Gini index, an index of the relative importance of features produced by algorithms based on decision trees. Laliberte et al. [25] compare the Jeffries-Matusita distance and the mean decrease in the Gini index (MDG) to select features to use with the wk-NN classification algorithm. MDG is a feature importance measurement provided by tree-based machine learning prediction methods, such as random forest. The use of the Jeffries-Matusita distance assumes a multivariate normal distribution of the features in each of the classes. However, this assumption is not always met for features extracted from segmented objects. On the other hand, MDG is a non-parametric statistic that assumes no theoretical probability distribution, which, in principle, is a strength of MDG.
The main objectives of this research are:
(1) To study the performance of four feature ranking methods to select the optimum feature subset to classify a very high resolution image using five classification algorithms.
(2) To identify a feature ranking method that overcomes the Hughes effect in most of the classification methods.
(3) To identify which of the analyzed classification algorithms is least sensitive to the Hughes effect.
(4) Finally, to obtain a land cover layer.
To achieve these main objectives, it was necessary to fulfil three secondary objectives:
(1) To generate a large number of potentially explanatory variables and to use them to evaluate several feature ranking and selection methods. The aim is to investigate whether these methods can identify those features that introduce redundant information or do not contribute to a statistical discrimination of classes.
(2) To test a group of classification algorithms that represent the most important types of machine learning algorithms.
(3) To implement a heuristic search process for feature selection. This will allow us to identify the feature subset that can achieve optimal classification.
The paper is organized as follows: In Section 2, we present the study area (Section 2.1) and the dataset, which is the result of applying multiresolution segmentation (Section 2.3) on a high-resolution multispectral image (Section 2.1). Several features obtained from this segmentation (Section 2.4) were used to classify the objects in order to test several classification algorithms (Section 2.5) and feature selection methods (Section 2.6). Different parameter values for the classification algorithms were also tested (Section 2.5). Special care was devoted to a sample design to obtain validation and calibration data (Section 2.2). In Section 3, we present the results, and finally, the conclusions are presented in Section 4.
(6) To modify some parameters of the classification algorithms identified in the literature as the most relevant to improve classification accuracy.
(7) To analyze validation data using the best combination of ranking method and classification algorithm.
Finally, a land cover layer is obtained.

Study Area and Data Used
The study area (Figure 2) is located in the Murcia region (southeast Spain) and corresponds to Irrigation Unit (UDA) 28, as defined in the Plan Hidrológico de la cuenca del Segura (River Segura Basin Hydrological Plan), the previously in-force water resources planning law passed on 24 July 1998 (R.D. 1664/1998). A 150-m buffer zone has been added to ensure the complete inclusion of the relevant plots. The study area is large, 9070 ha, including the buffer, and includes different types of agricultural landscape; the intra-class variability of the Mediterranean countryside is also well represented. This irrigation unit includes traditional orchards and modern, highly technical agricultural areas. As a result, there is a large variety of crops and land covers (40% grass and 60% trees). The traditional irrigation systems are very old. Initially, several springs in the limestone rocks, predominant in the River Argos headwaters, were exploited (more than 80% of the currently used resources). Later, several wells were excavated to complement those resources, and today, groundwater represents 15% of the resources used [32].
Most of the information used in this research was obtained from the Servicio de Integración y Gestión Ambiental (Environmental Integration and Management Service, SIGA), a branch of the Murcia regional government. The data, obtained under the Natmur-08 project [33], consist of a 2-m resolution multispectral (blue, green, red and near-infrared) image and a 0.45-m panchromatic digital image. The images corresponding to the study area were acquired on 9, 10 and 11 July 2008 with an Intergraph Z/I-Imaging Digital Mapping Camera. Both images were fused using the Gram-Schmidt method [26,34,35] to obtain the final 0.45-m resolution multispectral image used in this study.
Additionally, digital terrain (DTM) and digital surface (DSM) models, with a resolution of 4 m, were used as ancillary data. These layers, also produced under the Natmur-08 project, had been interpolated from LiDAR point clouds obtained with a LEICA ALS50-III; we only had access to the interpolated layers and not to the original point clouds.

Classification Scheme, Sample Design and Field Data
As the aim of the work was to produce a map of agricultural land cover types, most of the classes included in the classification scheme have an agronomic sense. There are also two classes that include natural vegetation and artificial objects. The list of classes includes: Cereals (Cer); Rainfed arable lands (Rar); Rural wasteland (Rws); Irrigated grassland (Igr); Almond trees (Alm); Irrigated fruit trees (Ifr), including seedlings; Olive trees (Oli), including seedlings; Greenhouses (Gre); Other non-agricultural vegetation (Ot.Ve); and Other artificial areas (Ot.Ar).
Validation areas were sampled by generating random points; the plots (both agricultural and non-agricultural) where such points were located were selected as validation areas. Training areas were not randomly chosen; instead, those that adequately characterized the different classes were selected.
To obtain a statistically-consistent sampling strategy, the multinomial distribution equations [36] were used to calculate the sample size necessary for this research with a confidence level of 95% and a margin of error of 5%; the result was a total of 558 plots.
Two sampling schemes were used. The first random sampling included 254 validation areas, 200 of which correspond to agricultural uses. The main objective of this first sampling was to evaluate the spectral variability of each class. This information is useful to establish the sampling size of each stratum (land use class) in the second (stratified) sampling. The objective of this stratified sampling was to evaluate the accuracy of the classification by class rather than the accuracy of the whole scene.
One of the critical points of stratified sampling is the distribution of the samples across strata [37]. Some authors [36] claim that at least 50 samples for each class are needed in projects similar to ours.
In the present study, a minimum of 50 validation areas was used for each class. The other 58 were distributed proportionally to the spectral variability, calculated as a coefficient of variation, of the 10 classes. The distribution of sampling plots in each stratum is shown in Table 1.
The analysis unit, the unit on which the decision of the success or failure of the classification is made, is the object. Congalton and Green [36] recommend that the analysis unit be the object, even when the methodology is based on pixels. Objects have been used as analysis units in several studies [30,38–40].
All of the relevant objects in a validation area receive the label corresponding to the area. In areas with tree cover, almonds for instance, this label is manually assigned to all tree-objects or intersections of trees, the bare soil objects remaining unlabeled. In plots without trees, classless objects will be minimal. In this way, the labeled objects become the cases to classify.
Fieldwork was conducted in different periods (July 2009 and December 2010) to verify that land use had been correctly identified. Most of the plots were visited about a year after the capture of the images, so that the state of the vegetation in the image and during fieldwork coincided.

Image Segmentation
Multiresolution segmentation [41], the segmentation algorithm that we used with this image, is one of the most widely used in OBIA. It has been included in eCognition since the first versions [42,43], and despite its recent availability, it is rapidly becoming one of the most cited segmentation algorithms [18]. The details of the algorithm can be consulted in [41]. The key parameter for this segmentation method is the scale parameter, although there is no straightforward method available to obtain its optimum value. The usual approach is to find a compromise value for the whole image by trying several values and evaluating the results [4,44–46]. However, this global approach needs a certain degree of uniformity in the image, so its application to large heterogeneous images, as in this study, may be difficult.
For this reason, a local approach for image segmentation was tested, which consists of splitting the image into uniform spatial units in which the scale parameter is optimized locally. The multiresolution segmentation algorithm was locally optimized with 3 spectral bands (red, green and near-infrared), all with the same importance. The importance of shape homogeneity was 30%, and the weighting coefficient for both compactness and smoothness was 50%. The importance given to each layer, the weight given to shape and the importance given to compactness and smoothness were not optimized, but remained invariant throughout the whole optimization process. This methodology and the results are fully explained in [27]. As a result of the segmentation, a set of 1,076,937 objects was obtained.

Features Obtained from the Objects
Table 2 shows the object features calculated using eCognition software. Features are grouped into six main categories; their names are in bold face, and a short description is added when needed. Some of the features were only calculated with the red band. The reason for this is that, due to the high correlation within the visible spectrum, we did not think that the blue and green bands would have added much to the discrimination capacity. Moreover, their inclusion would have strongly increased the computational complexity and would probably have complicated the feature selection process. The number of bands from which the features were calculated is indicated between parentheses. Technical details of every feature are described in DEFINIENS [47].
Texture is one of the most important features when identifying objects from a raster image. Haralick et al. [48] proposed a set of indices extracted from the grey level co-occurrence matrix (GLCM) [49,50]. The spatial relationships between neighboring pixels can be measured in each of the four main directions (N-S, NE-SW, E-W, SE-NW) or as a directionally-invariant average. Each textural feature is calculated in each of the four main directions in four layers (red band reflectivity, slope, aspect and convexity) and as a directionally-invariant average in the 10 input layers. Overall, there are 204 textural features.
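As a minimal sketch of how a GLCM-based texture index can be obtained, the following Python code builds a symmetric, normalized co-occurrence matrix for one direction and derives two Haralick-type indices from it. The function names, the toy 4 x 4 image with four grey levels and the mapping of directions to pixel offsets are assumptions made for illustration; they are not taken from eCognition, whose directionally-invariant feature may also be computed differently from the simple average used here.

```python
from collections import Counter

def glcm(image, offset, levels):
    """Symmetric, normalized grey level co-occurrence matrix for one offset."""
    d_row, d_col = offset
    n_rows, n_cols = len(image), len(image[0])
    counts = Counter()
    for r in range(n_rows):
        for c in range(n_cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < n_rows and 0 <= c2 < n_cols:
                a, b = image[r][c], image[r2][c2]
                counts[(a, b)] += 1
                counts[(b, a)] += 1  # count each pixel pair in both directions
    total = sum(counts.values())
    return [[counts[(i, j)] / total for j in range(levels)] for i in range(levels)]

def contrast(p):
    """Haralick contrast: large when neighboring grey levels differ strongly."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))

def homogeneity(p):
    """Haralick homogeneity: large when neighboring grey levels are similar."""
    n = len(p)
    return sum(p[i][j] / (1 + (i - j) ** 2) for i in range(n) for j in range(n))

# (row, column) offsets for the four main directions (assumed convention).
OFFSETS = {"E-W": (0, 1), "N-S": (1, 0), "NE-SW": (-1, 1), "SE-NW": (1, 1)}

# Hypothetical image with grey levels 0..3.
toy = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 2, 2, 2],
       [2, 2, 3, 3]]

per_direction = {name: contrast(glcm(toy, off, 4)) for name, off in OFFSETS.items()}
# One simple directionally-invariant value: the mean over the four directions.
all_directions = sum(per_direction.values()) / len(per_direction)
```

Computing every index per direction and per layer multiplies the number of features quickly, which is consistent with the large share of textural features in the dataset.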
The context features of objects are extracted by comparing an object feature (spectral, geometric or textural) with the same feature in the neighboring objects. These features can be used to improve the classification results [51]. Although the concepts of brighter and darker objects only make sense in features related to spectral properties and not in those related to topographical properties, we have maintained the feature names given by eCognition after the segmentation phase.
In summary, there are 40 spectral features, 5 pixel-based features, 24 geometric features, 204 texture features and 83 context features. This large number of features (356) poses a problem of collinearity that will be solved using the feature selection techniques introduced below.
Besides the features used for classification, the values given to the parameters of a classification algorithm might also affect its accuracy. To analyze this problem, the parameters of the classification algorithms identified as relevant in the literature were modified. The naive Bayes and linear discriminant analysis algorithms have no parameter to change.
Parameter changes were made sequentially, because testing every combination of parameter values was not feasible from a computational point of view. For example, with SVM, the optimal kernel was first determined using the default values for C and G. The C parameter was then optimized using the optimal kernel and the default G. Finally, the G parameter was optimized using the optimal kernel and the optimal C value.
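The sequential strategy described above can be sketched as a one-parameter-at-a-time search. The code below is only an illustration: `tune_sequentially`, the scoring function `toy_kappa`, the candidate values and the defaults are all hypothetical stand-ins, and in a real run the scoring function would train the classifier and return the kappa index of the validation sample.

```python
def tune_sequentially(evaluate, grid, defaults):
    """Optimize one parameter at a time, keeping earlier winners fixed."""
    best = dict(defaults)
    for name, candidates in grid.items():  # dicts preserve insertion order
        scores = {}
        for value in candidates:
            trial = dict(best, **{name: value})
            scores[value] = evaluate(**trial)
        best[name] = max(scores, key=scores.get)
    return best

def toy_kappa(kernel, C, G):
    """Hypothetical score surface standing in for 'train SVM, return kappa'."""
    base = {"linear": 0.80, "polynomial": 0.78, "radial": 0.85, "sigmoid": 0.70}[kernel]
    return base - 0.01 * abs(C - 4) - 0.5 * abs(G - 0.01)

# Hypothetical candidate values; the order of the keys is the tuning order.
grid = {"kernel": ["linear", "polynomial", "radial", "sigmoid"],
        "C": [1, 2, 4, 8],
        "G": [0.001, 0.01, 0.1]}
defaults = {"kernel": "radial", "C": 1, "G": 0.1}

best = tune_sequentially(toy_kappa, grid, defaults)
n_sequential = sum(len(v) for v in grid.values())  # 4 + 4 + 3 evaluations
n_full_grid = 4 * 4 * 3                            # exhaustive alternative
```

The trade-off is that a sequential search needs far fewer model fits than a full grid, but it can miss the best combination when the parameters interact.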

Linear Discriminant Analysis
Linear discriminant analysis (LDA) is one of the earliest, simplest and most widely used supervised classification algorithms. It tries to maximize the between-group variance and minimize the within-group variance, assuming that the features arise from a multivariate normal distribution with a class-specific mean vector and a common variance-covariance matrix [63].
The distributions of the predictors are first analyzed for each of the classes, and then, Bayes' theorem is used to obtain the probability of each class given the predictor values:

$$P(k \mid X = x) = \frac{P(k)\, f_k(x)}{\sum_{l} P(l)\, f_l(x)} \qquad (1)$$

where k and l are each one of the classes analyzed, P(k) and P(l) are the prior probabilities of such classes and f_k(x) and f_l(x) are the densities of the predictors in each class. In the case of linear discriminant analysis, f_k(x) and f_l(x) correspond to the multivariate normal density. When this function is introduced into Equation (1), we obtain an equation that, taking logs and rearranging to eliminate constants, gives a linear equation that can be calculated for each class:

$$\delta_k(x) = x^{T} \Sigma^{-1} \mu_k - \frac{1}{2}\, \mu_k^{T} \Sigma^{-1} \mu_k + \log P(k) \qquad (2)$$

where $\mu_k$ is the mean vector of class k and $\Sigma$ is the common variance-covariance matrix. The method receives its name from the fact that this equation is linear in x. Each observation X = x is finally assigned to the class k that maximizes Equation (2). In practice, linear discriminant analysis creates a linear frontier (hyperplane) between each pair of classes in the feature space and divides it into regions belonging to each class. The observations are classified according to the region in which they are located.

Naive Bayes
Bayesian networks model the dependence of the dependent variable on each of the predictors as a directed acyclic graph. The final node represents the dependent variable and the other nodes the predictors. The topology of the graph reflects the dependence relations among variables.
Naive Bayes [52] is the simplest case of a Bayesian network. In this case, just one arc goes from each of the predictors to the dependent variable. This means that predictors are assumed to be conditionally independent in every class. Therefore, a simplified version of the Bayesian equation is applied. The function to maximize is:

$$v_{NB} = \underset{v_j \in V}{\arg\max}\; P(v_j) \prod_{d} P(x_d \mid v_j)$$

where V is the set of classes, P(v_j) is the prior probability of class v_j and P(x_d | v_j) is the density function of each feature for an observation arising from the class v_j.
Although conditional independence is a rather strong assumption, naive Bayes is a competitive algorithm whose results can even outperform other algorithms [64–66]. Another advantage is that the independence assumption causes a significant reduction in computing time [66].
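To make the naive Bayes rule concrete, the sketch below picks the class that maximizes the prior times the product of per-feature densities, working in log space to avoid numerical underflow. Modeling each feature with a univariate normal density is an assumption of this example, and the function names, the two-class toy data and the small variance floor are hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the prior and per-feature mean and variance for every class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # The small constant avoids a zero variance for constant features.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def log_gaussian(x, mean, var):
    """Log of the univariate normal density."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def predict_nb(model, x):
    """Return the class maximizing log prior + sum of log feature densities."""
    def score(label):
        prior, means, variances = model[label]
        return math.log(prior) + sum(
            log_gaussian(v, m, s) for v, m, s in zip(x, means, variances))
    return max(model, key=score)

# Hypothetical training objects described by two features.
X = [[1.0, 5.0], [1.2, 4.8], [0.9, 5.1], [3.0, 1.0], [3.2, 1.2], [2.9, 0.8]]
y = ["Cer", "Cer", "Cer", "Oli", "Oli", "Oli"]
model = fit_gaussian_nb(X, y)
```

Because each feature is treated separately, fitting only requires one pass per class and feature, which reflects the reduction in computing time noted above.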

Weighted k-Nearest Neighbors
When used for classification, k-nearest neighbors [53] estimates the class for every new observation using the k-closest observations, according to a distance metric, from the training set. Class probabilities for the new observation are estimated as the proportion of training set neighbors in each class. Ties are broken randomly or by including the k + 1 closest neighbor in the calculation.
An important parameter to be taken into account is k, the number of neighbors. A small value leads to a low-bias, high-variance prediction, increasing the probability of over-fitting, while too large a value causes a high-bias classification.
Despite its simplicity, this algorithm has been successful in a large number of classification problems [61].
The algorithm wk-NN is a modification of k-NN, in which each training example is weighted according to its distance from the point being classified.
As regards the wk-NN parameters, both the number of neighboring cases taken into account to predict and the distance measurement can be modified. Euclidean and Manhattan are the available distance measurements, the latter being the default option. Although it is possible to vary the type of kernel, it was not modified, following the advice of the author of the R package that implements the algorithm [53]. For modifying the number of neighbors, an arithmetic progression from 1 to 19, with a step of two, was tested.
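The idea of weighting neighbors by distance can be sketched as follows. This is an illustrative assumption rather than the implementation used in the study: distances are scaled by the distance to the (k + 1)-th neighbor and converted into weights with a triangular kernel, one of several possible kernels, and the function names and toy data are hypothetical.

```python
from collections import defaultdict

def minkowski(a, b, p):
    """Minkowski distance: p = 1 is Manhattan, p = 2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def wknn_predict(train_X, train_y, query, k=7, p=2):
    """Weighted k-NN vote; closer neighbors receive larger weights."""
    dists = sorted((minkowski(x, query, p), label)
                   for x, label in zip(train_X, train_y))
    neighbours = dists[:k + 1]
    # Scale by the distance to the (k + 1)-th neighbor (or the farthest one
    # available); fall back to 1.0 if that distance is zero.
    scale = neighbours[-1][0] or 1.0
    votes = defaultdict(float)
    for d, label in neighbours[:k]:
        votes[label] += max(0.0, 1.0 - d / scale)  # triangular kernel
    return max(votes, key=votes.get)

# Hypothetical training objects: one almond object and three olive objects.
train_X = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [5.0, 5.0]]
train_y = ["Alm", "Oli", "Oli", "Oli"]

predicted = wknn_predict(train_X, train_y, [0.1, 0.1], k=3, p=2)

# Candidate numbers of neighbors: 1 to 19 with a step of two.
k_values = list(range(1, 20, 2))
```

In this toy case the query lies almost on top of the almond object, so the weighted vote selects Alm, whereas an unweighted vote among the same three neighbors would select Oli.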

Random Forest
Decision trees build a classification model by a recursive binary partition of a labeled dataset into increasingly homogeneous nodes. Homogeneity is measured by the Gini index [54], defined as:

$$G = 1 - \sum_{k} P(k)^{2}$$

where P(k) is the proportion of observations in the k-th class. At each step, an optimization is carried out to select, in each node, the feature and the numeric threshold, or group of values if the variable is categorical, that would produce the lowest G value if used to divide the node. This process continues until it is not possible to reduce the Gini index in any node [60]. The final result should be a classification tree whose lower nodes are completely homogeneous. However, this is not always the case, and the predominant class is used to label the node, the other cases being classification errors. On the basis of these errors, the tree is pruned to allow a higher generalization capacity. A single classification tree is very sensitive to small modifications in the dataset. Ensemble learning techniques try to overcome this limitation and to obtain a better predictive performance.
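The node-splitting step can be illustrated with a short sketch that computes the Gini index as one minus the sum of squared class proportions and searches for the threshold on a single numeric feature that minimizes it. Scoring a split by the size-weighted mean of the Gini index of the two child nodes is the usual CART convention and is assumed here; the function names and toy values are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini index of a node: 0 when pure, larger when classes are mixed."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted Gini) of the best binary split.

    The threshold is None when no split improves on the unsplit node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, weighted)
    return best

# Hypothetical feature values for six objects of two classes.
values = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8]
labels = ["Rws", "Rws", "Rws", "Igr", "Igr", "Igr"]
threshold, impurity = best_threshold(values, labels)
```

A full tree repeats this search over every feature in every node; the decrease in the Gini index achieved by the chosen split is the quantity accumulated per feature by tree-based importance measures.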
Random forest [56] generates a large number of unpruned trees (500–2000) using a bootstrapped sample of the cases; each node division is carried out with a randomized subset of the predictors to add randomness and to decrease the correlation between trees. Low correlation between trees is a desirable property in ensemble classifiers, because the voting system that finally estimates the class of any new case is only meaningful when the individual trees give different results. Random forest can outperform other machine learning classification algorithms (support vector machines or neural networks) and other decision tree algorithms [56,57].
As the classification errors of any tree are diluted into the ensemble, random forest does not over-fit the model to the dataset [6,56,67,68]. Since the cases not included in a bootstrapped sample are not used to fit the corresponding tree, they can be used to perform a cross-validation accuracy estimation [60].
The number of trees and the number of predictors used to train each tree are the parameters that can be set by the user. Nevertheless, the method does not seem to be very sensitive to these values, which are, by default, 500 and the square root of the number of available features [57,59]. In general, random forests do remarkably well and require very little tuning [61].
A disadvantage of random forest compared with the simple classification tree approach is that, as individual trees cannot be examined separately, it becomes a "black box" approach [68]. However, it provides several metrics that help to interpret the model. Variable importance is evaluated by calculating how the accuracy or the Gini index would decrease if the data for that predictor were permuted randomly. The resulting values can be used to compare the relative importance among predictor variables. In this way, the result is easier to interpret than in other algorithms, such as neural networks [68].
The most relevant parameters in random forest are the number of trees (ntree) generated and the number of randomly chosen features used to divide the nodes in each of the individual trees (mtry). The default values in the R package randomForest are ntree = 500 and mtry = int($\sqrt{N_f}$), where $N_f$ is the number of features. Random forest does not seem to be very sensitive to its parameters [57,61]; nevertheless, we tested ntree = {250, 500, 1000} and doubling and halving mtry, following the recommendations of the authors of the package [57].
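As a small worked example, the candidate values implied by this description can be computed for the complete set of 356 features. Rounding the halved value down with integer division is an assumption of this sketch, and the variable names are hypothetical.

```python
import math

n_features = 356

# Default mtry for classification: integer part of the square root.
mtry_default = math.isqrt(n_features)

# Candidate values: the default, half of it and double it.
mtry_values = [mtry_default // 2, mtry_default, mtry_default * 2]
ntree_values = [250, 500, 1000]
```

With 356 features the default is 18 features per split, so the three mtry candidates are 9, 18 and 36.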

Support Vector Machines
Support vector machines (SVM) [61,62] are a very flexible classification algorithm that tries to maximize the distance between the hyperplanes that separate classes and the cases closer to those hyperplanes, the so-called support vectors. These large distances, margins in SVM terminology, give SVM a greater generalization capacity, because they maximize the probability of correctly classifying new cases located in the area between two different classes.
The decision function used to estimate the class of each new case is:

$$f(u) = \operatorname{sign}\left( \sum_{i} \alpha_i\, y_i\, K(x_i, u) + b \right)$$

where x_i are the feature vectors of each one of the training cases, y_i the class of the training cases, u the feature vector of the new case whose class is to be estimated, b an offset term and α_i a parameter that only differs from zero for the training cases that are support vectors. In this way, the decision is a function of only the support vectors. SVM is included in the category of kernel methods; the function K(x_i, u) is the kernel function. In the simplest case, it is just the dot product of both feature vectors, producing linear hyperplanes. In addition, different non-linear kernel functions can be used to transform the space of features and, in this way, produce non-linear hyperplanes.
When the classes are not completely separable, a cost parameter will penalize those cases situated on the wrong side of the separating hyperplane. The higher the cost, the more complex the hyperplane needed to avoid misclassifications. Therefore, a higher cost parameter will also produce a model with a lower generalization capacity. SVM is conceived of as an algorithm to separate two classes. When there are more than two classes, it is usually applied on a one vs. the others basis, using the distance to the separating hyperplanes as a membership criterion.
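The way a trained two-class SVM labels a new case can be sketched as the sign of a kernel-weighted sum over the support vectors plus an offset. The code below only evaluates that decision function; it does not train the model. The support vectors, coefficients, offset and kernel width are hypothetical values chosen for illustration, and the offset term `b` follows the usual formulation.

```python
import math

def linear_kernel(x, u):
    """Dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, u))

def rbf_kernel(x, u, gamma):
    """Radial kernel; gamma controls its width."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, u)))

def decision_value(support_vectors, labels, alphas, b, u, kernel):
    """Kernel-weighted sum over the support vectors plus the offset b."""
    return sum(alpha * y * kernel(x, u)
               for x, y, alpha in zip(support_vectors, labels, alphas)) + b

def predict(support_vectors, labels, alphas, b, u, kernel):
    """Class +1 or -1 according to the sign of the decision value."""
    value = decision_value(support_vectors, labels, alphas, b, u, kernel)
    return 1 if value >= 0 else -1

# Hypothetical trained model: two support vectors, one per class.
support_vectors = [[1.0, 1.0], [-1.0, -1.0]]
labels = [1, -1]
alphas = [0.25, 0.25]
b = 0.0

linear_class = predict(support_vectors, labels, alphas, b, [2.0, 0.0], linear_kernel)
radial_class = predict(support_vectors, labels, alphas, b, [-3.0, 1.0],
                       lambda x, u: rbf_kernel(x, u, 0.5))
```

Only the support vectors enter the sum, so prediction cost depends on their number rather than on the size of the training set.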
In the case of SVM, it is possible to modify the kernel type, the cost parameter C and the width parameter G. Four kernel transformations were tested: linear, polynomial, radial and sigmoidal. For the penalty parameter, the following values were tested:

Software Used
R software [69], an open source statistical program and language, was used to run all classification algorithms. The classification algorithms used in this work are included in the packages: randomForest [57], e1071 [70], kknn [71] and MASS [72].

Feature Ranking and Selection Methods
A successful approach in machine learning is to view feature selection as a heuristic procedure in which a subset of possible features is specified at each step of an iterative search [73]. Such a procedure involves three steps:
(1) Ranking all features in accordance with a criterion related to their relevance to classify the dataset.
(2) Iteratively improving a classification model by adding features according to their rank.
(3) Selecting the best feature subset according to a classification accuracy measurement.

Feature Ranking
Four feature ranking criteria were used: average correlation, maximum correlation, Jeffries-Matusita distance and mean decrease in the Gini index. In addition to these four ranking methods, a random ranking was used to compare the results.
Average correlation is a simple approach, in which the relevance of a feature is related to its correlation with other features. Highly-correlated variables are considered to carry redundant information, since they only contribute to classification complexity without adding greater discrimination power. The correlation matrix (using the Pearson correlation coefficient in this case) is used to calculate the redundancy of each feature as its average correlation coefficient with the other features. After eliminating the feature with the highest average correlation, the correlation matrix is recalculated, and the procedure continues. The feature eliminated in each iteration is stored, and finally, a vector of features in ascending order of importance is obtained.
Maximum correlation is very similar to the above criterion; the only difference being that the criterion for choosing the feature to be deleted in each cycle is the maximum rather than the average correlation.
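Both correlation-based rankings can be sketched with one routine that repeatedly removes the most redundant feature. Using the absolute value of the Pearson coefficient is an assumption of this example, as are the function names and the toy features; passing `max` as the summary function gives the maximum correlation variant. The sketch assumes that no feature is constant, since the coefficient is undefined in that case.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def average(values):
    return sum(values) / len(values)

def rank_by_correlation(features, summary=average):
    """Return feature names in ascending order of importance.

    features maps each name to its list of values. At every step, the
    feature with the highest summarized correlation with the remaining
    features is removed and the correlations are recomputed.
    """
    remaining = dict(features)
    removed = []
    while len(remaining) > 1:
        scores = {}
        for name, column in remaining.items():
            correlations = [abs(pearson(column, other))
                            for other_name, other in remaining.items()
                            if other_name != name]
            scores[name] = summary(correlations)
        worst = max(scores, key=scores.get)
        removed.append(worst)
        del remaining[worst]
    removed.extend(remaining)  # the last survivor is the most important
    return removed

# Hypothetical features: "a" and "b" are nearly collinear, "c" is not.
features = {"a": [1, 2, 3, 4, 5],
            "b": [2, 4, 6, 8, 10.5],
            "c": [5, 1, 4, 2, 3]}

ranking_average = rank_by_correlation(features)
ranking_maximum = rank_by_correlation(features, summary=max)
```

Note that this criterion never looks at the class labels, so it measures redundancy among features rather than their ability to separate classes.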
The Jeffries-Matusita distance between each pair of classes measures a feature's average capacity to separate classes. Features are ranked according to this distance, whose equation is:

$$JM_{ab} = 2\left(1 - e^{-Bh}\right)$$

where JM_ab is the distance between the two classes being compared (a and b) and Bh is the Bhattacharyya distance [51]:

$$Bh = \frac{1}{8}\left(\bar{X}_a - \bar{X}_b\right)^{2} \frac{2}{S_a^{2} + S_b^{2}} + \frac{1}{2} \ln\left(\frac{S_a^{2} + S_b^{2}}{2\, S_a S_b}\right)$$

where $\bar{X}_a$ is the average of the analyzed feature for class a, and $S_a^2$ the variance of the feature for class a. The Jeffries-Matusita distance values range from zero to two: zero means that the classes cannot be separated using the feature being analyzed, and two means full separability. The average of all inter-class distances is the feature separability. This ranking method assumes the normal distribution of the variables in each of the classes.
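A minimal sketch of this separability measure for a single feature follows. It assumes the usual univariate normal form of the Bhattacharyya distance, computed from the class means and variances, and the relation JM = 2(1 - exp(-Bh)); the function names and the class statistics are hypothetical.

```python
import math
from itertools import combinations

def bhattacharyya(mean_a, var_a, mean_b, var_b):
    """Bhattacharyya distance between two univariate normal distributions."""
    term_means = 0.125 * (mean_a - mean_b) ** 2 * 2.0 / (var_a + var_b)
    term_vars = 0.5 * math.log((var_a + var_b) / (2.0 * math.sqrt(var_a * var_b)))
    return term_means + term_vars

def jeffries_matusita(mean_a, var_a, mean_b, var_b):
    """Jeffries-Matusita distance, bounded between 0 and 2."""
    return 2.0 * (1.0 - math.exp(-bhattacharyya(mean_a, var_a, mean_b, var_b)))

def feature_separability(class_stats):
    """Average JM distance over all pairs of classes.

    class_stats maps each class name to a (mean, variance) tuple.
    """
    pairs = list(combinations(class_stats, 2))
    return sum(jeffries_matusita(*class_stats[a], *class_stats[b])
               for a, b in pairs) / len(pairs)

# Hypothetical (mean, variance) of one feature in three classes.
stats = {"Cer": (0.20, 0.01), "Oli": (0.60, 0.02), "Gre": (0.90, 0.01)}
separability = feature_separability(stats)
```

Features would then be sorted by `separability`, from the highest value (best average class separation) to the lowest.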
The mean decrease in the Gini index is a feature importance statistic produced by random forest by averaging over the individual trees [54]. The importance of a variable in a tree is measured as the sum of the decrements in the Gini index attributed to that feature. The global feature importance is the average of its importance for all of the trees.
The mean decrease in the Gini index ranking method is obtained by running as many classification cycles as there are features available. In each cycle, MDG is calculated, and the feature with the lowest MDG value is eliminated from the dataset. Thus, the order in which each feature is removed gives a rank of its importance.

Iterative Classification and Accuracy Criterion
For each classification algorithm and feature ranking method, all of the features were used to train the algorithm; the kappa index [36] was then calculated from the validation sample, and the least important feature was eliminated from the dataset. This procedure was repeated until only one feature was left. A line was then drawn for each classification method in a graph whose horizontal axis is the number of features used in each cycle and whose vertical axis is the kappa value obtained (Figures 3 and 4). A visual analysis of the curves was sufficient to locate the number of features that maximizes accuracy.
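The iterative procedure can be sketched as a loop that evaluates the kappa index for successively smaller feature subsets. In the code below, `kappa` computes Cohen's kappa from a confusion matrix, while `accuracy_curve` and the scoring function `toy_evaluate` are hypothetical: in a real run, the evaluation step would train the classifier on the given features and return the kappa index of the validation sample. The loop runs one cycle fewer than the number of features, stopping when a single feature is left.

```python
def kappa(confusion):
    """Cohen's kappa index from a square confusion matrix (list of rows)."""
    size = len(confusion)
    n = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(size)) / n
    expected = sum(sum(confusion[i]) * sum(row[i] for row in confusion)
                   for i in range(size)) / n ** 2
    return (observed - expected) / (1 - expected)

def accuracy_curve(ranking, evaluate):
    """Map each subset size to the kappa obtained with that subset.

    ranking lists the features from least to most important; the least
    important remaining feature is dropped after every cycle.
    """
    curve = {}
    for start in range(len(ranking) - 1):
        subset = ranking[start:]
        curve[len(subset)] = evaluate(subset)
    return curve

def toy_evaluate(subset):
    """Hypothetical kappa that peaks when three features are used."""
    return 0.9 - 0.02 * abs(len(subset) - 3)

ranking = ["f1", "f2", "f3", "f4", "f5", "f6"]  # hypothetical feature names
curve = accuracy_curve(ranking, toy_evaluate)
best_size = max(curve, key=curve.get)
example_kappa = kappa([[45, 5], [10, 40]])
```

Selecting `best_size` programmatically plays the role of the visual inspection of the curves; the features kept are the last `best_size` entries of the ranking.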
Because wk-NN classification is substantially slower than the others, only 199 classification cycles were run using the 200 most important features. Linear discriminant analysis and naive Bayes only use quantitative features, which means 333 features and, accordingly, 332 classification steps. Random forest and SVM were run with the complete dataset of 356 features. Because all of these classification cycles were replicated for the five ranking methods (including the random ranking), the total number of classifications was 7865.
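The total can be checked with a short calculation. The sketch assumes that random forest and SVM, like the other algorithms, ran one cycle fewer than the number of features available (355 cycles for 356 features); this figure is not stated explicitly but is the one that reproduces the reported total.

```python
# Features available to each algorithm, as stated in the text.
features_per_algorithm = {
    "wk-NN": 200,
    "linear discriminant analysis": 333,
    "naive Bayes": 333,
    "random forest": 356,
    "SVM": 356,
}

# Assumption: each algorithm runs (number of features - 1) cycles.
cycles = {name: n - 1 for name, n in features_per_algorithm.items()}

n_ranking_methods = 5  # four ranking methods plus the random ranking
total_classifications = sum(cycles.values()) * n_ranking_methods

print(total_classifications)
```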

Feature Selection
Tables 3 and 4 summarize the results of the feature ranking process. Table 3 shows the 40 most important features according to each ranking method, while Table 4 shows the kind of features included in different subsets: the 40 most important features according to each ranking method, that is, a summary of Table 3, and the optimal subsets using MDG in each of the classification algorithms.
Table 3. Ranking of the 40 most relevant features according to each ranking method. Features that were calculated with more than one of the original bands are followed by a colon and the band that was used. In textural features, the direction is indicated between parentheses.
The features calculated within the objects seem to be the most relevant, especially in the case of the MDG ranking method. Geometry features do not appear among the 40 most important features for MDG; textural features do not appear among the 40 most important according to the Jeffries-Matusita distance ranking method; finally, border features do not appear within the subsets obtained using the correlation methods. Although these results are interesting and should be analyzed in the future, it is not the objective of this study to enter into these details.
Table 4. Summary of the feature types in different optimal subsets. Each subset includes three rows: the first shows the number of features of each type; the second row shows the percentage of each feature type with respect to the features included in the subset; and the third row shows the percentage of each feature type with respect to the number of features of that type. MDG, mean decrease in the Gini index.
The features selected using the average correlation method do not include any spectral variable, only three features related to skewness and one related to the IHS transformation. Most of the features drawn by this method are related to shape, context and texture.
On the other hand, the methods based on Jeffries-Matusita distance and MDG included several pairs of closely-correlated features in the first rankings. This was expected in the case of Jeffries-Matusita distance, because if a feature provides large separability, all of the features correlated with it will show similar separability.

Classification
The kappa statistic is calculated from a set of confusion matrices with 14 classes. These classes include the original classification scheme, but with adult trees separated from seedlings in irrigated fruit trees and olive trees and mature irrigated grassland separated from plants in an early stage of development. The final new classification scheme also includes shadows (Shd) cast by objects in the validation areas (usually trees and buildings).
Figures 3 and 4 show the result of the kappa index in several classification cycles using every ranking method and classification algorithm. These figures should be read from right to left, since a backward elimination was performed. The random forest classification method provided the highest accuracy, especially using features ranked by the MDG index. Figure 3 shows no significant changes until the curve reaches the 50th feature, meaning that random forest is quite insensitive to the presence of redundant or noisy features, i.e., to the Hughes effect. This is because random forest does not overfit the model to the calibration data.
Another ranking method that attains high accuracy with random forest is maximum correlation; however, with fewer than 75 features, the results are inferior to those obtained with MDG, especially when fewer than 30 features are used. Using about 50 features, the kappa indices of MDG and maximum correlation are virtually the same. Furthermore, the downward slope of the kappa index between 50 and 10 is greater in the maximum correlation method than in MDG. The average correlation gives very similar values to the previous two methods when more than 140 features are used, while if fewer features are included, there is a substantial decrease in accuracy. As expected, the worst results were obtained when features were ranked randomly.
MDG also reaches a high accuracy value when using the wk-NN classification algorithm. A kappa value of 0.69 is reached with 46 features, which can be considered the most appropriate number of features to be used with wk-NN. This classification method is, however, more sensitive to the presence of unnecessary features. Using the first 200 features, the kappa index is about 0.60, which is significantly different compared with the value obtained with the first 40 features. According to these data, this classification algorithm seems to be more sensitive to the Hughes effect than random forest. If wk-NN is the only algorithm to be used for classification, a major effort in feature selection will be needed.
When using the Jeffries-Matusita distance, a kappa value of only 0.60 is reached in the best case. Moreover, the curve does not show any increase in accuracy when the least important features are eliminated. The maximum correlation results are also rather weak, below 0.55; however, this ranking method seems to be able to remove unnecessary features in some sections of the curve, such as the stretch between 85 and 50 features, in which there is an increase in accuracy. Whatever the case, this method cannot be considered to provide acceptable results. Average correlation also gives very poor results; analyzing the form of the curve, it seems that important features are eliminated, and, as a result, dimensionality reduction provides no increase in accuracy.
When analyzing SVM, the highest kappa index is obtained with MDG and 29 features. The sensitivity to high dimensionality is not very great, although it is greater than in the case of random forest. In the case of MDG, the kappa index remains stable between 360 and 200 features, while from 200 to 125 features, it increases and remains stable until 30 features are reached, when there is a sharp drop. The rest of the ranking methods have much lower kappa values. The curves obtained from maximum correlation and Jeffries-Matusita distance are very similar, while the average correlation curve is quite different.
With naive Bayes, again, MDG provides the highest accuracy, although its highest kappa value is 0.63, well below that reached by random forest or SVM. However, this accuracy peak is reached with only 11 features, creating quite a parsimonious model. It is very sensitive to the Hughes effect: with 337 features, the kappa values are very low, around 0.4. In the case of MDG, the kappa index increases steadily as features are eliminated, and there are several points on the graph where the elimination of only one or two features causes a particularly sharp increase in classification accuracy. The other methods provide very weak results in all cases, with kappa values under 0.5 throughout the curve.
The general performance of linear discriminant analysis is much lower than that of random forest, wk-NN and SVM. This algorithm, too, obtains the best results with MDG, although, in this case, the differences from the other ranking methods are smaller, especially average separability. The number of features in the optimal subset with MDG (65) is much higher than in the case of naive Bayes. The curve remains almost stable until the 90th feature, when there is a slight increase in accuracy. It is below the 50th feature that accuracy rates begin to fall very sharply.
Table 5. Omission (OE) and commission (CE) errors in each class, kappa (K) and accuracy (F) of each classification method, using, in each case, the optimal feature set whose size is shown in the last column. In all cases, the highest accuracy was reached with the optimal subset drawn by MDG. Shd, shadow.
Table 5 summarizes the confusion matrices of the five classification methods using the MDG ranking method (always the one that provides maximum accuracy) and the feature set that produced the greatest accuracy. The table shows omission and commission errors in each class, kappa, accuracy and the number of features required to reach the maximum kappa index. Random forest and SVM produced the best results, the former being slightly better. SVM needed fewer features than random forest to reach the maximum accuracy; however, Figure 3 shows that random forest results have a large sill, which means that even with substantially fewer features, accuracy is almost the maximum possible. Interestingly, naive Bayes reaches its maximum accuracy with just 11 features, although its kappa and accuracy are much smaller than those corresponding to random forest and SVM.
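The per-class omission and commission errors reported in a table of this kind can be derived from a confusion matrix as in the sketch below. It assumes that rows hold the reference (validation) classes and columns the predicted classes; the orientation used in the study is not stated here, and swapping it would swap the two error types.

```python
def class_errors(matrix):
    """Per-class omission and commission errors plus overall accuracy.

    Assumes matrix[r][c] counts cases of reference class r that were
    assigned to class c (rows = reference, columns = prediction).
    """
    n_classes = len(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[r][c] for r in range(n_classes)) for c in range(n_classes)]

    # Omission: reference cases of a class that were assigned elsewhere.
    omission = [
        1.0 - matrix[i][i] / row_totals[i] if row_totals[i] else 0.0
        for i in range(n_classes)
    ]
    # Commission: cases assigned to a class that belong elsewhere.
    commission = [
        1.0 - matrix[i][i] / col_totals[i] if col_totals[i] else 0.0
        for i in range(n_classes)
    ]
    accuracy = sum(matrix[i][i] for i in range(n_classes)) / sum(row_totals)
    return omission, commission, accuracy
```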
As regards parameter optimization, except for the number of neighbors in the wk-NN algorithm, modifying the default parameters brings about only small improvements in accuracy. In the case of wk-NN, the use of the Euclidean distance rather than Manhattan distance produced a substantial decrease in accuracy. Accuracy increases with the number of neighbors, reaching a peak at k = 19. More neighbors were not tested, because the increase from k = 17 to k = 19 was very modest, and using k = 19 and Manhattan distance led to a kappa value of 0.69.
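To make the two parameters discussed above concrete, the following sketch implements a simple weighted k-nearest neighbors vote with a selectable distance. It is only an illustration: the inverse-distance weighting is an assumption of this sketch and is not necessarily the kernel used by the wk-NN implementation in the study, and all names are hypothetical.

```python
from collections import defaultdict


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def euclidean(p, q):
    """Straight-line distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def weighted_knn_predict(train, query, k, distance=manhattan):
    """Classify query by an inverse-distance weighted vote of its k neighbors.

    train is a list of (feature_vector, label) pairs.
    """
    neighbors = sorted(
        ((distance(x, query), label) for x, label in train),
        key=lambda pair: pair[0],
    )[:k]
    votes = defaultdict(float)
    for dist, label in neighbors:
        # Closer neighbors weigh more; the small constant avoids division by zero.
        votes[label] += 1.0 / (dist + 1e-9)
    return max(votes, key=votes.get)
```

Switching between the two metrics only requires passing `distance=euclidean`, and the number of neighbors is controlled by `k`.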
As an example, Figure 5 shows a subset of the results obtained by each classification method, using in each case the proposed optimal feature set.

Discussion
Laliberte et al. [25] compare the Jeffries-Matusita distance and MDG to select features for use with the wk-NN classification algorithm. Although both ranking methods, along with the other two, were used in our research, the results are not completely comparable, because in the above-mentioned work, the number of features to be selected was fixed a priori, whereas in our study, the optimal number of variables to be used in each algorithm was one of the outcomes of the ranking and selection process. While the results of the cited work suggest that the results obtained with the Gini index are very similar to those obtained with the Jeffries-Matusita distance, the results obtained in this study lead to a different conclusion: that better results are obtained using the MDG ranking method whichever classification algorithm is tested.
Pal [6], Duro et al. [10], Löw et al. [15] and Ghosh and Joshi [16] obtained similar results to us. In our study, SVM needed fewer features than random forest to reach the maximum accuracy, as was observed by Ghosh and Joshi [16]. In our work, as occurred in Löw et al. [15], using SVM to classify with the whole feature set produces a higher error than when using the optimal subset in accordance with the MDG ranking method. The classification results of Pal [6] are very similar to those obtained in the present work. By contrast, Ghosh and Joshi [16] found that SVM performed better than random forest, with a difference in favor of the first of 0.05 in kappa. Since the difference was slight, bearing in mind that both studies are quite comparable, we can conclude that, in terms of classification accuracy, both algorithms provide very similar results. Interestingly, naive Bayes reaches its maximum accuracy with just 11 features, although its kappa and accuracy are much smaller than those corresponding to random forest and SVM.
As regards parameter optimization, except for the number of neighbors in the wk-NN algorithm, modifying the default parameters brings about only small improvements in accuracy. In the case of wk-NN, the use of Euclidean distance rather than Manhattan distance led to a substantial decrease in accuracy. More neighbors were not tested because the increase from k = 17 to k = 19 was very modest and using k = 19 and Manhattan distance gave a kappa value of 0.69.
We agree with Pal [6] that the need for only two parameters to be set and the lack of sensitivity to these parameters are clear advantages of using random forest. For example, no important effect was observed when the number of trees was doubled or halved compared with the default value. Similar results were obtained when modifying the mtry parameter. In this respect, our empirical results also coincide with those given by Duro et al. [10].
Another advantage of random forest is that, along with linear discriminant analysis, it is insensitive to the Hughes effect. In the case of random forest, similar results were found by Duro et al. [10]. However, other authors, like Guan et al. [11], obtained different results. In that work, decreasing from 48 to 10 features led to a significant improvement in classification accuracy (from kappa = 0.6 to kappa = 0.8). We think that the cause of these differences is two-fold: first, the study areas in [11] are very small; second, the ntree parameter in [11] was 100, considerably lower than the recommended value that we used in our study (500). SVM has low sensitivity to the Hughes effect, while wk-NN and, especially, naive Bayes are very sensitive.

Conclusions
Regarding the classification algorithms, random forest and SVM have provided the highest classification accuracy, followed by wk-NN. On the other hand, the results of naive Bayes and linear discriminant analysis are less accurate, with kappa indices around 10 percentage points lower than those obtained with random forest or SVM.
It was to be expected that the MDG feature ranking method would obtain a good result with random forest, because both methods are fairly closely related. However, MDG obtained the highest accuracy with all classification algorithms. This consistency strongly suggests that MDG can be considered as one of the best options for ranking features.
Another advantage of random forest is that, along with linear discriminant analysis, it is insensitive to the Hughes effect. SVM has low sensitivity, while wk-NN and, especially, naive Bayes are very sensitive.
Random forest and SVM obtain the highest accuracy with the default parameters, which is an advantage over other classification methods, such as wk-NN, which need to be calibrated. In fact, the accuracy obtained with wk-NN increased with the number of neighbor cases used to classify. However, this increase in accuracy was still small.
Following the reasoning of Laliberte et al. [25], because our comparison of feature selection methods was based on the same segmentation, it is reasonable to assume that the differences in classification accuracy can be attributed to the feature selection methods and not to the prior segmentation step.
In summary, random forest with features selected using MDG was the most suitable algorithm for classifying the analyzed image. If only one classification method is used, it should be random forest, because it provides the greatest accuracy values and does not really need feature selection, unless the number of features available is too large for the computing power available.
The results also allowed an improvement in land use classification, which was the main objective of analyzing this image.
Object-oriented analysis represents an advantage over the pixel-based approach because of the substantial reduction in the number of cases. Segmentation can be regarded as a necessary step for remote sensing imagery classification when using cutting-edge machine learning techniques with large computing requirements. We agree with Ghosh and Joshi [16] when they say that "The extensive use of open source R statistical software allows free and easy access of the framework under study to researchers and users across the world."

Figure 3. Kappa indices obtained with three classification algorithms: random forest, weighted k (wk)-NN and SVM and the five ranking methods.

Figure 4. Kappa indices obtained with two classification algorithms: naive Bayes and linear discriminant analysis and the five ranking methods.

Figure 5. Example of the land cover maps obtained with the five classification methods using the optimal feature set obtained with the method based on the Gini index. (a) RF-Gini i., 78 features; (b) wk-NN-Gini i., 46 features; (c) SVM-Gini i., 33 features; (d) nB-Gini i., 11 features; (e) LDA-Gini i., 65 features; and (f) Z/I-Imaging DMC image with 45-cm spatial resolution.
Table 1 also summarizes the distribution of the training and validation areas in the study area.

Table 2. Summary of the calculated object features [47]. Some of the features are calculated from all of the original bands and others from just one. Textural features are calculated for several directions. The total number of features appears in parentheses. DTM, digital terrain model; DSM, digital surface model.