Natural Forest Mapping in the Andes ( Peru ) : A Comparison of the Performance of Machine-Learning Algorithms

The Andes mountain forests are sparse relict populations of tree species that grow in association with local native shrubland species. The identification of forest conditions for conservation in areas such as these is based on remote sensing techniques and classification methods. However, the classification of Andes mountain forests is difficult because of noise in the reflectance data within land cover classes. The noise is the result of variations in terrain illumination resulting from complex topography and the mixture of different land cover types occurring at the sub-pixel level. Considering these issues, the selection of an optimum classification method to obtain accurate results is very important to support conservation activities. We carried out comparative non-parametric statistical analyses on the performance of several classifiers produced by three supervised machine-learning algorithms: Random Forest (RF), Support Vector Machine (SVM), and k-Nearest Neighbor (kNN). The SVM and RF methods were not significantly different in their ability to separate Andes mountain forest and shrubland land cover classes, and their best classifiers showed a significantly better classification accuracy (AUC values 0.81 and 0.79 respectively) than the one produced by the kNN method (AUC value 0.75) because the latter was more sensitive to noisy training data.


Introduction
The Andean region's complex topography and altitudinal range comprise an environmental gradient that contains a variety of ecosystems, vegetation communities, and forest formations.Forests of Escallonia, Myrcianthes, and Polylepis are located in the inter-Andean valleys and the High Andes (1800-4800 m a.s.l.) and form an important ecosystem that is the habitat for endemic fauna and flora at high altitudes [1].These forests have degraded and fragmented tree populations that are currently considered vulnerable due to anthropogenic pressures (e.g., fuelwood exploitation, overgrazing, and fire) [2,3].In addition, these forests have a limited distribution and are exposed to an arid climate.All these factors contribute to the significant ecological and biogeographic importance of Andes mountain forests [4] and, therefore, suitable forest management and conservation practices are absolutely required in these areas.
Assessment of current forest conditions is important for forest conservation, and remote sensing techniques are widely used for land cover mapping.Because the results of this type of mapping vary depending on the classification technique used to create the maps, the selection of appropriate techniques is critical to obtain reliable results.
Andes mountain forests have a low percentage crown cover as compared to Amazon tropical forest.They grow in a semi-arid environment in association with local shrub species, which are also dominant in wide areas.When working with mid-resolution (pixel size of 2 m to 30 m) satellite images of this forest type, most of the pixels actually contain more than a single land cover class, such as soil and shrubs.Consequently, the data obtained by satellite sensors are a mixture of the reflected radiance of different land cover types.Studies have shown that this issue is common when mapping vegetation in semi-arid regions [5,6].One approach to overcome the problem is to obtain high-resolution satellite data such as GeoEye-1 (50 cm), WorldView (50 cm), QuickBird (60 cm), or IKONOS (1 m) scenes.Because of the relatively high cost and computational capacity needed to use such an approach, the satellite dataset chosen by most developing countries for forest monitoring and conservation planning is the Landsat 8 OLI dataset with a pixel resolution of 30 m, which provides periodic global coverage and is provided without cost by the U.S. Geological Survey.
It is very difficult to find fully homogeneous land cover pixels (also termed endmembers) in the field, especially for rare and degraded vegetation types growing in areas that are hard to access.In some cases, spectral mixture analysis has been applied to calculate land cover percentages on a sub-pixel level [7,8].In addition, a rugged topography reduces the accuracy of land cover classification in complex terrain because it produces variations in surface illumination between shaded areas and those receiving direct sunlight.As a result, the reflectance values of land cover vary greatly within classes.Topographic correction analysis can be performed to reduce this effect; nevertheless, it cannot eliminate the effect completely [9].Thus, because terrain complexity and the co-existence of trees with shrub species and soil at the sub-pixel level introduce noise to the training data, the accurate classification of Andes mountain forest remains difficult.
One approach to overcome these issues is to use advanced classification methods based on learning algorithms that have adjustable parameters and can process high-dimensional data to avoid overfitting.Non-parametric classifiers, such as machine-learning algorithms (MLAs), have been used to map vegetation growing in mountains because they have good potential for accurately classifying natural land cover types [10,11].There is no need to assume that the data are normally distributed with MLAs; hence, it is possible to include non-spectral ancillary data in the classification process [12] and produce better classification results in complex landscapes.In addition, this method is robust when analyzing noisy training data.
The most commonly used MLAs are the Random Forest (RF), Support Vector Machine (SVM), and k-Nearest Neighbor (kNN) algorithms.Many mapping studies have been conducted in Amazon tropical forests using advanced classification algorithms with a corresponding performance comparison analysis; for example, Paneque-Gálvez et al. compared parametric (maximum-likelihood), non-parametric (SVM and kNN), and the Max-Min Hill-Climbing algorithms [13].RF was used to determine the area of closed canopy tropical forests for forest carbon estimation [14].In addition, there are forest cover studies for tropical Andes regions that used classification methods such as supervised decision trees [15], logistic regression [16], maximum-likelihood, and spectral unmixing [17].However, there are few studies on the application of more recently developed and advanced classification methods for classifying and mapping Andes mountain forests.
In this study, we carried out a comparative analysis of the performance of the RF, SVM, and kNN methods with different combinations of spectral data from Landsat 8 and topography-derived variables selected by correlation analysis to find the best classification model for mapping Andes mountain forests.

Study Area
The Andes is a continuous mountain range with elevations between 2000 and 6000 m a.s.l.(average = 4000 m a.s.l.) [18] and a steeply sloped topography (60 • on average).In these geographical conditions, changes in climatic conditions such as temperature occur over relatively short horizontal distances.Moreover, the mountains act as a barrier to the mass of water vapor entering from the Atlantic side toward the Pacific side of South America.Thus, the east-facing slopes have high levels of humidity and precipitation, whereas the west-facing slopes are semi-arid.These differences in altitude, temperature, and humidity lead to distinct habitats and vegetation communities.On the semi-arid western side of the Andes, forests are present in high and mid altitude areas [19]; we refer to these as Andes mountain forests.
The Andes mountain forests are relict populations of Polylepis spp., Escallonia spp., and Myrcianthes spp., which commonly grow in association with tall (up to 2 m) local native shrubland species.Most of these forests currently exist in remote and inaccessible areas and in protected areas managed by government agencies to prevent their degradation and deforestation.Degradation and deforestation have taken place since the times of the Inca Empire [20].By the end of the 15th century, the population in the Andes was 6-12 million, and the Andes mountain forests were exploited for various purposes.During the Spanish colonial period in the 16th century, residents exploited forest with greater intensity, as the demand for timber and fuel increased for mining and colonial construction.
The region now consists of a highly fragmented mosaic of different land cover types such as shrubland, natural pasture, agricultural land, and man-made forest plantations of Eucalyptus globulus and Pinus spp. to meet the local communities' demand for fuelwood, aside from the endemic tree species that originally made up the forests of the Andes mountains.
We selected a study site that showed a typical land cover mosaic using a Landsat 8 image.The scene (path 4, row 69) is located in Cuzco, Peru, with an area of 170 km × 185 km (Figure 1).

Study Area
The Andes is a continuous mountain range with elevations between 2000 and 6000 m a.s.l.(average = 4000 m a.s.l.) [18] and a steeply sloped topography (60° on average).In these geographical conditions, changes in climatic conditions such as temperature occur over relatively short horizontal distances.Moreover, the mountains act as a barrier to the mass of water vapor entering from the Atlantic side toward the Pacific side of South America.Thus, the east-facing slopes have high levels of humidity and precipitation, whereas the west-facing slopes are semi-arid.These differences in altitude, temperature, and humidity lead to distinct habitats and vegetation communities.On the semi-arid western side of the Andes, forests are present in high and mid altitude areas [19]; we refer to these as Andes mountain forests.
The Andes mountain forests are relict populations of Polylepis spp., Escallonia spp., and Myrcianthes spp., which commonly grow in association with tall (up to 2 m) local native shrubland species.Most of these forests currently exist in remote and inaccessible areas and in protected areas managed by government agencies to prevent their degradation and deforestation.Degradation and deforestation have taken place since the times of the Inca Empire [20].By the end of the 15th century, the population in the Andes was 6-12 million, and the Andes mountain forests were exploited for various purposes.During the Spanish colonial period in the 16th century, residents exploited forest with greater intensity, as the demand for timber and fuel increased for mining and colonial construction.
The region now consists of a highly fragmented mosaic of different land cover types such as shrubland, natural pasture, agricultural land, and man-made forest plantations of Eucalyptus globulus and Pinus spp. to meet the local communities' demand for fuelwood, aside from the endemic tree species that originally made up the forests of the Andes mountains.
We selected a study site that showed a typical land cover mosaic using a Landsat 8 image.The scene (path 4, row 69) is located in Cuzco, Peru, with an area of 170 km × 185 km (Figure 1).

Land Cover Types and Forest Definition
The land cover classification used in this study is based on the class list that the Ministry of Agriculture (MINAGRI) of Peru developed for its National Forest Inventory.These land cover types include inland surface water bodies, bare land, infrastructure/towns, agricultural land, natural pasture, wetland, shrubland, forest plantations, Andes mountain forest, and Highland Amazon forest.

Land Cover Types and Forest Definition
The land cover classification used in this study is based on the class list that the Ministry of Agriculture (MINAGRI) of Peru developed for its National Forest Inventory.These land cover types include inland surface water bodies, bare land, infrastructure/towns, agricultural land, natural pasture, wetland, shrubland, forest plantations, Andes mountain forest, and Highland Amazon forest.
The Andes mountain forest in this study was defined by MINAGRI as land with a tree crown cover of more than 10%, an area of more than 2 ha, and a minimum tree height of 2 m at maturity.

Landsat 8 OLI Satellite Data and Digital Elevation Model (DEM) Data
We selected four scenes with the minimum cloud coverage rate that were taken in the dry season (between April and September; Table 1).The data have a resolution of 30 m per pixel, and the spectral data used included three bands of the visible spectrum (Bands 2, 3 and 4), the near infrared band (Band 5), and two short wavelength infrared bands (Bands 6 and 7).These scenes were atmospherically and topographically corrected using the ATCOR 3 module of the ERDAS IMAGINE remote sensing application and free DEM data provided by the Space Shuttle Radar Topography Mission.The algorithm used by ATCOR 3 adjusts reflectance values based on the values of sun elevation, azimuth, and parameters related to atmospheric characteristics.Due to the complex topography of the study area, the ATCOR 3 algorithm produced small patches of overcorrected reflectance.To create good correction models in the normalization process, we manually masked such areas to use only coherent reflectance values from the scenes.

Collection of Land Cover Data
Field surveys of land cover and forest condition were conducted in August, October, and November 2015, and June, July and December 2016.Information collected included longitude and latitude of the location, vegetation type, dominant tree species, forest condition, and forest cover rate.Photographs were also taken of the survey sites.Each chosen survey site corresponds to a land cover patch of approximately 2 hectares.
As an indirect method of land cover data collection, we used Collect Earth (CE), a free open source software developed by the Food and Agriculture Organization.CE allows the collection of land cover information by visual interpretation of satellite data from locations determined by a regular sampling grid produced with GIS software.These plots are superimposed over free high-resolution satellite imagery available from Google Maps, and experts with experience in the field and in photo interpretation of vegetation visually identified the land cover within the plots.We used a sampling grid design, with a separation distance of 4 km.Instead of the rectangular plots used by CE, we used the polygons produced by the object segmentation analysis in Section 2.3.1 located at each node of the grid and with an average size of 2 hectares.We collected land cover data from 1368 locations with the guidance of local forestry experts from the National Forest and Wildlife Service of Peru in conjunction with the CE data (Table 2).2.3.Methods

Preprocessing
Figure 2 shows the flow of analysis.One problem with land cover mapping using remote sensing in mountainous terrain is the presence of sunlit and shadowed slopes produced by the steep topography.Topographic correction is necessary to reduce this effect.
Remote Sens. 2018, 10, x FOR PEER REVIEW 5 of 20 locations with the guidance of local forestry experts from the National Forest and Wildlife Service of Peru in conjunction with the CE data (Table 2).

Preprocessing
Figure 2 shows the flow of analysis.One problem with land cover mapping using remote sensing in mountainous terrain is the presence of sunlit and shadowed slopes produced by the steep topography.Topographic correction is necessary to reduce this effect.After conducting atmospheric and topographic correction for the scenes (P2) and masking of clouds, cloud shadows, and pixels with overcorrected reflectance (P3 and P4), we normalized the satellite data.We conducted this process on pairs of overlapping scenes using the Iteratively Reweighted Multivariate Alteration Detection (IR-MAD) algorithm (P5) [21,22] to find pseudoinvariant features that can compute a modified model [23].The size of sample areas was 10 km × 10 km, and the sample areas included deep water bodies, concrete, and/or bare soil, aside from forest areas.This method is based on canonical correlation analysis (CCA) and is an unsupervised change detection algorithm that is invariant to linear transformations of the original data.In this method, After conducting atmospheric and topographic correction for the scenes (P2) and masking of clouds, cloud shadows, and pixels with overcorrected reflectance (P3 and P4), we normalized the satellite data.
We conducted this process on pairs of overlapping scenes using the Iteratively Reweighted Multivariate Alteration Detection (IR-MAD) algorithm (P5) [21,22] to find pseudo-invariant features that can compute a modified model [23].The size of sample areas was 10 km × 10 km, and the sample areas included deep water bodies, concrete, and/or bare soil, aside from forest areas.This method is based on canonical correlation analysis (CCA) and is an unsupervised change detection algorithm that is invariant to linear transformations of the original data.In this method, CCA creates linear combinations of the pixel values for each of the spectral bands of the two scenes.Each pair of linear combinations are called canonical variates (CVs), and the number of CVs is equal to the number of bands.The first pair of CVs has maximal correlation, the second pair of CVs has the second highest correlation and is orthogonal to the first pair of CVs, and so on.Then, the process to determine the difference between the CVs is carried out to produce a sequence of transformed difference images called MAD variates that records the maximum spread (or maximum change) of the pixel values.The sequence generates the same number of MAD variates as the number of bands used.From these different images, we select all pixels with minimum or no change (called Pseudo-Invariant Features, PIFs) that satisfy Formula (1): where N is the number of MAD variates, σ is the variance of the no-change distribution, and t is a decision threshold value.In the absence of change, the sum of the squares of standardized MAD variates is approximately chi-squared distributed with N degrees of freedom.The value of t is defined as: where P is the probability of observing a value lower than t.The process uses P as weight for the observations, and the whole process is iterated until a stopping criterion is met.In this study, we used a fixed number of iterations equal to 50.With this process, we selected PIFs to carry out linear regressions to produce linear equations for each band to normalize the reflectance values of the scenes.Finally, we created cloud-free normalized mosaics by overlapping the normalized scenes.
In contrast to pixel-based and unsupervised classification techniques, object-based image classification creates land cover maps that are easier to compare with reality.The object-based classification approach implies the creation of "objects" or "groups of pixels."For this purpose, we carried out a series of a segmentation analyses using eCognition [24].This process uses a region-merging algorithm starting with randomly selected pixel seeds that are distributed regularly [25].The pixels are then grouped with other pixels based on the homogeneity criteria to form polygons called objects.By using a series of tests with different scale values and evaluating changes in the average size of the created objects, we determined an optimum segmentation scale value to produce objects with an average area equal to the minimum defined forest area (2 ha).Then, we segmented the cloud-free normalized mosaic with the optimum scale value (P6) to produce the objects we used in the classification analysis.

Training Datasets and Verification Datasets
Table 2 shows the land cover data collected by the field survey and through visual interpretation using the CE tool at the study site.Because of the highly fragmented land use (landscape) in the area, it was difficult to collect enough data on Andes mountain forest using CE to build a training dataset.Therefore, we redistributed the available land cover data records of the Andes mountain forest land to complete the training dataset to a percentage of at least 70% of the total data.As a result, we randomly redistributed the 45 combined land cover data records for Andes mountain forest into 33 data records (73%) for the training dataset and 12 (27%) data records for the verification dataset.

Classification Variables and Variable Selection
The values of the variables used in this study were obtained by an object-based analysis, which means that we used the values of the pixels within each object to calculate the value of the corresponding variable for each object and used the average value in our analyses.The raster datasets used were the cloud-free Landsat 8 OLI mosaic and STRM 30-m DEM data described in Table 1.Using the longitude and latitude of the land cover data records in the training and verification datasets, we determined the objects corresponding to each land cover point and then extracted the calculated value of the variables from those objects using the statistical software R version 3.4 (http://www.cran.R-project.org).
The average reflectance values of the visible bands (blue, green, and red), the near infrared (NIR) band, and two shortwave infrared bands were derived from the spectral data.Additionally, we used the three first principal components produced by principal component analysis and tasseled cap transformation, both of which reduce data dimensionality, taking into account the variability of the data as much as possible.We also used the Normalized Difference Vegetation Index (NDVI) and the Modified Soil-adjusted Vegetation Index (MSAVI).MSAVI was calculated as: where ρ NIR and ρ red are the reflectance values in the NIR and red bands, respectively.This index has been used in vegetation studies in arid and semi-arid regions [26][27][28] because it reduces the soil background effect.We chose to use it among other soil-adjusted vegetation indexes because it can be used without any preliminary knowledge of the vegetation cover rate [29].
We included elevation and aspect (the downward direction of the slope in degrees) calculated from the DEM data as topography-derived variables in the classification analysis, considering the ecological relationships between vegetation species, altitude, and slope orientation.
One of the characteristics of SVM and RF algorithms is that they do not require feature selection [30].However, the performance of classifiers produced by kNN algorithm is affected by the presence of irrelevant or redundant features in the training data [31,32].Since we wanted to determine which among these three machine learning algorithms can produce the most accurate classifier for Andes mountain forest classification, we chose to make such a comparison using the same models produced after feature selection and using the same subsets of data produced with cross validation.Additionally, in order to take into account the full capacity of the SVM and RF algorithms to produce highly accurate classifiers, we introduced and tested classifiers produced with all the available variables.
The variables were subjected to a selection process involving two analyses.In the first analysis, we determined the relative importance of the variables by using the Akaike weight [33], which is based on the Akaike information criterion (AIC), taking into account that our classification analysis had a binary outcome: Andes mountain forest or shrubland.To do this, we built models using all possible combination of the variables and calculated their corresponding AIC values.Then, we calculated the AIC difference (∆ i ) between each model (AIC i ) and the model with minimum AIC (AIC min ): If R is the total number of models, then the Akaike weight (w i ) can be calculated for each model as follows: The relative importance of each variable was estimated by summing the Akaike weights of all the models in which the variable occurs.In the second analysis, we calculated the Pearson's correlation coefficient between each pair of variables.We chose the variables with the higher values of relative importance and discarded the correlated variables with the lower importance value.

Classification Methods
RF is a powerful MLA that is widely used to classify imagery data for land cover classification using multispectral satellite sensor imagery.The method performs well when the number of predictors is greater than the number of observations and has low sensitivity when the number of irrelevant predictors is large.SVM is another MLA used for classification to determine a hyperplane (or boundary in a high dimensional space) that can divide training data into a predetermined number of categories.This method is used in many remote sensing studies because of its capacity to process small training datasets [34].The kNN method is simple to implement and has a low training computational cost.This non-parametric method uses the k closest training data vectors to make predictions and has been used in forest inventory practices and as a tool for forest classification and mapping [35][36][37].
We carried out a comparative statistical analysis of the three MLAs using non-parametric statistical tests to compare the performances of all the considered models.

Random Forest
The RF method [38] is an ensemble of classification trees in which each tree contributes a unit vote to determine the most frequent class according to the input data (Equation ( 6)): where C m r f (x) is the predicted class from the RF classification of data record x, and C m i (x) is the predicted class from the classification tree m i of the data record x.Each classification tree is constructed using a bootstrap sample of 63.2% of the training data, while the rest of the data is considered out-of-bag (OOB) data.When forming a split point (node) in a tree, the algorithm randomly selects a sub-set of variables and searches among these variables for the best split point to classify the data.The number of variables in each sub-set is commonly denoted as m try .The performance of this algorithm depends on the availability of a sufficient number of trees (n trees ) to be generated to converge on the value of the OOB error and the number of variables randomly sampled as candidate variables in each node of the classification trees (m try ).
We used 4000 random decision trees.The OOB error is the average of the misclassification(s) rates computed from each sample of the OOB data when classified by all the trees constructed without such samples.We adjusted the value of m try by carrying out a series of classification tests with different values of this parameter and calculated the corresponding Cohen's kappa value.Then, we selected the value of m try that had the maximum Cohen's kappa value.

Support Vector Machine
SVM [39] is a non-parametric supervised statistical learning classifier that finds a hyperplane for optimal classification by minimizing the upper bound of the classification error.To use this method, we standardized the values of the variables in the training data.The method maximizes the distance from the data points of two classes (in the case of binary classification) to an optimal separation vector of a hyperplane created from the variables [40].The hyperplane is the surface used to determine the classification.
Given a training set (x i , y i ), where y i is the class label that takes the value of -1 or 1 and x i is the training vector of the values of the corresponding explanatory variables, a solution to the following optimization problem is needed: subject to where phi is a projection function of the training vector given by the kernel model used by the analyst; w and b are the adjustable weight and bias parameters, respectively; C is the penalty parameter of the error term ξ; l is the number of samples in the training dataset; and T denotes the transpose operator.
The parameters w and b in Equation ( 8) define the decision hyper-plane that separates the classes, and the minimization of Equation ( 7) aims to maximize the separation margin of the data.The projection function is related to a kernel function K by the following expression: We used the radial basis function kernel that depends on the value of the parameter γ in the following expression: where x i and x i' are two different training vectors of the values of the explanatory variables; x ij and x i'j are the values of the jth explanatory variable in the ith and i'th training vectors; and p is the total number of variables.The parameter γ defines the extent to which the impact of a single training example extends to determine the decision surface.On the other hand, C trades off misclassification of training data against the number of dimensions that the decision surface should have.These two parameters were adjusted using a grid search analysis; that is, the best decision hyper-plane of the largest Cohen's kappa value was calculated using different values of C and γ in a series of sequential tests.

k-Nearest Neighbors
kNN is a well-known nonparametric classification method that assigns a sample vector x to the class represented by the majority of k nearest neighbors whose similarity is determined by the distance measure.As with the SVM algorithm, the values of the variables in the training data need to be standardized to use this method.We used the Minkowski metric to define distance: where x i is the predictor vector of length p of observation i to be classified, and x j is the jth nearest neighbor.Euclidean distances can be determined by setting the value of q = 2.When k r is the number of nearest neighbors to the observation x i that belongs to class r, then: The algorithm assigns observation x i to the class r for which k r is the largest.We restricted the value of k to odd values to avoid the possibility of a tie between the numbers of neighboring training data samples of two different classes.Because the performance of this method depends on the value of k, we adjusted its value by calculating the Cohen's kappa value of a series of classification analyses using different values of the parameter and by choosing the value that had the maximum Cohen kappa value.

Tuning of Parameters and Performance Assessment of the Classification Models
We carried out a stratified 10-fold cross-validation using data corresponding to the classes "Andes mountain forests" and "shrubland," dividing the data into 10 sub-datasets of approximately the same size and maintaining the ratio of data of each class.Each model was trained using nine folds of the data and validated with the remaining fold.
In order to avoid the overfitting problem, we followed a simple recommended approach to compare classifiers using cross-validation [41] by carrying out the classifier parameters' tuning task using only the training data.The procedure is as follows: i We created a training set T = A − k for each of the k subsets of the dataset A ii We divided the training set T into subsets t1 and t2; these subsets were used for training and tuning respectively.The subset of variables or features used to fine tune the classifiers are the same set of variables selected for each model, as shown in Section 3.1.iii When the parameters of the classifier were tuned for maximum accuracy, we re-ran each of the models with the initial larger training set T. We chose values of the tuning parameters that maximizes the average of Sensitivity and Specificity metrics.This criterion is recommended for conservation studies where omission error is undesirable [42].iv The classification precision indicators were calculated using the fold k as the validation data.v Mean and standard deviation of the precision indicators were calculated for comparison analysis.
As performance indicators, we used Cohen's kappa value and the area under the curve (AUC) from the receiver operating characteristic (ROC) theory [43,44].Cohen's kappa values were calculated using a confusion matrix constructed with the results obtained by classifying the verification data.However, because of the imbalance in the proportion of data between the two land cover types, the result was biased toward the larger one of the two.Thus, analyzing the performance of the models with a probability threshold of 0.5 will under-predict the occurrence of the rarest class [45,46].To avoid this effect, we used the ROC curve to determine the corresponding optimum threshold of each classification analysis by selecting the threshold value that maximizes the average of Sensitivity and Specificity.Using the new calculated threshold value for each classification result in the cross-validation procedure, we constructed the corresponding confusion matrices from which Cohen's kappa values were calculated.The AUC value is threshold independent, which means it gives a value of overall accuracy based on many different probability thresholds.The value of AUC varies from 0.5 to 1.0 (a perfect fit), and we calculated it with the R software package pROC.
We ranked the performances of all of the models obtained with each fold and calculated the mean value of the ranks obtained per model [47]; a higher rank (1 being the highest rank) would indicate a higher performance indicator value.The mean ranks of the performance indicators were compared as well as the corresponding mean and standard deviation.Also, Friedman tests were carried out to determine if the various models yielded statistically different AUC and Cohen's kappa values [48].The test was conducted using the mean rank of the performances of all of the models.If the p-value of the test was significant (p < 0.05), we could reject the null hypothesis that the difference in classification performances between the models is zero.Then, using the Nemenyi post-hoc test, we carried out a pairwise multiple comparison of ranks and determined which models show a significant performance difference compared to the others by using the critical distance (CD) [49].This test is conservative and robust when analyzing a small amount of unbalanced data.The CD value was calculated as follows: where N is the number of folds, k is the number of models to be compared, α is the confidence level, and q α is the critical value based on the Studentized range statistic that depends on the significance level α and k.Finally, we attempted to determine if, given the same set of variables, a particular MLA would produce a better classifier for "Andes mountain forest" vs. "shrubs" classification.For this, we conducted Nemenyi post-hoc statistical tests to test to determine if there is a significant difference between the performance indicator values of a particular model produced by RF, SVM, and kNN.

Selection of Variables and Constructed Models
Table 3 shows the total Akaike weights used to rank the variables in order of importance to differentiate Andes mountain forest and shrubland.The most important variables are the mean values of elevation, MSAVI, and the reflectance in the NIR band (B5 in Table 1).Table 4 shows the matrix of Pearson correlations.With the exception of the mean reflectance of Band 5, all of the mean reflectance values of the Landsat 8 OLI bands are correlated with each other.Elevation has a strong negative correlation with NDVI, which reflects a decrease in green vegetation mass along the altitude gradient observed during the field surveys.Although NDVI is correlated with all variables except aspect, MSAVI was only correlated with NDVI.Elevation also showed a high correlation with the mean reflectance of short wavelength infrared 2 (B7).Since Elevation showed a higher Total AIC value, Elevation was preferred over B7.Using Tables 3 and 4, we chose uncorrelated variables of great importance to build models that contained one vegetation index.We also included the same models without the topographic variables to test whether the model accuracy increased when these variables were included.Following these criteria, we built models with the following features: We also compared models produced by the SVM and RF algorithms using all the variables available: mean reflectance from the bands 2 to 7, NDVI, MSAVI, and mean values of elevation and topographic aspect.These models were denoted as SVM:All and RF:All.

AUC and Cohen's Kappa Results
Tables 5 and 6 show the mean rank calculated for all the models using AUC and Cohen's kappa values, respectively, obtained by the 10-fold cross-validation procedure.We can observe that the model with the highest AUC and Cohen's Kappa mean values was produced by the SVM algorithm: SVM:All.In contrast, SVM:N and SVM:NA had the lowest mean values of the performance indicators.This result suggests that the performance of a classifier produced by the SVM algorithm is highly related to the dimensionality of the training dataset.

Non-Parametric Tests for Multiple Comparisons
The Friedman test results showed a significant difference in the AUC (X 2 = 62.29, df = 19, p = 1.7 × 10 −6 ) and Cohen's kappa (X 2 = 47.46,df = 19, p = 3.1 × 10 −4 ) values of all the models.This indicates that at least one model has different performance values (i.e., it is safe to reject the null hypothesis that all the classification models perform the same).Therefore, we could proceed with the Nemenyi post-hoc test.
We carried out the comparative analysis of all of the models using the mean rank calculated using AUC and Cohen's kappa values.Using the calculated value of CD = 9.3760, we connected the groups of models that have statistically similar values of performance indicators (Figures 3 and 4).For the AUC test (Figure 3), only the distance between the rank of the worst model (SVM:NA) and the three best models (SVM:All, SVM:PC, and RF:TC) were greater than the CD value.This indicates that these three models performed significantly better than the rest, and that the models SVM:NA, SVM:N, and kNN:N are, statistically, the worst performing models.We also noticed that the model produced by the kNN method with the highest mean value of performance indicators (kNN:M56, AUC = 0.75, Cohen's Kappa = 0.29) and the worst performing models belong to the same group determined by CD.On the other hand, the models produced by the SVM and RF methods with the highest mean value of performance indicators (SVM:All, AUC = 0.81, Cohen's Kappa = 0.43; RF:TC, AUC = 0.79, Cohen's Kappa = 0.35) showed no significant difference.The Cohen's kappa value comparison (Figure 4) shows that the model SVM:All and other nine models have mean Cohen's kappa values that are statistically different from the worst performing model: SVM:N.However, the only model with a kappa value high enough to reach moderate agreement (>0.4) is SVM:All.
Remote Sens. 2018, 10, x FOR PEER REVIEW 13 of 20 We carried out the comparative analysis of all of the models using the mean rank calculated using AUC and Cohen's kappa values.Using the calculated value of CD = 9.3760, we connected the groups of models that have statistically similar values of performance indicators (Figures 3 and 4).For the AUC test (Figure 3), only the distance between the rank of the worst model (SVM:NA) and the three best models (SVM:All, SVM:PC, and RF:TC) were greater than the CD value.This indicates that these three models performed significantly better than the rest, and that the models SVM:NA, SVM:N, and kNN:N are, statistically, the worst performing models.We also noticed that the model produced by the kNN method with the highest mean value of performance indicators (kNN:M56, AUC = 0.75, Cohen's Kappa = 0.29) and the worst performing models belong to the same group determined by CD.On the other hand, the models produced by the SVM and RF methods with the highest mean value of performance indicators (SVM:All, AUC = 0.81, Cohen's Kappa = 0.43; RF:TC, AUC = 0.79, Cohen's Kappa = 0.35) showed no significant difference.The Cohen's kappa value comparison (Figure 4) shows that the model SVM:All and other nine models have mean Cohen's kappa values that are statistically different from the worst performing model: SVM:N.However, the only model with a kappa value high enough to reach moderate agreement (>0.4) is SVM:All.
Tables 7 and 8 show the results of the Nemenyi test for pairwise comparison of the classification performance of machine learning algorithms per model using AUC and Cohen's kappa as the performance indicator, respectively.The test shows a significant difference in Cohen's kappa value in only three cases.In one case, it shows that RF can generate a better performing classifier than kNN when using principal components.In the other two cases, it shows that the RF and kNN algorithms can produce a classifier for Andes mountain forests with a kappa value significantly higher than the SVM algorithm when using the NDVI as the only classification feature.Tables 7 and 8 show the results of the Nemenyi test for pairwise comparison of the classification performance of machine learning algorithms per model using AUC and Cohen's kappa as the performance indicator, respectively.The test shows a significant difference in Cohen's kappa value in only three cases.In one case, it shows that RF can generate a better performing classifier than kNN when using principal components.In the other two cases, it shows that the RF and kNN algorithms can produce a classifier for Andes mountain forests with a kappa value significantly higher than the SVM algorithm when using the NDVI as the only classification feature.The Andes region has a complex topography.Its altitudinal range yields an environmental gradient that contains a variety of ecosystems, vegetation communities, and forest formations.Human pressure on the Andes mountains puts natural resources such as its mountain forests at risk.The evaluation and monitoring of forest conditions is very important for conservation activities and land cover mapping is one important tool to accomplish them.Remote sensing is widely used to monitor large areas at relatively low cost and produce the necessary maps for forest monitoring.This study provides an important contribution on the selection of better classification techniques to create more accurate maps for Andes mountain forests.
Mapping and monitoring large areas of vegetation in semi-arid environments using remote sensing techniques is challenging.One issue is the collection of good quality training data, because the terrain and mixed vegetation introduce noise to the data [50].Such is the case for the Andes mountain terrain, and its remaining low-density forest consisting of a mixed population of native tree species and shrubs.Hence, robust classification methods are essential for mapping this type of forest.Well-known parametric methods such as the maximum likelihood method (ML) have been used for forest mapping using remote sensing data, as well as more complex non-parametric algorithms such as RF, SVM, and kNN which, unlike ML, do not require prior knowledge of the underlying probability density function.In particular, RF and SVM have been shown to outperform ML because they can handle large multivariate and highly collinear datasets [51], which are provided by multispectral (e.g., LANDSAT, SPOT) and newer hyperspectral (e.g., AVIRIS, EO-1Hyperion) high resolution imagery.
When sampling data from a rare land cover type, it is necessary to select a sampling method that enables statistically robust comparisons and logistically feasible sampling within the available budget and staff capacity.However, classification analysis using machine learning algorithms is only as effective as the quality and quantity of the training data they are learning from.We combined an exhaustive field survey within the study area and an indirect sampling method to gather enough data to cover the reflectance variability of Andes mountain forest within the study region (Cuzco, Peru).For this reason, the direct application of these results is restricted to a region of the Andes where forest communities with the same species composition and vegetation structure exist.Given our data collection strategy, we were able to identify models that have the highest and lowest classification performance by using a conservative statistical test (significance level = 5%).

Comparison of MLA Classifiers
By comparing the performances of classifiers produced by the same machine learning algorithm (MLA) in Tables 5 and 6, we observed higher performances in models using the Modified Soil-adjusted Vegetation Index rather than the Normalized Difference Vegetation Index.This can be explained by the effect of soil brightness over the NIR-red band ratio [52], which affects NDVI.On the other hand, MSAVI is less sensitive to changes in soil brightness (due to soil background variations) and shadows than NDVI.This suggests that MSAVI could be a better variable to classify different vegetation types in arid and semi-arid regions where the vegetation cover rate is low.Also, the model SVM:All shows the highest mean AUC and Cohen's kappa values, which suggests that SVM:All is the best model for classifying Andes mountain forests and shrubland.
The results of this research provide new insights to the performance of RF, SVM, and kNN algorithms in the context of mapping Andean forests over large areas with noisy data.Table 5 and Figure 3 show that only three models with the highest ranks have a statistically better performance than the lowest ranked model (SVM:NA), whereas the classification performances of the remaining models were not statistically different.One way to choose from these three top-performing models is to examine the performance indicator's standard deviation.Of the three best models, SVM:All had the highest mean AUC value and the smallest standard deviation followed by SVM:PC which uses a transformation to reduce the dimensionality of the training data used by the former.When analyzing the Cohen's Kappa values (Table 6 and Figure 4) there are ten models that statistically out-perform the worst model.It should be noted that none of these ten models were generated using the kNN algorithm.The performance of kNN was low compared with the other methods, probably because kNN has a greater reliance on data quality.Since this algorithm determines similarity by using the minimum distance to the reference data, the performance of the algorithm is very sensitive to the presence of outliers and noise in the data that was selected to train the classifier.Since Andes mountain forest land cover data is inherently noisy, SVM and RF are more suitable methods for mapping this type of terrain.

The Kappa Coefficient and Its Use as a Performance Indicator of Classification Models
This paper used the area under the curve (AUC) and the Cohen's kappa coefficient for performance assessment of the MLAs.In particular, the Cohen's kappa coefficient is a measure that takes into account the possibility of agreement by chance, and variations of it have been proposed to improve it [53].Its use has been met with intense discussion; for example, Pontius and Millones [54] pointed out that Cohen's kappa coefficient masks sources of error that are significantly different, and its dependence on a comparison with chance agreement is not informative.Although other authors have also criticized this indicator [55][56][57][58], to this day, Cohen's kappa value is still a widely used metric in remote sensing to report classification results [59][60][61][62][63].For this reason, we recommend using additional performance indicators to be used in conjunction with Cohen's kappa value.

Machine Learning Algorithms and Contrast to Recent Research
Machine learning algorithms and their applications, such as regression and classification, have been studied for many years.These applications depend on estimating the probability distribution of the population data from samples, which is the most fundamental statistical approach [64].A simpler approach called density-ratio estimation involves estimating the ratio of probability densities instead of the probability distribution.Density-ratio estimation has been used for predicting forest stand attributes [65] and land cover detection [66], and its application in a framework of machine learning has been proposed [67,68].Future research should focus on deepening the application of this new framework for probabilistic classification and determining whether or not this approach produces more accurate results in the classification of mountain forests when using noisy data.

Conclusions
In this study, we determined the best model to classify Andes mountain forests and shrubland through a series of statistical comparisons using Landsat 8 OLI satellite data and land cover data.We concluded that the highest classification performance was obtained from a SVM classification model using the reflectance values from bands 2 to 7, NDVI, MSAVI, and the topographic variables elevation and aspect.
Based on our statistical pairwise comparison of the MLAs per model (Tables 7 and 8), we found that the SVM and RF algorithm produce comparable classifiers for distinguishing Andes mountain forests and shrub land.
In contrast, the kNN models generally yielded lower ranks and lower mean values of performance indicators when compared with the SVM and RF models.These results suggest that the kNN algorithm had the lowest performance in the comparison test.One possible reason is that kNN is a "lazy learner"; that is, this method does not create classification rules from training data that can be generalized when classifying new data.Instead, it uses the training data (itself) to carry out the classification and predicts the class label of the new data from the k closest neighbors.This approach is likely to be a problem in classifying noisy training data which are derived from the variability of reflectance values in sparse Andes mountain forests that commonly grow in association with shrubland.
In conclusion, the results of our statistical analyses suggest that SVM and RF are suitable machine learning methods for producing accurate classifier for Andes mountain forests.
Author Contributions: Y.H. and L.A.V.I conceptualized the manuscript topic.Y.H conceptualized the data sampling design and was in charge of overall direction and planning.L.C.V.S. and N.S.T provided valuable information for field survey site selection.All co-authors were involved in field data collection.L.C.V.S. and N.S.T processed the field data and were in charge of land cover data collection using the software "Collect Earth".Y.H carried out the atmospheric-topographic correction analysis of the images.L.A.V.I processed the images, performed the scientific data analysis, and prepared the draft of the manuscript.Y.H. reviewed and edited the first draft of the manuscript.All co-authors contributed to the final writing.

Figure 1 .
Figure 1.Study area in Peru.The red box outlines the image area.

Figure 1 .
Figure 1.Study area in Peru.The red box outlines the image area.

Figure 3 .
Figure 3. Critical Difference (CD) diagram for the Nemenyi test showing the results of the statistical comparison of all models against each other by mean ranks based on AUC values (higher ranks, such as 5.9 for SVM:All, correspond to higher values of AUC).Classifiers that are not connected by a bold line of length equal to CD have significantly different mean ranks (Confidence level of 95%).

Figure 3 .
Figure 3. Critical Difference (CD) diagram for the Nemenyi test showing the results of the statistical comparison of all models against each other by mean ranks based on AUC values (higher ranks, such as 5.9 for SVM:All, correspond to higher values of AUC).Classifiers that are not connected by a bold line of length equal to CD have significantly different mean ranks (Confidence level of 95%).

Figure 4 .
Figure 4. Critical Difference (CD) diagram for the Nemenyi test showing the results of the statistical comparison of all models against each other by mean ranks based on Cohen's Kappa values (higher ranks, such as 4.9 for SVM:All, correspond to higher values of Cohen's Kappa).Classifiers that are not connected by a bold line of length equal to CD have significantly different mean ranks (Confidence level of 95%).

Figure 4 .
Figure 4. Critical Difference (CD) diagram for the Nemenyi test showing the results of the statistical comparison of all models against each other by mean ranks based on Cohen's Kappa values (higher ranks, such as 4.9 for SVM:All, correspond to higher values of Cohen's Kappa).Classifiers that are not connected by a bold line of length equal to CD have significantly different mean ranks (Confidence level of 95%).

Table 1 .
Satellite and DEM data used in the analysis.

Table 2 .
Land cover dataset gathered by the field survey and by using Collect Earth software.

Table 2 .
Land cover dataset gathered by the field survey and by using Collect Earth software.

Table 3 .
Total sum of the Akaike weights for the variables in the study.Higher values indicate greater importance.

Table 4 .
Pearson correlation matrix for the variables in the study.Significant low correlation values are shaded.

Table 5 .
Models ranked by AUC values obtained by 10-fold cross-validation.
SD: standard deviation; BCI: bootstrap confidence interval calculated by bootstrap analysis with 1000 repetitions.

Table 6 .
Models ranked by Cohen's kappa values obtained by 10-fold cross-validation.
SD: standard deviation; BCI: bootstrap confidence interval calculated by bootstrap analysis with 1000 repetitions.

Table 7 .
p values of the Nemenyi post-hoc tests for multiple comparisons.The table shows the p values resulting from the pairwise comparisons of AUC values produced by classifiers using the same variables but produced by different MLAs.No significant differences (p < 0.05) are observed.

Table 8 .
p values of the Nemenyi post-hoc tests for multiple comparisons.The table shows the p values resulting from the pairwise comparisons of Cohen's Kappa values produced by classifiers using the same variables but produced by different MLAs.Significant differences (p < 0.05) are underlined.

Table 7 .
p values of the Nemenyi post-hoc tests for multiple comparisons.The table shows the p values resulting from the pairwise comparisons of AUC values produced by classifiers using the same variables but produced by different MLAs.No significant differences (p < 0.05) are observed.

Table 8 .
p values of the Nemenyi post-hoc tests for multiple comparisons.The table shows the p values resulting from the pairwise comparisons of Cohen's Kappa values produced by classifiers using the same variables but produced by different MLAs.Significant differences (p < 0.05) are underlined.