Prediction of Dominant Forest Tree Species Using QuickBird and Environmental Data

Modelling the spatial distribution of plants is an indirect method for predicting plant properties and is based on the relationship between the spatial distribution of vegetation and environmental variables. In this article, we introduce a new method for the spatial prediction of dominant tree species through a combination of environmental and satellite data. Based on the basal area factor (BAF) frequency for each tree species in a total of 518 sample plots, the dominant tree species were determined for each plot. Topographical maps of primary and secondary properties were prepared using the digital elevation model (DEM). Soil categories and climate maps were extracted from the database of the Doctor Bahramnia Forestry Plan. After pre-processing and processing of the spectral data, the pixel values at the sample locations were extracted for all the independent factors, both spectral and non-spectral. Models of tree and shrub species diversity were built with data mining algorithms using 80% of the sample plots, and model accuracy was assessed using the remaining 20% of the samples and evaluation criteria. Random forest (RF), support vector machine (SVM) and k-nearest neighbor (k-NN) algorithms were used for spatial distribution modelling of dominant species groups using environmental and spectral variables. The results showed that physiographic factors, especially altitude, in combination with soil and climate factors were the most important variables in the distribution of species, while the best model was created by the integration of physiographic factors (in combination with soil and climate), with an overall accuracy of 63.85%. In addition, the comparison between the algorithms showed that the RF algorithm was the most accurate in modelling the diversity.


Introduction
The forests of Iran cover an area of about 12.4 million ha, comprising 7.4% of the country's area [1]. Of the five vegetation regions, the most important according to forest density, canopy cover and diversity is the Hyrcanian (Caspian) region [2]. In forest ecosystems, trees and shrubs live either independently (individually) or in association with each other, where some species are dominant over other species based on different biotic and non-biotic factors, comprising a stand with an area larger than 0.5 hectare or group species with an area smaller than 0.5 hectare.
Mapping forest stand types or dominant tree and shrub species is one of the most important ways to manage and protect plant communities. Information about dominant tree species is required to assess forest resilience and vulnerability to threats such as drought and pathogens [3]. Methods that require less cost and time can be an alternative to field mapping. One such alternative is the use of environmental data that are spatially related to forest tree species groups or stands. Plant spatial distribution modelling is one of the indirect methods for predicting the properties of plants and can be defined as the relationship between the spatial distribution of vegetation and environmental variables. Recognition of the relationship between environmental factors and plant species distribution plays an important role in environmental planning and ecosystem management [4,5]. The appearance and stability of each plant species are therefore generally influenced by environmental factors and the relationships between them, with one or more factors having the greatest impact on its establishment.
Forests 2017, 8, 42
In recent decades, airborne and space-borne remote sensing (RS) sensors have made it possible to record the electromagnetic radiation of vegetation at various wavelengths. Tree species identification using remote sensing is a classic topic in forestry [6]. RS data with different spatial, spectral, and radiometric resolutions offer another alternative for estimating and modelling vegetation attributes [2]. The estimated results often differ according to the chosen algorithms and the conditions of the study area [7]. However, some studies have already addressed the challenges in accurate tree species classification; Shataee et al. [8] and Mohammadi et al. [9] reported that using only spectral data could not provide useful information. Despite the emergence of very high resolution (VHR) sensors and their significance in the classification of dominant tree species, the limited number of spectral bands did not permit precise species discrimination [10]. Several other studies also investigated the possibility of using airborne hyperspectral imagery [11-13], LiDAR data, or a combination of multiple other techniques, for instance, auxiliary (environmental) data, for the prediction of tree species [14-16]. Differences in the biochemical properties of tree species and in structural parameters such as the texture and surface of the leaves are better preserved using hyperspectral information, which allows continuous sampling of the electromagnetic spectrum. Moreover, adding LiDAR-based information about canopy structure to hyperspectral responses is also crucial for improving the classification of dominant tree species [17]. However, the operational use of these kinds of data is still a challenging task, mainly because of their high cost and limited availability.
To overcome the above limitations, we attempted a different and more radical approach toward the prediction of dominant tree species, through a combination of satellite and auxiliary (environmental) data. Spatial distribution is directly dependent on both competition and environmental conditions such as solar radiation, climate, water, topography and available nutrients. If these factors and their behaviors in relation to the distribution of species could be determined, it would be possible to build distribution models that predict dominant tree and shrub species [18]. Therefore, it is possible to investigate the spatial distribution of these features using environmental variables [19,20]. Until now, many studies have been conducted to predict the distribution of vegetation with physiographic, climate or soil factors. In many studies, physiographic factors have been reported as the most influential and were used to classify and predict vegetation characteristics as well as species distribution at different scales [21]. For instance, elevation and slope [22,23], aspect [23,24] and the secondary topographic variable of potential solar radiation [22,24] were reported as the most important factors influencing the distribution of tree species. Additionally, edaphic factors such as soil type, moisture, soil temperature and soil nutrient availability to plants are among the most important factors affecting the distribution of tree species [25-27]. On the other hand, climatic factors can increase, decrease, or change the distribution of tree species depending on the species type. Several studies that investigated the relationship between climatic factors and the distribution of tree species identified rain [28,29] and temperature [23,29] as the most important factors affecting plant communities.
Also, previous studies have shown that the use of auxiliary data in combination with spectral data can improve the prediction results [30,31]. The use of high and very high spatial resolution satellite imagery (such as QuickBird, IKONOS and Landsat) can result in high-precision imputation, especially for tree species and plot estimation [2]. Many studies also concluded that adding satellite-based spectral data to environmental data can improve the prediction of qualitative and quantitative characteristics of forests. Mohammadi et al. [31] showed an improvement in the classification results of forest types by combining spectral and auxiliary data, determining the prior probability and creating spatial models of the occurrence of forest types. Many studies have verified the potential for classification improvements by adding auxiliary data. As an example, Wheatley et al. [32] used topographic data with Landsat thematic mapper (TM) images to improve the accuracy of land cover maps. Also, Saatchi et al. [33] used the moderate-resolution imaging spectroradiometer (MODIS), quick scatterometer (QSCAT), the shuttle radar topography mission (SRTM), and the tropical rainfall measuring mission (TRMM) together with elevation and solar radiation data to model the potential distribution of tree species and diversity. In addition, Wang et al. [34] combined TM and auxiliary data, including a digital elevation model (DEM), slope and moisture percentage, in vegetation classification. Hernández Stefanoni et al. [35] combined geostatistical modelling and RS data to improve tropical species richness mapping. In another study, Adhikari et al. [36] investigated the possibility of spatial modelling of Ilex Khasiana P. using 16 environmental parameters, while Riemer Sørensen et al. [37] modelled the distribution of 10 commercial palm species using climatic, topographic, and spectral data, as well as a combination of them.
In the past two decades, parametric algorithms such as multiple linear and non-linear regressions have been popular methods for estimating forest characteristics using non-spectral and spectral data [38,39]. Recently, non-parametric algorithms have been used for the prediction and estimation of forest attributes because of some advantages they have over parametric algorithms, i.e., flexibility and the ability to describe non-linear dependencies, the fact that they are free from the assumption of any given probability distribution, and the fact that the observations are assumed to be independent of each other [40-42]. So far, non-parametric algorithms such as the generalized linear model (GLM), artificial neural networks (ANN), random forest (RF), support vector machine (SVM), and the k-nearest neighbor (k-NN) have been used to predict the biological properties of forests [43]. Non-parametric machine learning techniques have demonstrated superior performance over classic regression analysis for estimating forest attributes. Non-parametric models are more efficient than parametric models because more than one independent variable can be used simultaneously and because there is no need for the data to be normally distributed. Among the non-parametric algorithms, RF, SVM and k-NN are three of many machine learning algorithms that have demonstrated good performance in forest attribute estimation [2]. Thus, in the present study, these three commonly used data mining classifier algorithms are used for tree species prediction.

k-Nearest Neighbor
The k-NN method is one of the simplest and most popular data-mining algorithms used for classification and regression. k-NN is widely used for the estimation of forest attributes using various topographic and remote-sensing data [44-49]. In k-NN implementations, three factors should be determined: the number of neighbors k, the type of distance measure and the weights for the nearest neighbors.

Support Vector Machine Classification
This algorithm is suitable for both classification and regression and is based on statistical learning theory [50]. Generally, SVMs focus on the boundary between classes and map the input space created by the independent variables using a non-linear transformation according to a kernel function. Linear, polynomial, radial basis function (RBF) and sigmoid are the most commonly used kernel types. The RBF is the most popular kernel used in SVMs [51,52]. According to our literature review, SVM has been used for forest classification [53-55]. Sheeren et al. [10] found SVM to be the best classifier among various non-parametric techniques, with results very close to those of the other classifiers (k-NN and RF).
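As a minimal illustration of the RBF kernel mentioned above, the sketch below evaluates k(x, y) = exp(-gamma * ||x - y||^2) for two feature vectors. The function name and the sample values are illustrative assumptions and are not taken from the study.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Hypothetical feature vectors: identical inputs give a similarity of 1,
# and the similarity decays as the inputs move apart.
same = rbf_kernel([0.2, 0.5], [0.2, 0.5], gamma=0.5)
far = rbf_kernel([0.2, 0.5], [1.2, 0.5], gamma=0.5)
```

A larger gamma makes the similarity fall off faster with distance, which is why gamma is usually tuned or fixed before the other SVM parameters are searched.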

Random Forest
RF is a relatively recent algorithm in the field of data mining and is designed to produce accurate predictions that do not overfit the data [56,57]. RF can be used for regression-type problems (to predict a continuous dependent variable) and classification problems (to predict a categorical dependent variable). Implementation of RF depends on the regularization of the decision tree and stopping parameters. The decision tree model parameters include the maximum number of trees that must be grown in the forest and the number of variables (k predictor or independent variables for predicting the dependent values) that are randomly selected at each node [58]. Choosing too small a number of predictor variables may degrade prediction performance, because this can exclude variables that may account for most of the variability and trends in the data [58]. The stopping parameters or control parameters are used to stop running the algorithm when satisfactory results have been achieved [2]. In some studies, such as Shataee et al. [2] and Garzón et al. [59], RF has been used for the prediction of forest attributes.
The main aim of this study was to investigate the possibility of using high resolution satellite images derived from QuickBird with non-parametric algorithms to estimate the distribution of tree species, and also to investigate the effect of environmental factors on tree species groups. Our literature review showed that no study has been carried out on the application of machine-learning algorithms (k-NN, SVM or RF) using QuickBird data for the prediction of the distribution of tree species groups in the Hyrcanian forest. Therefore, for the purposes of this study, three of the most commonly used non-parametric machine-learning methods were applied for the first time to classify the spatial distribution of forest tree species as groups (forest stand types) by climate, topography and soil data, and by a combination of them with the spectral data of QuickBird, in the Dr. Bahramnia forestry plan site in Golestan province, in northeastern Iran. Overall, we wanted to study how helpful the combination of spectral and non-spectral data can be in increasing the accuracy of species distribution modelling (SDM).

Study Area
The study area is located south-west of Gorgan city in Golestan province in Iran (36°43′ N to 36°46′ N and 54°21′ E to 54°24′ E). The projection used in both Figures 1 and 2 was UTM WGS 1984 zone 40 N. The total area is 1714 hectares, where the green markers indicate the sample plots as seen in Figure 1. The elevation of the study area ranges from 220 to 1012 m, the slope ranges between 0 and 80% and the soil type is characterized as brown and grey-brown (Figure 2). The average precipitation is 649 mm. With regard to aspect, 45% of the total area faces west, 42% north, 10% east and 3% south. Regarding slope, 81% of the study area was between 0% and 30%, 18% between 31% and 60% and 1% above 61%. Information for different species is presented in Table 1.

Field Survey
To determine the dominant trees, we used the information of measured trees from 518 permanent sample plots with a radius of 17.5 m, laid out in a systematic network of 150 m × 200 m grids. The geographic position of plot centers and forest attributes such as diameter, height, crown diameter, name of species and tree health status were recorded on inventory forms. We included all trees with diameters greater than 10 cm for measuring the basal area. Based on the basal area factor (BAF) for every species in each plot, we were able to extract and specify the dominant tree species. Any species with a basal area frequency above 50% in a plot was selected as the dominant species. In addition, plots that did not contain a particular species group (basal area frequency of all species was lower than 50%) were excluded from the analysis, because they can cause uncertainty in tree species identification.
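The dominance rule described above can be sketched as follows. This is a minimal illustration that assumes the "basal area frequency" is each species' share of the plot's total basal area and that basal area is computed from the stem diameter as a circle; the function name, the input layout and the sample plots are hypothetical.

```python
import math
from collections import defaultdict

def dominant_species(trees, min_dbh_cm=10.0, threshold=0.5):
    """Return the species holding more than `threshold` of the plot basal area.

    `trees` is a list of (species, diameter_cm) pairs for one plot.
    Returns None when no species exceeds the threshold (plot excluded).
    """
    totals = defaultdict(float)
    for species, dbh_cm in trees:
        if dbh_cm > min_dbh_cm:  # only trees thicker than 10 cm are counted
            totals[species] += math.pi * (dbh_cm / 200.0) ** 2  # basal area, m^2
    plot_total = sum(totals.values())
    if plot_total == 0:
        return None
    species, area = max(totals.items(), key=lambda item: item[1])
    return species if area / plot_total > threshold else None

# Hypothetical plots: one dominated by beech, one with no dominant species.
plot_a = [("Fagus orientalis", 40), ("Fagus orientalis", 35),
          ("Carpinus betulus", 30), ("Parrotia persica", 8)]
plot_b = [("Fagus orientalis", 30), ("Carpinus betulus", 30),
          ("Parrotia persica", 30)]
label_a = dominant_species(plot_a)
label_b = dominant_species(plot_b)
```

Plots that return `None` correspond to the mixed plots that were excluded from the analysis.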

Environmental Data
We constructed a DEM (cell size = 30 m) for the study area from the topographic map (1:25,000 scale with 10 m contour interval). To evaluate the DEM quality, we first applied the hillshade tool to create a shaded relief from the DEM, taking into account the illumination source angle and shadows, so that large errors (noise and sudden changes in values) could be identified visually. Then, we used the contour tool in ArcGIS 10.1 software (ESRI, Redlands, CA, USA) to recreate the contours from the DEM (15 m contour interval) and finally compared the original contours with the new contours obtained from the interpolation.
Using ArcGIS V.9.3 (ESRI, Redlands, CA, USA, 2008) and the terrain analysis system (TAS V.1.0) (University of Western Ontario, Ontario, Canada, 2003), we used the DEM to construct primary and secondary topographic characteristic maps (Tables 2 and 3) and climatic maps of average annual precipitation (Equation (1)), average annual temperature (Equation (2)) and average annual evaporation (Equation (3)). These equations were derived from two decades of records from meteorological stations [60]. In Equation (1), Y is the average annual precipitation (mm) and X is the elevation.
In Equation (2), T is the average annual temperature (°C) and X is the elevation.
In Equation (3), ETP is the average annual evaporation (mm) and X is the elevation.
The three main terms account for direct-beam, diffuse, and reflected irradiance. A variety of methods are used by different authors to calculate these individual components. The methods vary tremendously in terms of sophistication, input data, and accuracy.
The zonal statistic algorithm was used for extracting the sub-order soil factor (Table 4) from [64] and other layers such as topographic and climatic factors, by using a buffer layer around the center of plots (17.5 m radius).

Pre-Processing and Processing of Spectral Data
In the first step, geometric correction was done using a DEM with 10 m accuracy in order to eliminate the displacement caused by the topography, using the quadratic equation and nearest neighbor sampling. The geo-referencing mean square error was less than one meter for the multispectral bands and less than 40 cm for the panchromatic band. Image processing, including principal component analysis (PCA), tasseled cap transformation (brightness, greenness and wetness) (Table 5), the normalized difference vegetation index (NDVI) and texture analysis, was performed in order to enhance the spectral differences between tree species.
In this study, 13 texture analysis variables were used, namely mean, variance, entropy, contrast, heterogeneity, homogeneity, angular second moment, correlation, gray-level difference vector (GLDV) angular second moment, GLDV entropy, GLDV mean, GLDV contrast and inverse difference, with a kernel window size of 12 × 12 for the RGB and infrared (IR) bands and 50 × 50 for the panchromatic band. Overall, 75 RS data layers were included in this study together with the other non-spectral layers.
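Of the spectral layers listed in this section, the NDVI is the simplest to illustrate. The sketch below uses the standard definition NDVI = (NIR - Red) / (NIR + Red); the function names and the tiny two-pixel bands are hypothetical, and the software actually used to derive the index is not stated in the text.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index for one pixel."""
    denom = nir + red
    # Guard against division by zero on pixels with no signal in either band.
    return (nir - red) / denom if denom != 0 else 0.0

def ndvi_image(nir_band, red_band):
    """Apply the index pixel by pixel to two bands given as nested lists."""
    return [[ndvi(n, r) for n, r in zip(nir_row, red_row)]
            for nir_row, red_row in zip(nir_band, red_band)]

# Hypothetical reflectances: a vegetated pixel and a neutral pixel.
layer = ndvi_image([[0.5, 0.2]], [[0.1, 0.2]])
```

Values close to 1 indicate dense green vegetation, while values near 0 indicate surfaces that reflect the red and near-infrared bands about equally.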

Extracting Data
The whole structure of layer preparation in this work is shown in Figure 4. The zonal statistic algorithm in ArcGIS 10.1 software was used to extract the mean of the non-spectral and spectral values for each plot, using a buffer layer with a radius of 17.5 m around the plot center.
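The buffer-based extraction can be approximated with a short sketch that averages the raster cells whose centers fall inside a circle around the plot center. This is a simplification and is not the ArcGIS zonal statistics implementation; the function name, the raster layout (upper-left origin, rows running south) and the sample grid are assumptions made for illustration.

```python
def buffer_mean(raster, cell_size, origin_x, origin_y, cx, cy, radius=17.5):
    """Mean of the cells whose centers lie within `radius` of (cx, cy).

    `raster[row][col]` holds the values; (origin_x, origin_y) is the
    upper-left corner of the grid. Returns None if no cell center is inside.
    """
    values = []
    for row, line in enumerate(raster):
        for col, value in enumerate(line):
            x = origin_x + (col + 0.5) * cell_size
            y = origin_y - (row + 0.5) * cell_size
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                values.append(value)
    return sum(values) / len(values) if values else None

# Hypothetical 4 x 4 grid of 10 m cells; the four corner cells lie outside
# the 17.5 m buffer around the grid center and do not affect the mean.
grid = [[100, 1, 1, 100],
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [100, 1, 1, 100]]
plot_mean = buffer_mean(grid, 10.0, 0.0, 40.0, cx=20.0, cy=20.0)
```

For layers coarser than the plot itself, such as a 30 m DEM, only a few cell centers fall inside the buffer, so the extracted mean is close to the value of the cell under the plot center.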
The spatial distribution of forest species groups was modelled with the RF, k-NN and SVM algorithms on 80% of the plots, using the topographic, climate, soil, and spectral RS variables either as single groups or in combination. Algorithm performance was evaluated on the remaining 20% of the plots, which were not used in the modelling process, using overall accuracy, user's accuracy and producer's accuracy.
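The three accuracy indicators can be computed from paired reference and predicted labels, as in the sketch below. It uses the usual definitions (producer's accuracy is the share of reference plots of a class that were correctly predicted, user's accuracy is the share of predictions of a class that were correct); the function name and the six sample labels are hypothetical.

```python
def accuracy_report(reference, predicted):
    """Return (overall accuracy, producer's accuracy, user's accuracy)."""
    pairs = list(zip(reference, predicted))
    overall = sum(r == p for r, p in pairs) / len(pairs)
    producer, user = {}, {}
    for c in sorted(set(reference) | set(predicted)):
        hits = sum(r == c and p == c for r, p in pairs)
        n_ref = sum(r == c for r in reference)    # plots truly in class c
        n_pred = sum(p == c for p in predicted)   # plots assigned to class c
        producer[c] = hits / n_ref if n_ref else None
        user[c] = hits / n_pred if n_pred else None
    return overall, producer, user

# Hypothetical validation plots (F = Fagus, C = Carpinus, P = Parrotia).
reference = ["F", "F", "F", "C", "C", "P"]
predicted = ["F", "F", "C", "C", "C", "C"]
overall, producer, user = accuracy_report(reference, predicted)
```

A class that is never predicted has no defined user's accuracy, which the sketch reports as `None` rather than zero.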

Randomly Stratified Sample Splitting Method
In all statistical analyses that require sample subsets, i.e., training, test and validation data sets, each subset should be uniform and representative of all the data [66]. Therefore, the plots were first stratified based on their frequency distributions in different internal categories and then randomly divided into training and validation sets. In other words, both sets covered a sufficient range of the frequency distribution across all classes. This sample splitting method can be called randomly stratified sample splitting [2] (Table 6). To achieve this, the plot values based on basal area were stratified into categories that are currently used for classification or mapping of discrete variables in forest management, and the frequency of plots was computed for each category. Within each category, 80% of the plots were randomly selected for training and 20% for validation of the modelling.
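A randomly stratified 80/20 split of this kind can be sketched as below. The function name, the fixed random seed and the sample plots are assumptions for illustration; the study does not state how the random selection was implemented.

```python
import random
from collections import defaultdict

def stratified_split(plots, train_fraction=0.8, seed=0):
    """Split (plot_id, stratum) pairs so each stratum is divided 80/20."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_stratum = defaultdict(list)
    for plot_id, stratum in plots:
        by_stratum[stratum].append(plot_id)
    train, validation = [], []
    for stratum in sorted(by_stratum):
        ids = by_stratum[stratum][:]
        rng.shuffle(ids)
        n_train = round(train_fraction * len(ids))
        train.extend(ids[:n_train])
        validation.extend(ids[n_train:])
    return train, validation

# Hypothetical plots: 10 in one stratum and 5 in another.
plots = [(i, "Fagus") for i in range(10)] + [(i, "Carpinus") for i in range(10, 15)]
train_ids, validation_ids = stratified_split(plots)
```

Splitting within each stratum keeps rare categories represented in both sets, which a single random split over all plots does not guarantee.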

Implementation of Machine Learning Methods
k-Nearest Neighbor. In k-NN implementations, the number of neighbors k, the type of distance measure and the weighting of the nearest neighbors are three important parameters. Determination of k is very important in terms of calculation time and producing unbiased results.
According to McRobert [67], a range of k from 1 to 20 is the best option when using the k-NN algorithm. He also stated that a higher k value may reduce the effect of noise during calculation; however, this requires a larger number of samples. It is important to mention that with a higher k value, the pixel-level results will average towards the mean [68], leading to a higher bias and less precision. In most studies, the optimal k was reported to be between 5 and 10 [69-71]. However, in some other studies, such as [48] with k between 1 and 35 and [67] with k between 1 and 50, these k values decreased the bias of the modelling process.
A smaller k often leads to a higher variance and less stable results [72]. Therefore, the optimal k depends on the data and the goals of the estimation [73]. In addition, four distance measures are available in Statistica software (StatSoft Inc., Tulsa, OK, USA) for the k-NN algorithm: Euclidean, squared Euclidean, city block (Manhattan) and Chebychev. In this study, after a preliminary test comparing the results of the different distance measures and k values (between 5 and 50), the weighted squared Euclidean distance with k = 50 was used for modelling with the k-NN algorithm.
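A k-NN classifier with squared Euclidean distance and distance-weighted voting can be sketched as follows. The weighting used here (inverse squared distance) is an assumption, since the text does not give the exact weighting scheme applied in Statistica; the function name and the training points are hypothetical.

```python
from collections import defaultdict

def knn_predict(train_x, train_y, query, k=50):
    """Classify `query` by a distance-weighted vote of its k nearest plots."""
    # Squared Euclidean distance from the query to every training sample.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in zip(train_x, train_y)
    )
    votes = defaultdict(float)
    for dist, label in dists[:k]:
        votes[label] += 1.0 / (dist + 1e-9)  # closer neighbors weigh more
    return max(votes, key=votes.get)

# Hypothetical two-variable training set with two species groups.
train_x = [[0, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
train_y = ["A", "A", "B", "B", "B"]
near_a = knn_predict(train_x, train_y, [0.2, 0.1], k=3)
near_b = knn_predict(train_x, train_y, [5, 5], k=3)
```

With distance weighting, a large k such as 50 is less prone to averaging towards the majority class than an unweighted vote, because distant neighbors contribute very little. In practice the predictors would also be rescaled first so that no single variable dominates the distance.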
Support Vector Machine Regression. The prerequisite for support vector regression (SVR) to achieve better results is the appropriate determination of the parameters that play key roles in achieving higher accuracy and better performance [74]. A specified grid search using v-fold cross-validation [52] is the most commonly used method to identify suitable parameters, i.e., epsilon (ε) and capacity (C) with a fixed gamma, that produce high-accuracy results. A brief description of the proposed methods is given in [52]. In this study, the RBF kernel was examined with a fixed gamma, calculated as 1/(number of independent variables) [75]. For selecting the best parameters, 10-fold cross-validation with 1000 iterations was used to minimize the error function [76]. Chang and Lin [75] used a specified grid search method to determine the best capacity and epsilon rates. The specified grid search included a range of capacity from 1 to 40, which is equal to the range of input variables [77], and epsilon values from 0.1 to 0.5.
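The parameter grid and the fold assignment described above can be sketched without any SVM library. The step sizes (1 for capacity, 0.1 for epsilon), the function names and the toy error function are assumptions; in a real run, `cv_error` would fit an SVM with the given parameters on each training fold and return the mean validation error.

```python
from itertools import product

def build_grid(n_features, c_values=range(1, 41),
               eps_values=(0.1, 0.2, 0.3, 0.4, 0.5)):
    """All (C, epsilon) combinations with gamma fixed at 1 / n_features."""
    gamma = 1.0 / n_features
    return [{"C": c, "epsilon": e, "gamma": gamma}
            for c, e in product(c_values, eps_values)]

def k_fold_indices(n_samples, n_folds=10):
    """Yield (train_idx, test_idx) so every sample is held out exactly once."""
    folds = [list(range(i, n_samples, n_folds)) for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

def grid_search(grid, cv_error):
    """Return the parameter set with the lowest cross-validated error."""
    return min(grid, key=cv_error)

# Toy stand-in for the cross-validated error of a fitted model (hypothetical).
grid = build_grid(n_features=4)
best = grid_search(grid, lambda p: abs(p["C"] - 7) + abs(p["epsilon"] - 0.3))
folds = list(k_fold_indices(25))
```

Fixing gamma keeps the search two-dimensional, so the grid above contains 40 × 5 = 200 candidate settings.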
Random Forest. For high-quality RF classification performance, the decision tree model and stopping parameters should be regularized. To determine the optimal number of trees, 2000 initial trees were used to produce a graph showing the average squared error rates against each number of trees for the training and test samples. RF is a powerful analytical tool for the exploration of data and verification of the optimal number of trees. By interpreting the graph, the optimal number of trees is found as the tree number at which the error becomes stable. Then, the RF implementation was repeated using this optimal number of trees and other fixed parameters. In addition, default rates of stopping and splitting parameters were used to stop the process of growing trees when the stopping conditions were reached. The stopping parameters for all estimations included a minimum of one child node and a maximum of 100 nodes, with tree growing stopped after 10 iterations for calculating the mean error and a 5% decrease in the training error.
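In the study the stable point was read from the error graph. A simple numerical stand-in is sketched below: it returns the smallest number of trees after which the error stays within a tolerance over the following window of trees. The function name, the window length, the tolerance and the error curve are all hypothetical.

```python
def stable_tree_count(errors, window=50, tolerance=0.001):
    """Smallest tree count after which the error curve stays flat.

    `errors[i]` is the error of a forest with i + 1 trees. The curve is
    considered stable at a point if the error varies by at most `tolerance`
    over the next `window` trees. Falls back to the full forest size.
    """
    for i in range(len(errors) - window):
        segment = errors[i:i + window + 1]
        if max(segment) - min(segment) <= tolerance:
            return i + 1
    return len(errors)

# Hypothetical error curve: drops over the first trees, then levels off.
errors = [0.5, 0.4, 0.35] + [0.3] * 100
n_trees = stable_tree_count(errors)
```

The forest would then be refitted with `n_trees` trees, mirroring the second RF run described above.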

Results and Discussion
Table 6 contains the numbers of training and test samples and indicates that Fagus orientalis L., Carpinus betulus L., and Parrotia persica are the most frequent and dominant species. After modelling with each of the variable groups individually as well as combined, the overall, producer and user accuracies of the classes are presented in Table 7. The RF algorithm had a slightly higher performance (higher overall accuracy) than the SVM and k-NN algorithms in almost all variable groups. The best overall accuracy (63.85%) was obtained by the RF algorithm using a combination of topography, soil and climate factors, as shown in Table 7. Also, the results showed that modelling based on basal area frequencies had the highest classification accuracy for Carpinus betulus L., Fagus orientalis L. and Parrotia persica compared to the other species shown in Table 6. This may be due to the greater frequency of training and test samples for these species groups.

Topographic Variables
The results (Table 7) showed that the use of primary and secondary topographic variables derived from the DEM gave better results for two dominant species groups, Fagus orientalis L. and Carpinus betulus L., than for any other species. Generally, the use of primary and secondary topographic variables can produce better results for modelling the distribution of a group or community than for the distribution of individual tree species [78]. In other words, the spatial distribution of species communities is more dependent on the environmental conditions than the spatial presence of single trees or species. For example, one tree species such as Fagus orientalis L. can be present across an extended geographical area, but as a community, it can only be present in limited areas. Topographical factors play a significant role in many environmental processes since they can create different microclimate conditions that could affect the habitats of species. In general, topography is one of the most important factors affecting species composition and distribution [79].
Elevation changes can cause ecological changes in air pressure, the amount of ultraviolet light, the type and amount of rainfall, and fluctuations in relative and absolute humidity. Hence, these factors could also affect the composition and distribution of tree species. Based on the results, elevation had the highest effect among the topographic variables on the spatial distribution of tree species. According to the RF results, elevation was the most important factor in the modelling process. Our results on the relationship between elevation and the spatial occurrence of species corroborated previous studies [22,26,36,80,81]. However, they contradicted the results of Wheatley et al. [32].
Also, primary topographic features such as slope and aspect had lower correlations with the spatial distribution of tree species, which is in accordance with the results of Wheatley et al. [32] but contrary to the results of other studies [23,24,27,80]. Similar biological characteristics of dominant species, as well as human interference (for example, silvicultural treatments, management and utilization of forests), can weaken the relationship between slope-aspect parameters and the spatial distribution of tree species. Moreover, in our study, about 80% of the area had a slope between 0% and 30%; this homogeneity can reduce the effectiveness of the slope factor. In addition, about 90% of the total area has north and west aspects; because each aspect covers a wide area, the dominant species groups (Fagus orientalis L., Parrotia persica, and Carpinus betulus L.) are distributed across both, which can reduce the coefficient of the aspect layer in modelling. Following the elevation parameter, solar radiation was the most important and influential independent variable in predicting the spatial distribution of species groups. The importance of solar radiation in the growing season for species spatial distribution agreed with the results of [22,24,26], but contradicted the results of Wheatley et al. [32]. Wheatley et al. [22] proposed that using ground variables for mapping regions with homogeneous elevation is the main reason for this contradiction. Spatial predictions for species with a limited ecological range (i.e., Fagus orientalis L.) corresponded more closely to reality than those for species with a wider distribution [82]. However, the distribution of plants cannot be completely restricted by topographic characteristics, because some species are resistant to various conditions and have wider ecological domains and higher ecological tolerance.

Climate
The results of predicting the spatial distribution of tree species using the climatic variable group showed that all three variables (temperature, evapotranspiration and rainfall) were significant parameters in modelling the distribution of tree species, in agreement with the results of [25,33]. Our results also agreed with many other studies regarding the importance of temperature [23,25] and rainfall [23,28,37] in modelling the distribution of tree species groups.

Sub-Order Soil
In general, soil type is one of the most important factors in the development of plants. The physical, chemical, and biological properties of soil affect the distribution and establishment of plants. Previous studies [25,83,84] emphasized the importance of soil characteristics in species distribution.

Spectral Data
Modelling with spectral data had lower accuracy (54.21%) in all algorithms compared to non-spectral data (Table 7). The results showed that applying spectral data as auxiliary data did not improve the accuracies. Among the QuickBird spectral data, the main bands and the texture analysis outputs, such as mean and greenness, were the best satellite predictors according to the importance factors obtained from the RF analysis [33,85]. Mean texture analysis of all bands had the highest importance coefficient in the distribution modelling of tree species groups, which agreed with [63] and demonstrates the power of texture analysis of high-resolution images for representing the qualitative and quantitative characteristics of forests.
Overall, according to Table 7, modelling of forest species groups based on the basal area frequency showed that topography variables gave the best results (overall accuracy of 62.67%) compared to soil or climate variables. A combination of topography and soil variables improved the overall accuracy to 63.85%, but the combination of all three variable groups (topography, soil and climate) did not improve it (60.24%). Finally, in this study, combining spectral data with the topography, climate and soil data did not improve the results (overall accuracy of 61.44% using the SVM).

Conclusions
In remote sensing studies, it is common for mixed spectral information from different objects to cause errors in sampling; in other words, uncertainty in samples. In our study, we experienced a similar problem because of the uncertainty of a few samples in which the percentage of the dominant tree species was close to 50%. To address this problem, we tested our models using only plots in which the frequency of the dominant trees was more than 75%, but this did not significantly increase the accuracy of the models. Moreover, doing so required eliminating some tree species whose frequency ranged between 50% and 75% per plot. Consequently, we decided to ignore the uncertainty of the samples and the errors that they can cause.
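The dominance-threshold test described above can be illustrated with a minimal sketch. It assumes each plot is stored as a mapping from species to its basal-area share; the plot identifiers, the values and the helper names `dominant_species` and `filter_plots` are hypothetical and are not taken from the study's inventory data.

```python
def dominant_species(basal_area_by_species):
    """Return (species, share) for the species with the largest basal-area share."""
    total = sum(basal_area_by_species.values())
    if total <= 0:
        raise ValueError("plot has no basal area recorded")
    species, area = max(basal_area_by_species.items(), key=lambda kv: kv[1])
    return species, area / total


def filter_plots(plots, min_share):
    """Keep plots whose dominant species exceeds min_share of the plot total."""
    kept = {}
    for plot_id, areas in plots.items():
        species, share = dominant_species(areas)
        if share > min_share:
            kept[plot_id] = species
    return kept


# Hypothetical plots: species -> basal area (arbitrary units).
plots = {
    "p1": {"Fagus orientalis": 9.0, "Carpinus betulus": 1.0},  # 90% dominance
    "p2": {"Fagus orientalis": 5.2, "Parrotia persica": 4.8},  # 52% dominance
    "p3": {"Carpinus betulus": 8.0, "Parrotia persica": 2.0},  # 80% dominance
}

all_labelled = filter_plots(plots, 0.50)  # every plot with a majority species
confident = filter_plots(plots, 0.75)     # only clearly dominated plots
```

With these made-up values, raising the threshold from 50% to 75% drops plot `p2`, which shows how a stricter threshold trades sample size for label certainty.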
In most studies, spectral data serve as the basis and non-spectral data are used as auxiliary data. However, the combined use of spectral and non-spectral data in this study showed much higher accuracies than modelling with only spectral data. Our results correspond to the results of previous studies [8,33,34,36,37,86]. The finding that integrating auxiliary data with RS data increases the modelling accuracy contradicts the results of [32].
In dense forests such as those found in the Hyrcanian region in the north-east part of Iran, spectral data can only represent the reflectance of the dominant tree species in the upper part of the forest canopy and cannot provide spectral information on species found lower in the canopy. Therefore, the integration of non-spectral data with spectral data increased the classification accuracy of Fagus orientalis L., which is the dominant species in the canopy (over 90%). In addition, inadequate accuracy in the recorded location of the measured samples could be one of the reasons for the reduced effect of spectral data, since high-resolution satellite images are sensitive to even small errors in the spatial coordinates recorded by GPS devices.
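The sensitivity to positional error can be made concrete with a small calculation. The sketch below converts a ground positional error into a pixel offset for the 2.4 m multispectral and 60 cm panchromatic bands used in this study. The 5 m GPS error is an assumed, illustrative value and not a measurement reported here, and `offset_in_pixels` is a hypothetical helper.

```python
def offset_in_pixels(position_error_m, pixel_size_m):
    """Express a ground positional error as a number of image pixels."""
    if pixel_size_m <= 0:
        raise ValueError("pixel size must be positive")
    return position_error_m / pixel_size_m


MS_PIXEL_M = 2.4           # multispectral pixel size used in the study
PAN_PIXEL_M = 0.6          # panchromatic pixel size used in the study
ASSUMED_GPS_ERROR_M = 5.0  # assumption for illustration only

ms_offset = offset_in_pixels(ASSUMED_GPS_ERROR_M, MS_PIXEL_M)
pan_offset = offset_in_pixels(ASSUMED_GPS_ERROR_M, PAN_PIXEL_M)
print(f"{ms_offset:.1f} multispectral pixels, {pan_offset:.1f} panchromatic pixels")
# prints "2.1 multispectral pixels, 8.3 panchromatic pixels"
```

Under this assumption, a plot centre could be displaced by about two multispectral pixels, so the spectral values extracted for a plot may come from neighbouring crowns rather than from the trees that were measured.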
Regarding the comparison between the algorithms we used, the RF algorithm achieved higher tree species classification accuracy than the SVM and k-NN algorithms. The RF algorithm also generally provided higher user and producer accuracies than the SVM and k-NN algorithms. The capability of the RF algorithm to determine the weight coefficients of the independent variables without pruning the tree structure increased its classification accuracy. In contrast, the k-NN and SVM algorithms are not able to recognize the importance of each variable and assign the same weight to all the independent variables in the modelling process. Moreover, RF results had the highest user accuracy compared to the other algorithms. This means that the RF algorithm's predictions of tree species in each region are more reliable.
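For readers less familiar with the accuracy indicators compared here, the following sketch computes overall, producer and user accuracy from a confusion matrix. The class labels and counts are hypothetical and do not reproduce Table 7. The matrix is assumed to have reference (field) classes as rows and predicted classes as columns.

```python
def accuracy_metrics(matrix, labels):
    """Return overall accuracy plus per-class producer and user accuracies.

    matrix[i][j] counts plots of reference class i predicted as class j.
    """
    n = len(labels)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    overall = correct / total

    producers, users = {}, {}
    for i, label in enumerate(labels):
        reference_total = sum(matrix[i])                         # row sum
        predicted_total = sum(matrix[r][i] for r in range(n))    # column sum
        # Producer accuracy: share of reference plots that were found.
        producers[label] = matrix[i][i] / reference_total if reference_total else None
        # User accuracy: share of predictions for this class that are right.
        users[label] = matrix[i][i] / predicted_total if predicted_total else None
    return overall, producers, users


# Hypothetical counts for three species groups (rows: reference, columns: predicted).
labels = ["Fagus", "Carpinus", "Parrotia"]
matrix = [
    [40, 6, 4],
    [5, 20, 5],
    [3, 2, 15],
]

overall, producers, users = accuracy_metrics(matrix, labels)
```

A high user accuracy for a class means that a map pixel labelled with that class is likely to be correct on the ground, which is the sense in which the text calls the RF predictions more reliable.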
According to Yanoviak et al. [87], the RF algorithm had higher modelling accuracy for the distribution of pine than ANN and decision tree methods. In addition, according to Kernes et al. [88], tree models captured the relationships and boundaries better than logistic regression models for predicting spatial shifts in shrub cover. Also, according to [89], RFs are often used in very large geographical areas, and when the number of samples per class is unbalanced, the RF algorithm can still classify with an acceptable level of accuracy, which may be one of the reasons for its superiority over the SVM and k-NN. Naidoo et al. [90] studied the possibility of modelling savanna tree species in the Kruger National Park in South Africa using integrated hyperspectral and light detection and ranging (LiDAR) data and the RF algorithm; their RF model produced 87% accuracy.
In general, the data mining algorithms could not provide the desired accuracy in mapping the distribution of tree species. Various factors, such as lithology, geology, and human activities, affect the distribution of tree species; however, these effects were not considered in this study. Furthermore, plot data recorded with low accuracy (sampling error) may reduce the classification accuracy of the distribution of tree species [36].
In conclusion, our results showed that topography, soil, and climate variables influenced the distribution of tree species, with topographic variables being the most important factors. Combining spectral data with auxiliary data did not improve the classification results.

Figure 1. Location of the research area in Dr. Bahramnia's forestry district, Golestan province in northern Iran.

Figure 2. Classification of sub-order types of soil within the study area.

2.4. QuickBird Data
QuickBird was a high-resolution, commercial earth observation satellite owned by DigitalGlobe; it was launched in 2001 and decayed in January 2015. QuickBird used Ball Aerospace's Global Imaging System 2000. The satellite collected panchromatic (black and white) imagery at 61 cm resolution and multispectral imagery at 2.44 m resolution (at 450 km altitude) and up to 1.63 m (at 300 km), as shown in Figure 3. In this study, a window of a QuickBird image from 8 October 2008 has been used, in four multi-spectral bands with 2.4 m spatial resolution and one panchromatic band with 60 cm spatial resolution. The radiometric quantization level of the images was 11 bits [65].
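As a quick arithmetic check on the figures quoted above (a sketch for orientation, not part of the original processing chain): an 11-bit quantization fixes the number of grey levels, and the ratio of the multispectral to the panchromatic pixel size fixes how many panchromatic pixels fall within one multispectral pixel.

```latex
\begin{align*}
\text{grey levels} &= 2^{11} = 2048
  \quad (\text{digital numbers } 0 \text{ to } 2047),\\
\text{panchromatic pixels per multispectral pixel}
  &= \left(\frac{2.4\ \text{m}}{0.6\ \text{m}}\right)^{2} = 4^{2} = 16.
\end{align*}
```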

Figure 4. Data processing workflow illustrating the structure and order of layer preparation and the different categories of layers used in our study.

Table 1. Statistical table displaying the species distribution based on inventory data.

Table 4. Characteristics of sub-order types of soil.

Table 5. Conversion coefficients of the tasseled cap transformation for QuickBird images.

Table 6. Number of training and test plots in each class.

Table 7. Classification accuracy assessment of tree and shrub species groups using the random forest (RF), support vector machine (SVM) and k-nearest neighbor (k-NN) algorithms.