Factorsr: an Rwizard Application for Identifying the Most Likely Causal Factors in Controlling Species Richness

We herein present FactorsR, an RWizard application which provides tools for the identification of the most likely causal factors significantly correlated with species richness, and for depicting on a map the species richness predicted by a Support Vector Machine (SVM) model. As a demonstration of FactorsR, we used an assessment using a database incorporating all species of terrestrial carnivores, a total of 249 species, distributed across 12 families. The model performed with SVM explained 91.9% of the variance observed in the species richness of terrestrial carnivores. Species richness was higher in areas with both higher vegetation index and patch index, i.e., containing higher numbers of species whose range distribution is less fragmented. Lower species richness than expected was observed in


Introduction
The identification of the main factors affecting species richness or extension risk is an important topic in macroecological, biogeographical, and conservation studies [1,2].Multiple regression (MR) has been frequently used to identify factors influencing response variables such as species richness or occurrence [3], but predictor variables are frequently significantly intercorrelated (multicollinearity), and the most frequently used stepwise-selection techniques (e.g., Akaike Information Criterion (AIC), quasi-Akaike Information Criterion (QAIC), Bayesian Information Criterion (BIC), etc.) are not the most effective way of identifying those variables most likely to influence variation in species richness, because these techniques are focused on finding a single best predictive function and not drawing inferences about the likely causality of variables [4].Hierarchical partitioning (HP) [2] has been suggested as a possible solution to identify the most likely causal factors in an MR, because it considers all models in an MR setting jointly [5].
The relationship between species richness and some factors may be non-linearly related [6].Therefore, the selection of the factors significantly affecting species richness would be improved if both untransformed and/or log-transformed factors are taken into account when selecting the factors.
Once the variables have been selected, it may be useful to perform a model for predicting species richness.Techniques such as the Support Vector Machine (SVM) models can model complex and non-linear patterns without making any assumptions.Unlike MR or other classical statistical techniques, SVM uses kernel functions in order to operate in a feature space with a higher dimension than the input space [7].Therefore, this facilitates overcoming common problems in species distribution modeling such as the autocorrelation [8].Another difference compared to MR is that SVM is based on the structural risk minimization principle, so it attempts to minimize the generalization error instead of minimizing the observed training error.As a consequence of this theoretical basis, SVM is a more robust technique with a better generalization capability and a lower risk of overfitting [8].Some comparative studies in environmental applications have shown that SVM outperforms MR or other modeling techniques [9][10][11][12][13].SVM has been successfully applied to explain the factors affecting species richness of marine elasmobranchs [14] and of freshwater fish species [15].
We herein present FactorsR (Supplementary Material: Help FactorsR), a freely available RWizard application which provides tools for the identification of the most likely causal factors (working with untransformed and/or log-transformed variables) significantly correlated with species richness and for depicting on a map the species richness predicted by an SVM model.The algorithm of FactorsR follows these steps: (1) quantification of the severity of multicollinearity among the factors; (2) estimation of the contribution of factors to species richness by using hierarchical partitioning; (3) selection of the factors with higher contribution among auto-correlated factors; (4) selection of those factors significantly correlated with species richness using the Akaike Information Criterion (AIC) in a stepwise multiple regression; and finally (5) species richness predictions and residuals of the SVM model performed with the factors selected are shown on a map.The application also performs statistical analyses to test for normality and homoscedasticity of residuals and autocorrelation in the MR.

Installation and Platform Availability
FactorsR can be installed and started as a regular R add-on package in both Linux and Mac OS installing the file PlotsR.tar.gz(Package source) and in RGUI (R Graphical User Interface) for Windows installing the file PlotsR.ZIP (Windows binaries), both files are available at the website [16], in the downloads section, i.e., FactorsR can be used without the need of RWizard.However, it is only when it is used in conjunction with RWizard that all its capabilities are truly revealed.To use FactorsR in RWizard, it is necessary to install the file Setup.FactorsR.v1.0.RAR is available on the website mentioned above.In this way, FactorsR will be added as an RWizard application.FactorsR is not available through Comprehensive R Archive Network (CRAN) because its large size is not acceptable according to CRAN policies.This large size is due to figures and a full explanation of the algorithm included in the help manual to facilitate the work of users.
RWizard is an open source interface under GNU General Public License that has been developed in C# on the Net platform.It is designed as an interface to facilitate the interaction with R (see video demonstration at the website [16].The only requirement for the installation of RWizard is that R and Net Framework 4.0 must be already installed, and then the file Setup.RWizard.V1.1.exe(also available as RAR file) must be installed.
Geographic data about country, region, and province, etc. borders were obtained from the web page [17].

Software Design
It has a friendly menu to easily import a file (CSV format), to transform the variables, to select the statistical analyses to be performed, to export plots and maps to JPEG files or windows device, and to select the administrative areas to be included in the maps (from the entire world to any specific country, region, province, etc.).Any modification on this menu is passed to the script.For further details, see the user's manual of Rwizard only available at the website [16] in the documentation section.

Functions of R Used in FactorsR
The goodness-of-fit measures for the entire hierarchy of models using all combinations of N independent variables is calculated using the package hier.part[18].Data is analyzed using multiple regressions and the SVM for regression.In both models there is the possibility to use log-transformed variables.Stepwise multiple regressions were performed with the stats R package [19].The relative contribution of each variable in the regressions was also estimated with the different methods available in the R package relaimpo [20,21].The Kolmogorov-Smirnov test with Lilliefors correction was used to test for normality of residuals and performed with the package nortest [22].Residuals from a linear regression or multiple regressions must be independent and the Durbin-Watson statistic is used to detect the presence of autocorrelation violating this assumption, which was performed with the package lmtest [23,24].The Breusch-Pagan test was used to test for homoscedasticity of residuals and performed with the package lmtest [23,24].
The presence of a multicollinearity-linear relationship among regressors-may be a major problem due to causing instability of estimates.The variance inflation factor (VIF) quantifies the severity of multicollinearity, thus providing an index that measures how much the variance of an estimated regression coefficient is increased because of collinearity [25], and this analysis was performed with the package usdm [26].
SVMs are supervised learning models with associated learning algorithms that analyze data and recognize patterns, and are used for classification and regression analysis.The SVM has excellent theoretical properties and a good performance under very general conditions and no strict assumptions are needed [10][11][12][13].We used the function ksvm of the package kernlab [27,28], which supports the well-known C-svc, nu-svc, (classification) one-class-svc (novelty) eps-svr, and nu-svr (regression) formulations along with native multi-class classification formulations and the bound constraint SVM formulations.SVM is a discriminative classifier formally defined by a separating hyperplane, so given labeled training data the algorithm outputs an optimal hyperplane, which categorizes new examples.The operation of the SVM algorithm is based on finding the hyperplane that gives the largest minimum distance to the training examples.The optimal separating hyperplane maximizes the margin of the training data.

Species Distribution
The range maps of terrestrial carnivores were developed using ModestR [29,30] functionalities by importing shape files downloaded from the website of The International Union for Conservation of Nature [31] and geographic records from The Global Biodiversity Information Facility [32].This included a total of 249 species of terrestrial carnivores distributed across 12 families (Table S1.List of species).All families of the order Carnivora without marine species were included, therefore excluding the families Odobenidae, Otariidae and Phocidae.A new species of the olinguito Bassaricyon neblina [32] was not included.
The data was filtered using the following data cleaning facilities available in ModestR [33]: (1) data cleaning when downloading records from Global Biodiversity Information Facility (GBIF) and (2) habitat data cleaning.
For model estimation purposes, the available data resolution was downscaled to 30′ × 30′.As the entire world was used, this encompassed areas in which species richness was zero.

Extent of Occurrence
Extent of occurrence (EOO) is defined as the area contained within the shortest continuous imaginary boundary, which can be drawn to encompass all the known, inferred or projected sites of present occurrence of a taxon [31].
One possibility to calculate the EOO in ModestR is to use the convex hull or the α-shape of the set of present occurrences of a species [30].In this study, the convex hull was used.

Area of Occupancy
Area of occupancy (AOO) is the spatial distribution of known, inferred or projected sites of occurrence expressed in km 2 and was estimated with ModestR as described by [34].We estimated AOO for each species with ModestR using rasterized maps with a resolution of 1′ × 1′ cells, in which the occupied area was estimated using the following equation: The value 1.852 is a nautical mile in km, 12,756.2 is 2× the radius of the earth in km and finally 21,600 in the value used to transform 1 min of arc.The latitude in the equation is the central value of the individual considered pixels.The equation was applied to all the pixels occupied by the species and all the obtained values were summed to calculate AOO.

Patch Distribution
The patch distribution of the species was estimated using the following equation, which ranges from 0 to 1, and is an indicator of habitat fragmentation.A value close to 1 indicates that AOO is equal to EOO, so the species occupies all areas (or habitats) within the geographical limits to its occurrence.A value close to 0 indicates that the species is patchy distributed.
(2) 2.2.5.Area of Occupancy Index and Patch Index Area of occupancy index (AOO index) is the mean AOO of all species present in each cell and was also estimated with ModestR [30].A lower AOO index indicates a higher number of species with a small geographic range size in the cell.
Patch index is the mean patch distribution of all species present in each cell and was also estimated with ModestR.A high patch index indicates a higher number of species with a range distribution less fragmented in the cell.

Environmental Variables
All the environmental variables used in this study are shown in Table S2.Environmental variables and were described in detail by [15].

Results
The factors selected with higher contribution (using hierarchical partitioning), without collinearity among them (a maximum VIF value of 9.4), and significantly correlated with species richness (using a stepwise regression) explained the 60.1% of the variance observed in species richness of terrestrial carnivores (Figure 1).Area of rivers in the cell, BIO3 and AOO index were negatively correlated with species richness.Patch index was the factor with the highest contribution to species richness followed by vegetation index (Figure 1). Figure 2 shows the relationships between species richness and those factors with a higher contribution.

Figure 1.
Relative contribution of the variables significantly correlated with species richness estimated by using hierarchical partitioning.Abbreviations: VI (vegetation index, in g C m 2 •d −1 ), BIO15 (precipitation seasonality, coefficient of variation), TPP (terrestrial primary production in g C m −2 •d −1 ), BIO2 (mean diurnal range in °C), BIO12 (annual precipitation in mm), altitude (in m), BIO3 (isothermality in °C), Pop (human population density in number of people per km 2 ), slope (topographic slope in degrees) and rivers (area of the river basins in km 2 ).LN means that the variable is ln transformed.
It was not possible to test the significance of the regression because the standardized residuals did not have a normal distribution (Shapiro-Wilk, p < 0.001), there was autocorrelation (Durbin-Watson, p < 0.001), and there was no homoscedasticity in the residuals (Breusch-Pagan test, p < 0.001).
The model performed with SVM explained 91.9% of the variance observed in the species richness of terrestrial carnivores.Figure 3 shows the actual richness, the richness predicted by SVM model, and the residuals of the SVM model.There are well-known and widely used estimators and species accumulation techniques directed to predict species richness and to identify well surveyed spatial units (WSsus).
The residuals of the model performed with SVM are also useful for discriminating the location of WSsus.The residuals showed a uniform green color in the map, indicating that most of the averaged residuals were close to zero (Figure 3, lower panel).Negative residuals, which may be areas with undiscovered and/or unregistered species, or areas with lower species richness due to the negative effect of anthropogenic factors, were observed in Chile and other areas of South America, in Madagascar, Sumatra, Taiwan, and Sulawesi (Figure 3).

Discussion
We demonstrated that the algorithm of FactorsR, combining classical statistical techniques for testing multicollinearity and the significance of factors with techniques focused on drawing inferences about the likely causality of factors, is useful for the identification of the factors affecting species richness.The novel contribution of the FactorsR's algorithm is to select among auto correlated factors (using untransformed and log-transformed factors) the most likely causal factors in controlling species richness, before to perform the stepwise multiple regression.Therefore, the algorithm is focused on drawing inferences about the likely causality of the significant variables, which is a more ecological rather than statistical approach.In the example used in this study to illustrate the usage of FactorsR, the variables selected corroborate many previous findings that terrestrial carnivores are vulnerable to habitat fragmentation and require landscape connectivity [36].Species richness was higher in areas with a higher vegetation index, high patch index, and low AOO index, i.e., areas containing a higher number of species confined to a specific, relatively small geographical area and/or species whose range distribution is not fragmented.Instead of measuring the fragmentation of the habitat, we approached the study by analyzing the degree of fragmentation of the species' range distribution.It is necessary to point out, however, that since habitat loss is assumed to be typically accompanied by habitat fragmentation, the results of empirical studies of habitat fragmentation are difficult to interpret if the measure fragmentation does not distinguish between habitat loss and habitat fragmentation per se, i.e., the breaking apart of habitat after controlling for habitat loss [37].Fragmentation of the species' range distribution may be the final product of habitat loss, habitat fragmentation per se, and/or other causes.
The other contribution of FactorsR's algorithm is to perform an SVM model for predicting species richness instead of using MR.The problem is that MR makes several strong assumptions that cannot be violated: no or little multicollinearity, multivariate normality, no auto-correlation, and homoscedasticity of residuals [38][39][40].It is easy to resolve the problem of multicollinearity, by just leaving only one variable among auto correlated variables [39].The problem resulting from residuals without normal distribution is that there can be no assurance that probability of the model, the degree of significance, is the correct one [39].If the assumption of no auto-correlation is not met, it is not possible to know exactly how much the variance is explained by the independent variables [39], so the r 2 is not correct.The fact of not fulfilling the homoscedasticity of the residuals means that the model is not as predictive for the entire range of values of the dependent variable [39].
It is difficult to fulfill all these assumptions of MR in ecological data, particularly the common problem of spatial autocorrelation [41].SVM models, however, can model complex and non-linear patterns without making any assumptions [7].In the example used in this study, it was not possible to test the significance of the regression because the standardized residuals did not have a normal distribution, and there was autocorrelation and there was no homoscedasticity in the residuals, common violations of the MR's assumptions in macroecological, biogeographical and conservation studies [39].Moreover, the variance explained by SVM (91.9%) was much higher than the one obtained with MR (60.1%); thereby corroborating many previous findings that SVM outperforms MR or other modeling techniques [10][11][12][13].

Conclusions
In summary, the novel statistical approach of the FactorsR's algorithm focusing on the identification of most likely causal factors significantly correlated with species richness, may be a useful tool in biodiversity research.Moreover, with its possibility of depicting negative residuals of the SVM model on a map, it may also be useful for identifying areas with undiscovered and/or unregistered species, or areas with lower species richness due to the negative effect of anthropogenic factors.

Figure 2 .
Figure 2. Relationships between species richness and some of the variables with higher relative contribution to species richness.Abbreviations as shown in Figure 1.The diagonal shows the frequency histograms of the variables, and the lower left panel indicates the smooth regression among variables and the upper right panel indicates the regression among variables.

Figure 3 .
Figure 3. Actual richness, the richness predicted by the Support Vector Machine (SVM) model and the residuals of the SVM in cells of 30′ × 30′.Negative residuals indicate a number of observed species in the cell lower than that predicted by the model.