For our study, we used RStudio for most of the data processing, as well as for model building and evaluation. We relied on several imported packages, which are add-ons that extend R's capabilities. The main packages used were raster, caret and dismo; the remaining packages were auxiliary ones needed for smaller tasks. The workflow can also be implemented in Python, with packages such as scikit-learn [39] providing the machine learning algorithms contained in the caret package, and rpy2 [40] allowing the dismo package to be used from Python. During the implementation of the study, we used different methods for processing the data, centered predominantly on feature selection. However, we also applied data transformation and feature extraction, mainly to understand and visualize our data structure. The full code can be accessed in the supplementary materials.
We extracted the predictors’ values at the locations of the wildlife data points and merged this list with the wildlife data frame, resulting in a final dataset containing a value for each predictor at each observation and absence point (an example of a predictor can be seen in Figure 3).
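In R, this point-to-raster lookup is typically done with raster::extract. Since the authors note that the workflow can also be implemented in Python, below is a minimal numpy sketch of the same idea for a north-up raster; the grid, origin, resolution and point coordinates are invented for illustration.

```python
import numpy as np

def extract_at_points(grid, origin_x, origin_y, res, xs, ys):
    """Look up the grid cell value under each (x, y) point.

    `grid` is a 2-D array whose [0, 0] cell has its upper-left corner
    at (origin_x, origin_y) with square cells of size `res` (north-up).
    """
    cols = ((np.asarray(xs) - origin_x) // res).astype(int)
    rows = ((origin_y - np.asarray(ys)) // res).astype(int)
    return grid[rows, cols]

# Toy 3x3 predictor layer with upper-left origin (0, 3) and 1-unit cells.
layer = np.arange(9).reshape(3, 3)
vals = extract_at_points(layer, 0.0, 3.0, 1.0, xs=[0.5, 2.5], ys=[2.5, 0.5])
# vals -> [0, 8]: the upper-left and lower-right cells
```

The extracted column of values would then be bound to the observation table, one row per presence or absence point.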
In order to understand the importance of the different variables in the dataset, we performed a Principal Component Analysis (PCA), aimed at gaining a better understanding of the structure of our data. Beforehand, we transformed the skewed predictors and centered and scaled the data. We also applied one-hot encoding to create dummy variables for the categorical variables (land cover, ecoregions and Existing Vegetation Type (EVT)), followed by a preprocessing step that removes zero- and near-zero-variance predictors, which reduced the dataset to 34 variables.
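A rough Python sketch of this preprocessing chain, log-transforming a skewed layer, one-hot encoding a categorical one, dropping near-zero-variance columns, centering and scaling, and then running PCA via SVD, is shown below; the predictor names, distributions and sample size are all invented.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical predictors: two continuous layers and one categorical one.
elev = rng.normal(1500, 300, 200)
precip = rng.lognormal(3, 1, 200)          # skewed -> log-transform
landcov = rng.integers(0, 3, 200)          # categorical -> one-hot

X = np.column_stack([
    elev,
    np.log1p(precip),                      # tame the skew
    (landcov[:, None] == np.arange(3)).astype(float),  # dummy variables
])

# Drop (near-)zero-variance columns, then center and scale.
X = X[:, X.var(axis=0) > 1e-8]
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: rows of Vt are the component loadings.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt.T                          # observations in PC space
explained = S**2 / (S**2).sum()            # variance explained per PC
```

The first two columns of `scores` correspond to the PC1/PC2 axes plotted in the biplot, and `explained` gives the share of variance each component captures.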
In Figure 4, we can inspect the observations and pseudo-absences across the environmental space produced by the PCA. To interpret the biplot, the rules are:
The X-axis represents PC1, the first component of the PCA, and the Y-axis represents the second component, PC2;
The points in blue are presence points, and those in black, absence points;
The ellipses represent the average distributions of the presence and absence points;
The arrows represent variables; two variables pointing in the same or in opposite directions are highly correlated (positively or negatively, respectively), whereas variables pointing in orthogonal directions are approximately uncorrelated;
The longer the arrow, the higher the importance of the variable for the overall environmental variation.
We can see that most of the presence points (in blue) occupy a specific space, defined by PC1 and PC2, where the variation is largely explained by PC2. It is also possible to identify some variables that are correlated with each other (collinearity). We can see that all the remaining ecoregion categories were positively correlated with most of the EVT categories, explaining the same variations. The same observation can be drawn for some of the temperature and precipitation layers, where some of the latter are negatively correlated with some of the former. The importance of the temperature variables is also visible, by the fact that some of the arrows representing these (such as Bio2_Mean_Diurnal_Range) follow the longest axes of the species’ ellipses (sage-grouse distribution).
At the same time, we also computed a correlation matrix with all our environmental variables, so that we could detect and understand collinearity between them. We used the corrplot function of the corrplot package, which is dedicated to the visualization of correlation matrices. The correlation matrix (Figure 5) shows the pairwise correlation between variables, where the area of each square is proportional to the absolute value of the corresponding correlation coefficient.
At first glance, a strong correlation among the climatic variables is evident, especially between the precipitation and temperature variables. Some correlation of these climatic variables with elevation can also be detected. The remaining variables show only weak, less relevant correlations.
We calculated the Pearson correlation coefficients between all 27 environmental variables and then checked both correlation and importance with a pairs plot function. We applied a threshold of 0.85 to the correlation coefficient: from each pair of variables correlated above this threshold, we excluded the one that contributed least to explaining the variation in our data. The layers finally excluded were Bio2, Bio4, Bio5, Bio6, Bio10, Bio11, Bio15, Bio16, Bio17, Bio18, Bio19 and the ecoregions, leaving a final total of 15 predictors.
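This kind of correlation-based filtering can be sketched in a few lines of numpy. The paper chose which member of a correlated pair to drop based on variable importance; the simplified version below just keeps the first variable of each pair, and the variable names and data are invented.

```python
import numpy as np

def drop_correlated(X, names, threshold=0.85):
    """Greedily drop one variable from each pair with |r| > threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = list(range(len(names)))
    for i in range(len(names)):
        if i not in keep:
            continue
        for j in range(i + 1, len(names)):
            if j in keep and corr[i, j] > threshold:
                keep.remove(j)             # keep the first, drop the second
    return [names[k] for k in keep]

rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.05, size=300)   # nearly a duplicate of a
c = rng.normal(size=300)                   # independent of both
kept = drop_correlated(np.column_stack([a, b, c]), ["Bio1", "Bio2", "Elev"])
# "Bio2" is dropped as redundant with "Bio1"
```

In R, the equivalent step is commonly done with caret's findCorrelation, which likewise removes one variable from each highly correlated pair.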
We extracted the environmental data for each of the points in the dataset, obtaining a data frame with both the presence and background sage-grouse points, and then split the data into training (70%) and testing (30%) sets [46]. The split was made with a function that creates stratified random splits within each class, so that the class distribution is preserved as much as possible (the createDataPartition function from the caret package) [47].
]. To guarantee reproducibility, we set a seed number prior to the partition.
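What createDataPartition does, drawing a seeded random 70% of the indices within each class, can be hand-rolled in numpy as follows; the 70/30 presence/absence vector is a toy example.

```python
import numpy as np

def stratified_split(y, train_frac=0.7, seed=42):
    """Return sorted train/test index arrays preserving class proportions."""
    rng = np.random.default_rng(seed)
    train, test = [], []
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        n_train = int(round(train_frac * len(idx)))
        train.extend(idx[:n_train])
        test.extend(idx[n_train:])
    return np.sort(train), np.sort(test)

y = np.array([1] * 70 + [0] * 30)          # 70 presences, 30 absences
tr, te = stratified_split(y)
# 70% of each class lands in the training set: 49 presences + 21 absences
```

Fixing the seed before the split is what makes the partition, and hence all downstream results, reproducible.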
For RF, SVM and ANN model training, the caret package was utilized, while for MaxEnt, we used the dismo package. For all the models, we used the same partitioned data.
Setting up for RF, SVM and ANN Models
For this paper, supervised classification was used as the approach for the ML algorithms; it can be automated over large areas, a benefit when studying the statewide area of Utah, which covers 219,887 km². Using the ML algorithms, present sage-grouse sites were mapped, and the probabilistic modeling of future sites could then be predicted. The maps resulting from the outputs can be viewed in the supplementary materials.
The classification algorithms used were deemed suitable for our study due to the absence of any assumption of normal distribution; their abilities to deal with the complexities of feature space, patterns and relationships; and the robustness of each model. Moreover, the choice of both categorical and continuous input allowed for flexibility in the predictor variables used.
To perform the training model, some parameters had to be included:
trainControl: defines the resampling type and number of iterations, as well as the search method. We used 10-fold cross-validation with random search.
metric: determines how the final model is chosen, by selecting the tuning parameters that yield the highest value of the objective function. Amongst the available options, we set it to “Accuracy”.
tuneLength: sets the size of the default grid of tuning parameters; set to 15 for all our models.
preProcess: we chose to center and scale the predictors before resampling.
The parameters were selected according to the literature, as well as by exploring different combinations and their effects on model performance. For the RF model, it was also necessary to define the number of trees, which was set to 1000. For reproducibility, we set a seed number before each model. After completing the training setup, we ran each model and studied its outputs. After the final tuning, we ran the predictions for each model based on the trained model and on the stack of predictor layers. This produced a suitability map showing the areas predicted as habitat and non-habitat in the study area. The process of setting up MaxEnt was slightly different: MaxEnt is included in the dismo package, and unlike for the other models, the presence/background vector could not be supplied as a factor, whereas the categorical predictors had to be converted into factors.
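For readers following the Python route mentioned earlier, the caret setup above maps roughly onto a scikit-learn pipeline with RandomizedSearchCV: 10-fold cross-validation, accuracy as the selection metric, centering/scaling inside the pipeline, and a fixed seed. Everything in this sketch is illustrative: the data are synthetic, the tuning grid stands in for caret's mtry search, and the tree and candidate counts are reduced from the paper's values (ntree = 1000, tuneLength = 15) to keep the toy example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))              # toy stand-in for the predictors
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

pipe = make_pipeline(
    StandardScaler(),                      # preProcess = c("center", "scale")
    RandomForestClassifier(n_estimators=100, random_state=42),  # paper: 1000
)
search = RandomizedSearchCV(
    pipe,
    {"randomforestclassifier__max_features": [1, 2, 3, 4, 5]},  # mtry analogue
    n_iter=5,                              # paper: tuneLength = 15
    cv=10,                                 # 10-fold cross-validation
    scoring="accuracy",                    # metric = "Accuracy"
    random_state=42,
)
search.fit(X, y)
```

After fitting, `search.best_params_` plays the role of caret's selected tuning parameters, and the refitted `search.best_estimator_` is what would be applied to the stacked predictor layers.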
Once the model was created, we generated a first prediction map, giving the probability of each pixel in the area being suitable or unsuitable, ranging from 0 to 1. This was performed with the raw output of the model and differed from the other algorithms in that the default output was not a binomial suitability map; to create one, a threshold must be applied to the prediction map. We then evaluated the MaxEnt model on the test data using the evaluate function from dismo. The evaluation required the test data (with the environmental data) to be separated into presence and absence points, so the test data were split into these two categories, and the evaluation was then performed with the MaxEnt model and the test data.
The output is an evaluation model file that includes all the parameters necessary to evaluate the model. Since the other algorithms’ prediction maps are binomial ones showing suitable/unsuitable habitats, it was necessary to apply the True Skill Statistic (TSS) threshold to the predicted probability map from MaxEnt to transform it into a binomial map that we could compare with the outputs from the other models. We based the evaluation of each model’s credibility on performance-based statistics: Cohen’s kappa, Omission and Commission errors, Accuracy, and the confusion matrices; all provide relevant and useful information in model analysis.
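The exact TSS thresholding rule is not spelled out here; a common choice, sketched below, is the cutoff that maximizes TSS = sensitivity + specificity − 1 over the test predictions. The predicted probabilities are invented for illustration.

```python
import numpy as np

def tss_threshold(p_pres, p_abs):
    """Pick the probability cutoff that maximises TSS = sens + spec - 1."""
    cands = np.unique(np.concatenate([p_pres, p_abs]))
    best_t, best_tss = 0.5, -1.0
    for t in cands:
        sens = np.mean(p_pres >= t)        # true-positive rate at cutoff t
        spec = np.mean(p_abs < t)          # true-negative rate at cutoff t
        tss = sens + spec - 1.0
        if tss > best_tss:
            best_t, best_tss = t, tss
    return best_t

# Toy predicted probabilities at the test presences / absences.
pres = np.array([0.9, 0.8, 0.7, 0.4])
absc = np.array([0.1, 0.2, 0.3, 0.6])
t = tss_threshold(pres, absc)              # -> 0.4 for these toy values
binary_map = (np.array([0.05, 0.65, 0.95]) >= t).astype(int)
```

Applying the chosen cutoff to every pixel of the probability map yields the binomial suitable/unsuitable map that can be compared with the RF, SVM and ANN outputs.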
The Omission and Commission Errors can be used to analyze the accuracy of the models when classifying the input points. The Omission Error refers to reference points that were left out (or omitted) from their correct class in the classified map, while the Commission Error refers to sites that were incorrectly assigned (or committed) to a class to which they do not belong.
From these errors, it is also possible to calculate the User’s and Producer’s Accuracy. The Producer’s Accuracy is calculated as 1−Omission Error, and it translates into the percentage of reference points that were not omitted, whilst the User’s Accuracy is calculated as 1−Commission Error for each of the classes and accounts for the percentage of correctly classified sites or pixels. For further assessment of the models, an external evaluation was performed for the state of Idaho, addressed later on in the paper.
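These relationships can be checked on a toy confusion matrix (all counts invented): producer's accuracy is the diagonal divided by the row (reference) totals, user's accuracy is the diagonal divided by the column (mapped) totals, and the omission and commission errors are their complements.

```python
import numpy as np

# Toy confusion matrix: rows = reference class, columns = mapped class
# (class 0 = non-habitat, class 1 = habitat).
cm = np.array([[40, 10],    # 10 reference non-habitat cells mapped as habitat
               [ 5, 45]])   #  5 reference habitat cells omitted

producers = np.diag(cm) / cm.sum(axis=1)   # 1 - omission error, per class
users     = np.diag(cm) / cm.sum(axis=0)   # 1 - commission error, per class
omission   = 1 - producers                 # habitat class: 5/50 = 0.10
commission = 1 - users                     # habitat class: 10/55 ≈ 0.18
overall = np.trace(cm) / cm.sum()          # overall accuracy: 85/100 = 0.85
```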
Future Predictions for Each Scenario
For the future predictions, we used only the environmental data for which future scenarios were available, namely the climatic variables from WorldClim, Land Cover and Elevation. The other environmental layers used previously were not included, since the same set of variables must be available for both the current and future scenarios.
Once all the data were collected and preprocessed, we followed the same model-building steps as before for the present-day data, using the same code. Once the models were created again, we loaded the future raster layers into R and stacked them accordingly to prepare for the predictions. We performed two predictions with each algorithm, one for each of the selected climate change scenarios (SSP2-4.5 and SSP3-7.0). After running the future habitat predictions for each algorithm, the outputs were saved in .tif and .grd formats, which can be read into GIS software for further visualization and post-processing. All the outputs can be viewed in the supplementary materials.
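Conceptually, the per-scenario prediction step flattens each stacked raster into a pixel table, applies the fitted model, and reshapes the result back into a map. Below is a minimal sketch of that loop with a stand-in model and random stand-in layers; in the actual workflow, the model would be one of the trained classifiers and the stacks the future WorldClim, land cover and elevation layers.

```python
import numpy as np

def predict_map(model_fn, stack):
    """Apply a fitted model to a stack of raster layers (bands, rows, cols)."""
    bands, rows, cols = stack.shape
    pixels = stack.reshape(bands, -1).T    # one row per pixel, one col per layer
    return model_fn(pixels).reshape(rows, cols)

# Stand-in "model": habitat wherever the first (e.g. temperature) layer < 0.
model = lambda X: (X[:, 0] < 0).astype(int)

rng = np.random.default_rng(7)
scenarios = {"SSP2-4.5": rng.normal(size=(3, 4, 4)),
             "SSP3-7.0": rng.normal(size=(3, 4, 4))}
maps = {name: predict_map(model, stack) for name, stack in scenarios.items()}
```

Each entry of `maps` is a binary habitat/non-habitat grid, one per climate scenario, analogous to the .tif/.grd outputs described above.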