A Robust Prediction Model for Species Distribution Using Bagging Ensembles with Deep Neural Networks

Species distribution models have been used for various purposes, such as conserving species, discovering potential habitats, and obtaining evolutionary insights by predicting species occurrence. Many statistical and machine-learning-based approaches have been proposed to construct effective species distribution models, but with limited success due to spatial biases in presences and imbalanced presence-absences. We propose a novel species distribution model to address these problems based on bootstrap aggregating (bagging) ensembles of deep neural networks (DNNs). We first generate bootstraps considering presence-absence data on spatial balance to alleviate the bias problem. Then we construct DNNs using environmental data from presence and absence locations, and finally combine these into an ensemble model using three voting methods to improve prediction accuracy. Extensive experiments verified the proposed model’s effectiveness for species in South Korea using crowdsourced observations that have spatial biases. The proposed model achieved more accurate and robust prediction results than the current best practice models.


Introduction
Biodiversity is an indispensable asset for sustainable human life. Although biodiversity is vital to balance natural ecosystems, it has declined dramatically over the past few centuries [1,2]. Humans have consumed biological resources as raw food material, industrial products, and pharmaceutical resources [3][4][5], and their indiscriminate use, as well as and rapid environmental changes, have adversely affected biodiversity preservation [6]. For instance, changes in agricultural forms, residential development, and logging over the past decades have led to a 68% decline in the number of mammals, birds, reptiles, and amphibians worldwide between 1970 and 2016 [7]. Negative biodiversity changes can destroy stable ecosystem balances, and many ecologists have proposed strategies to efficiently maintain ecosystem balance, such as species protection laws and habitat preservation [8][9][10]. Many ecologists have addressed the importance of habitat preservation for species protection, because it is very difficult to restore a habitat's original condition once it is destroyed. Various studies characterized habitats associated with species survival and predicted species viability in new locations [11][12][13]. Species distribution models (SDMs), also known as ecological niche models, habitat models, and range mapping, predict species distributions based on observational data for the species and the related environment. SDMs have become standard tools for ecological studies [14][15][16][17], greatly helping to understand the suitability of a particular species to a habitat and derive possible habitat candidates. Methodologies for constructing SDMs are typically classified into presenceonly (PO) [18][19][20] and presence-absence (PA) [21][22][23][24] based SDMs. PO-based SDMs were widely used before the advent of PA-based SDMs, since they did not require absence data and provide a simple and fast assessment of environmental suitability at a given location, e.g., the surface range envelop (SRE) [25] and Bioclim [26] methods. PO-based models for constructing the environmental suitability model (ESM) were considered useful in terms of the intuitive understanding of a niche; they provide slightly different answers to ecological questions than those of the typical PA-based SDMs. That is, PO-based models have focused primarily on understanding the potential distribution of species from niches. In contrast, typical PA-based SDMs have focused much more on the realized distribution in the evaluation process [27,28]. Therefore, since these two models have different purposes, it is not appropriate to compare them quantitatively.
Recently, PA-based SDMs have emerged with the development of machine learning (ML) technology, making it possible to reflect relationships between various parameters in the prediction. Consequently, SDM capabilities have improved dramatically and provided a new perspective on ecological research. Starting with the generalized linear model (GLM) and generalized additive (GAM) model, various ML methods-such as maximum entropy (MAXENT), random forest (RF), and generalized boosted regression (GBM)-have been used to construct SDMs. [29][30][31][32][33][34][35]. Moreover, by incorporating the PA approach (PA-ML SDMs), SDM models can have their own feature learning process and model optimization strategy. Most works on PA-ML SDMs focused on selecting the ML model that best suits their purpose and finding the best configuration to improve the prediction performance of the realized distribution of a given species.
However, some critical issues remain to be addressed to achieve meaningful predictions using PA-ML SDMs. Typical PA-ML SDMs rarely considered strategies to generate pseudo-absence data, with most approaches generating pseudo-absence randomly. Thus, the pseudo-absence level was out of balance with the presence data, which negatively affected predictive performance. The neglect of the spatial distributions of PA data collected from citizen science databases can lead to bias because SDM prediction is strongly dependent on the spatial distribution of PA data. Species observations collected by citizen scientists are often biased and more aligned with citizen preferences than scientific objectives [36,37]. For instance, observation data tend to be prevalently collected near urban areas, where many citizen scientists live. Deep neural networks (DNNs) have achieved superb performance in diverse domains, including classification, translation, and prediction when enough data is available for model training. Even in SDM, DNNs could be used effectively to identify potential habitats [38] compared to traditional SDM approaches and overcome their shortcomings in prediction when sufficient data are available [39]. Moreover, the effectiveness of DNNs on SDM was evaluated based on various data sample sizes and layer configurations [40].
However, there are several critical issues that most previous DNN-based works on SDM have not considered: (1) Which method is more suitable for generating pseudoabsence data in the SDM process, (2) How to obtain enough observation data with relaxed spatial bias and generate a well-balanced dataset for SDM training, and (3) How to configure DNN to further improve SDM performance compared to the existing SDM approaches. In order to address these issues, in this paper, we propose a novel SDM model using a bootstrap aggregating (bagging) DNN ensemble. The main contributions of this paper are as follows.

2.
We investigate several bias-minimizing methods, including selecting valuable environmental features and generating bootstraps from PA datasets. We use variance inflation factor (VIF) analysis to select suitable environmental features and use random sampling with replacement to generate multiple bootstraps.

3.
We predict species distribution using our ensemble DNNs trained with generated bootstraps, three voting methods, and repeated cross-validation to ensure reliable results. The generated models were compared to state-of-the-art practical SDMs using five evaluation metrics.
The remainder of this paper is organized as follows. Section 2 describes the major steps required to construct the proposed model, and Section 3 shows experimental results to evaluate prediction performance and predicted species distribution visualizations. Section 4 provides an extensive discussion of the findings of the study, and Section 5 concludes the paper. Figure 1 shows the proposed model's overall structure. The model comprises three components: dataset construction, bootstrap generation, and ensemble model construction. We first describe how to construct the dataset, including species observation data, environmental variables, and pseudo-absence data. Then we explain how to generate bootstraps, considering PA data balancing, and finally we discuss how to train multiple DNNs and combine them to predict a particular species' distribution. The remainder of this paper is organized as follows. Section 2 describes the major steps required to construct the proposed model, and Section 3 shows experimental results to evaluate prediction performance and predicted species distribution visualizations. Section 4 provides an extensive discussion of the findings of the study, and Section 5 concludes the paper. Figure 1 shows the proposed model's overall structure. The model comprises three components: dataset construction, bootstrap generation, and ensemble model construction. We first describe how to construct the dataset, including species observation data, environmental variables, and pseudo-absence data. Then we explain how to generate bootstraps, considering PA data balancing, and finally we discuss how to train multiple DNNs and combine them to predict a particular species' distribution.

Dataset Construction
For dataset construction, we considered four least concern (LC) grade species and one endangered (EN) grade species that are considered indicator species for South Korea's ecosystem, with known different habitat characteristics, as shown in Table 1. The observation data of target species including Hynobius leechii [45], Cyanopica cyanus [46], Platalea minor [47], Hypsipetes amaurotis [48], and Hyla japonica [49] were collected from the GBIF database. Hynobius leechii, known as the Korean salamander, is a species of salamander commonly found on the Korean peninsula which typically inhabits forested hills and wetlands. Cyanopica cyanus mainly lives in Eastern Asia, including China, Korea, and Japan, and lives in coniferous and broad-leaved forests. Platalea minor, known as the Black-faced spoonbill, inhabits the marine coastal zone, the marine intertidal zone, and sea cliffs. Platalea minor is mainly observed in Macau, Hong Kong, Taiwan, Vietnam, and South Korea in winter. Hypsipetes amaurotis typically lives in the Far East of Russia, northeastern China, Japan, and the Korean Peninsula. Hypsipetes amaurotis prefer tropical forests, but can adapt to urban and rural environments. Hyla janoica, called the Japanese tree frog, is widespread in Japan, China, northern Mongolia, the Russian Far East, and Korea. Hypsipetes amaurotis lives in mixed and deciduous broad-leaved forests, wetlands, and river valleys with shrubs. This study is based on the observation data of the target species in South Korea, shown in Figure 2.

Dataset Construction
For dataset construction, we considered four least concern (LC) grade species and one endangered (EN) grade species that are considered indicator species for South Korea's ecosystem, with known different habitat characteristics, as shown in Table 1. The observation data of target species including Hynobius leechii [45], Cyanopica cyanus [46], Platalea minor [47], Hypsipetes amaurotis [48], and Hyla japonica [49] were collected from the GBIF database. Hynobius leechii, known as the Korean salamander, is a species of salamander commonly found on the Korean peninsula which typically inhabits forested hills and wetlands. Cyanopica cyanus mainly lives in Eastern Asia, including China, Korea, and Japan, and lives in coniferous and broad-leaved forests. Platalea minor, known as the Black-faced spoonbill, inhabits the marine coastal zone, the marine intertidal zone, and sea cliffs. Platalea minor is mainly observed in Macau, Hong Kong, Taiwan, Vietnam, and South Korea in winter. Hypsipetes amaurotis typically lives in the Far East of Russia, northeastern China, Japan, and the Korean Peninsula. Hypsipetes amaurotis prefer tropical forests, but can adapt to urban and rural environments. Hyla janoica, called the Japanese tree frog, is widespread in Japan, China, northern Mongolia, the Russian Far East, and Korea. Hypsipetes amaurotis lives in mixed and deciduous broad-leaved forests, wetlands, and river valleys with shrubs. This study is based on the observation data of the target species in South Korea, shown in Figure 2.         Several studies have shown that a sufficient level of species observation data must be obtained for a reasonable SDM performance [50,51]. Therefore, we collected observational data from four popular citizen science databases: GBIF, VertNet, BISON, and Naturing. As discussed above, species observational data in citizen databases are usually spatially biased. Hence, we used the spatial thinning (ST) algorithm in the "spThin" package [52] to alleviate data bias, which returns a refined dataset with the maximum number of presences for a given thinning distance when run for sufficient iterations. ST identifies several new subsets from a set of presences that meet the minimum nearest neighbor distance (NND) constraint. The specific procedure is as follows: (1) A thinning distance x is determined by the user. (2) Pair-wise distances between all presences are calculated. (3) For each presence, the number of presences within distance x is identified. (4) The presence with the highest number of neighboring presences within the NND is determined. (5) One of the presences identified in Step 4 is randomly removed. (6) Steps 3 to 5 are repeated until no presence in the dataset has a nearest neighbor closer than x. In this study, we set the thinning distance to 10 km by considering the target area and the locations of species observation data.
The places where the species were observed were closely related to environmental factors. In particular, land condition, season, and climate have a strong influence on species migration or habitat determination. The climatic condition is generally an important factor in determining the habitat in which a species can live, and the distribution of species can be derived based on this at various spatial resolutions. In addition, most species' occurrence patterns are highly influenced by temperature and moisture, which are associated with precipitation along the topography. Therefore, it is believed that the distribution of a given species is closely related to topographic and climatic conditions. As all five species have different suitable habitat types, and the sustainable living of each species relies heavily on the land type, we selected the land types as predictors, which elaborately represent the state of the land with a high spatial resolution. As a result, we considered 19 bioclimatic variables from the WorldClim dataset [53] and 14 land cover variables from GlobCover 2009 [54] as input variables. We used 30 arcsec grid bioclimatic variables corresponding to approximately 1 km resolution for small region predictions across South Korea. The GlobCover 2009 dataset was converted from ENVISAT's medium resolution imaging spectrometer data with approximately 300 m resolution for each spatial grid. We configured all layers' resolutions to 3000 × 3000 pixels and cropped all layers to the entire South Korean territory, represented by (125.000, 38.083), (129.583, 38.083), (125.000, 33.166), and (129.583, 33.166). Cropped layers were georeferenced based on the World Geodetic System (WGS84). Table 2 summarizes the 33 environmental variables used as input.
We used the variance inflation factor (VIF) to avoid over-fitting, where R 2 i is the coefficient of determination for x i on the other independent variables, and p is the number of independent variables. VIF estimates the impact of the collinearity of x i on the other independent variables.
VIF is calculated through linear regression, where the regression coefficient there is no multi-collinearity between the input variables, and vice versa. VIF > 10 represents a strong collinearity, which will adversely affect the modeling results. We performed a stepwise selection of environmental variables using VIF. Because VIF values change after each variable is removed, it is not sufficient to use the entire set of environmental variables in the initial comparison. Therefore, we eliminated environmental variables with strong collinearity using a stepwise process to recalculate the remaining input variable's VIF value after eliminating variables with VIF > 10 [55]. For instance, we calculated the VIF value for each variable, removed the variable with the highest VIF value, and then recalculated all VIF values with a new set of variables until all values were below the threshold.

Pseudo-Absence and Bootstrap Generation
To generate effective pseudo-absence data, we focused on how to balance the presence and absence data, and where to generate pseudo-absence data. Although PA data should be balanced for effective training, most previous studies have hardly considered this balance when creating datasets. This class imbalance has hindered the predictive performance for ML algorithms and can be alleviated by keeping the PA ratio intermediate [56].
Many SDM studies have shown that the locations implied by generated pseudoabsence data can affect SDM predictive performance [56][57][58]. Randomly generated pseudoabsence data is commonly used to construct SDMs, but this is not always best for all cases [58]. Hence, we investigated random generation (RG), random generation with exclusion buffer (RGEB), and random generation with environmental profiling (RGEP) to find a stable pseudo-absence generation method utilizing the "mopa" package from the R software repository [59]. This package provides designing tools for several factors that influence SDM uncertainty, such as pseudo-absence generation, statistical analysis of selected predictors, and climate projections widely accepted in the species distribution modeling process. RG generates pseudo-absence data randomly across the entire area of interest, whereas RGEB adjusts distances between pseudo-absence data using an exclusion buffer. Several empirical studies have recommended the exclusion buffer = 10 km around each presence location to avoid grids containing both presence and pseudo-absence data [52][53][54]. Therefore, we also adopted a 10 km exclusion buffer when generating random pseudo-absences. RGEP defines the environmental range in which pseudo-absences are sampled. Inappropriate environmental areas can be inferred by one class support vector machines (OCSVMs). An OCSVM, which is one of the unsupervised learning algorithms, is trained only on normal data. In this study, the OCSVM learns the valid boundaries of presences, so any point outside the boundary can be considered a valid site when generating pseudo-absences. Therefore, pseudo-absence locations generated by RGEP are not environmentally similar to presence locations. We have attached the pseudo-absence generation code using the "mopa" package in the Supplementary Materials. Figure 3b-d show pseudo-absence data generated by RG, RGEB, and RGEP, respectively, from the corresponding presence data in Figure 3a.
We subsequently applied bootstrap sampling considering the PA balance. Bootstrap sampling is a resampling method that samples independently with replacement from a sample dataset with the same sample size and performs inferences among these resampled data. Bootstrap sampling reduces biases and strengthens prediction model generalizability [60,61]. Prediction model robustness can be improved by training using bootstraps and combining them, which is known as bootstrap aggregating or bagging. Many studies have shown advantages from bagging [60][61][62]. We generated 10 times more pseudo-absence data than presence data using three pseudo-absence data generating methods, as shown in Figure 3b-d, and selected the same number of pseudo-absence data points as the number of presences, allowing for resampling with replacement. Bootstrap samples were generated using the "boot" R package [62]. We included the code for generating bootstraps using the "boot" package in the Supplementary Materials. Thus, we obtained N bootstrap samples suitable for training machine learning models.
ating pseudo-absences. Therefore, pseudo-absence locations generated by RGEP are not environmentally similar to presence locations. We have attached the pseudo-absence generation code using the "mopa" package in the Supplementary Materials. Figure 3b-d show pseudo-absence data generated by RG, RGEB, and RGEP, respectively, from the corresponding presence data in Figure 3a. (c) Pseudo-absence generated by RGEB (d) Pseudo-absence generated by RGEP We subsequently applied bootstrap sampling considering the PA balance. Bootstrap sampling is a resampling method that samples independently with replacement from a sample dataset with the same sample size and performs inferences among these resampled data. Bootstrap sampling reduces biases and strengthens prediction model generalizability [60,61]. Prediction model robustness can be improved by training using bootstraps and combining them, which is known as bootstrap aggregating or bagging. Many studies have shown advantages from bagging [60][61][62]. We generated 10 times more pseudo-absence data than presence data using three pseudo-absence data generating methods, as shown in Figure 3b-d, and selected the same number of pseudo-absence data points as the number of presences, allowing for resampling with replacement. Bootstrap

Ensemble Approach for Model Construction
Recent deep learning developments have enabled new SDM possibilities. Since DNNbased SDMs have already shown better performance than various traditional models [38,63,64], we combined multiple DNNs constructed using bootstraps to maximize prediction performance, which is known as ensemble modeling. Figure 4a shows the proposed DNN model structure for species distribution predictions. The model comprises three layers, similar to typical DNN models, with the input layer receiving environmental variables corresponding to PA locations. Hidden layers perform a weighted summation of the inputs followed by a non-linear transformation. The output layer produces final  To optimize the DNN, we calculated optimal hyper-parameters using grid search [59,60], which includes an infinite number of grid search iterations, five cross-validations, four hidden layers (with 250, 200, 150, and 100 neurons each), a batch size of 75, a learning rate of 0.001, and 10,000 epochs with early stopping. Other hyper-parameters are L2 regularization, to prevent the overfitting of DNN, adaptive moment estimation (Adam) optimization, to update the weights of the DNN model iteratively based on training data, the rectified linear unit (ReLU) activation function, to overcome the vanishing gradient problem, and the cross entropy error loss function, to calculate the difference between two probability distributions. These hyperparameters were determined empirically because they provided the best performance during cross-validation.
To initialize the weights for neuron inputs, we used the he-normal (HE) initialization. Each DNN model was trained until the optimization process was done. Our trained DNN model was used to produce the probability of presence and absence for a species. The more suitable the regions in the study area for a particular species' habitat, the closer the probability is to 1, and vice versa.
Subsequently, we built an ensemble model based on the trained models to improve generality and predictive performance. The ensemble technique reduces over-fitting and provides a better predictive performance than a single model, and has been widely used To optimize the DNN, we calculated optimal hyper-parameters using grid search [59,60], which includes an infinite number of grid search iterations, five cross-validations, four hidden layers (with 250, 200, 150, and 100 neurons each), a batch size of 75, a learning rate of 0.001, and 10,000 epochs with early stopping. Other hyper-parameters are L2 regularization, to prevent the overfitting of DNN, adaptive moment estimation (Adam) optimization, to update the weights of the DNN model iteratively based on training data, the rectified linear unit (ReLU) activation function, to overcome the vanishing gradient problem, and the cross entropy error loss function, to calculate the difference between two probability distributions. These hyperparameters were determined empirically because they provided the best performance during cross-validation.
To initialize the weights for neuron inputs, we used the he-normal (HE) initialization. Each DNN model D i was trained until the optimization process was done. Our trained DNN model was used to produce the probability of presence and absence for a species. The more suitable the regions in the study area for a particular species' habitat, the closer the probability is to 1, and vice versa.
Subsequently, we built an ensemble model based on the trained models to improve generality and predictive performance. The ensemble technique reduces over-fitting and provides a better predictive performance than a single model, and has been widely used for ecological modeling [63,64]. Figure 4b presents the schematic of the ensemble model construction process. Each DNN model D i trained by bootstrap sample i performs a fivefold cross-validation. Then, each D i is evaluated by the true skill statistic (TSS) value that is used as the weight in ensemble modeling. The trained DNNs, which can be regarded as sub-models, were combined using majority (MV), weighted majority (WMV), and weighted soft (WSV) voting schemes. MV is a simple and effective combination method to solve classification problems, and we denoted class label outputs for sub-model D i as c-dimensional binary vectors, where i = 1, . . . , N; d i,c = 1 or 0 depending on whether sub-model d i,c correctly chooses c or not; and N is the number of sub-models. The majority voting rule gives an ensemble decisionŷ for class prediction.ŷ WMV assigns a weight to each sub-model and aggregates their prediction results. The underlying concept is that an outstanding sub-model will have a relatively high weight, ≈ 1, whereas weaker sub-models will have a relatively lower weighted value, ≈ 0. Weight w i can be assigned to the ith sub-model using the normalized true skill statistic for each sub-model, i.e., the average from a five-fold cross-validation. TSS is a reliable evaluation metric for SDM performance. Hence, we used a normalized TSS for the weight, with the final prediction:ŷ Assuming the ensemble model is well calibrated for predicting species distribution, a weighted soft voting scheme can be applied to combine the sub-models. Class labels are determined based on the predicted probability p for each classifier,

Evaluation Approach
Five evaluation metrics were employed to assess SDM performance: area under the curve (AUC), sensitivity, specificity, kappa statistic K, and true skill statistic TSS, which have been widely used for ecological studies [56][57][58]. Table 3 shows the confusion matrix summarizing correspondence between observations and predictions in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). We used 0.5 as the threshold to calculate TP, TN, FP, and FN, which are widely used in SDM evaluation. Table 3. Confusion matrix for SDM evaluation.

Predicted Present Predicted Absent
Actually present True positive False negative Actually absent False positive True negative AUC represents how well the model discriminates presence from pseudo-absence data, where AUC closer to 1 implies better discrimination performance. Sensitivity, also called hit rate, measures how well the model infers presence data, Specificity indicates how well a model can infer absence data, Kappa statistic K measures the extent to which the agreement between observed and predicted is higher than that expected by chance alone, Therefore, K can alleviate overestimating accuracy. TSS is defined from the standard confusion matrix components and represents matches and mismatches between observation and predictions, TSS = Sensitivity + Specificity − 1 TSS is often used as an alternative to AUC, and is probably the best model performance summary. Table 4 shows the evaluation criteria for AUC, K, and TSS metrics to evaluate SDM performance.

Experimental Setting
The prediction performance for the proposed model was compared with the current best practice SDMs on several public datasets. Datasets were balanced using three data generation methods to solve the bias problem for most observation datasets, and their effectiveness was evaluated. Table 5 shows that the comparison SDMs included classification tree analysis (CTA), shallow neural network (SNN), flexible discriminant analysis (FDA), and multivariate adaptive regression splines (MARS), GLM, GBM, RF, SRE, and MAXENT. All models were implemented using "BIOMOD2" in R, a popular tool for ecological modeling [61]. Training strategies and selected parameters are also shown in Table 5. Three prediction models were considered depending on the ensemble method: MV-based DNNs for the ensemble (MV-EDNN), WMV-based DNNs for the ensemble (WMV-EDNN), and WSV-based DNNs for the ensemble (WSV-EDNN). These were built using the Scikit-learn Python package [65], with 80% of the target species observations as training and 20% as test sets. We used the "BIOMOD_Modeling" function in the "BIOMOD2" R software package to evaluate and calibrate the range of species distribution models techniques run over a given species [64]. In the Supplementary Materials, we included the calibration results of all prediction model and target species, as shown in Figures S1-S5. Our calibration process was performed with an 80% random subpart of the given species' presence-absence dataset. To validate the prediction models, we used a five-fold cross-validation with five-time

Pseudo-Absence Generation Strategy Effects
This section compares various SDM performances under three pseudo-absence generation strategies. Although RG is popularly used to generate pseudo-absence data for SDM, we also tested RGEB and RGEP to find the best method to improve predictive power. We set the presence and pseudo-absence ratio for the training and test sets to 0.5, with 80% and 20% dataset split, respectively. Tables 6-8 show average estimation results for the three generation methods for all target species.   Most SDMs showed a significant prediction performance improvement under RGEB and RGEP compared with RG. Although RG is the simplest and most common method to generate pseudo-absences, RGEB and RGEP provided a better predictive ability improvement, resulting in more reliable prediction maps for species distribution. Figure 5 compares the prediction performance for the three generation methods using a five-fold cross validation. RGEP provided a better performance and variance for all metrics compared with RG and RGEB, with mean sensitivity = 0.681, 0.589, and 0.645, respectively. In the case of specificity, RGEB and RGEP methods reduced false absence predictions, i.e., narrower specificity box ranges than RG. Considering AUC, K, and TSS, RGEB and RGEP methods provided a better prediction ability for most SDMs than RG. Thus, RGEP was the most effective overall method in terms of performance improvement.

SDM Stability for Unbalanced Datasets
This section investigates the SDM prediction stability for different presence to pseudoabsence ratios. Training sets were organized by randomly selecting 0.8 × n × π presences and 0.8 × n × (1 − π) absences from the full dataset excluding test data, where n and π represent dataset size and class ratio, respectively. We used 0.2 × n × 0.5 as test presence and absence. Table 9 shows the evaluation results for SDM target species with respect to π using a five-fold cross-validation and mean evaluation metrics.

SDM Stability for Unbalanced Datasets
This section investigates the SDM prediction stability for different presence to pseudo-absence ratios. Training sets were organized by randomly selecting 0.8 × n × π presences and 0.8 × n × (1 − π) absences from the full dataset excluding test data, where n and π represent dataset size and class ratio, respectively. We used 0.2 × n × 0.5 as test presence and absence. Table 9 shows the evaluation results for SDM target species with respect to π using a five-fold cross-validation and mean evaluation metrics.    Figure 6a shows that the highest results for all the evaluation metrics were achieved for balanced class ratio, i.e., π = 0.5. The average sensitivity for most SDMs decreased as π decreased, and vice versa for specificity. The three proposed models, MV-EDNNs, WMV-EDNNs, and WSV-EDNNs, achieved the best overall prediction performance for 0.2 < π < 0.5. The bagging-based SDMs' (the three proposed and RF models) performance reduced as π decreased from 0.5 to 0.2. Sensitivity decreased considerably, except for bagging-based SDMs, as π changed from 0.5 to 0.2, whereas specificity showed a relatively large increase for GLM, CTA, SNN, GBM, and MAXENT SDMs (Figure 6b). Figure 6c-e show that although GLM, SNN, MARS, GBM, MAXENT, SRE, CTA, and DNN achieved a relatively high AUC, they had a relatively low K and TSS performance.
Thus, these SDMs were good at predicting pseudo-absences, but were not good at predicting presences. AUC has been a standard method to assess predictive distribution model accuracy. However, it is highly affected by a well-predicted absence rate. All SDMs except bagging-based SDMs exhibited significantly decreased K as π decreased to 0.2, whereas bagging-based SDMs exhibited very little K reduction as π decreased. Thus, bagging-based SDMs outperformed typical SDMs as the species datasets became more unbalanced.

Impact of the Ensemble Size
This section investigates the impact of the ensemble size for constructing SDM efficiently. The ensemble size refers to how many bootstrap trained sub-models were combined to build the final prediction model. Combining multiple sub-models does not always improve prediction performance; hence, it is important to find optimal combinations in terms of processing time and cost. Therefore, we measured sensitivity, specificity, AUC, K, and TSS for increasing the ensemble size from 2 to 40 for five species (Hynobius leechii, Cyanopica cyanus, Platalea minor, Hypsipetes amaurotis, and Hyla japonica). As above, we used a five-fold cross-validation for MV-EDNN, WMV-EDNN, and WSV-EDNN, and average evaluation scores to assess the impact of the ensemble size. Figure 7 shows that although there is some difference depending on the species, performance improved as the ensemble size increased up to a certain point, with relatively little further improvement beyond that, e.g., maximum K and TSS = 0.891 and 0.890, respectively, for Hynobius leechii when ensemble size = 27 ( Figure 7a). Thus, an appropriate ensemble size can provide a better performance than individual models in most cases.  Figure 6a shows that the highest results for all the evaluation metrics were achieved for balanced class ratio, i.e., π = 0.5. The average sensitivity for most SDMs decreased as π decreased, and vice versa for specificity. The three proposed models, MV-EDNNs, WMV-EDNNs, and WSV-EDNNs, achieved the best overall prediction performance for 0.2 < π < 0.5. The bagging-based SDMs' (the three proposed and RF models) performance reduced as π decreased from 0.5 to 0.2. Sensitivity decreased considerably, except for bagging-based SDMs, as π changed from 0.5 to 0.2, whereas specificity showed a relatively large increase for GLM, CTA, SNN, GBM, and MAXENT SDMs (Figure 6b). Figure 6c-e show that although GLM, SNN, MARS, GBM, MAXENT, SRE, CTA, and DNN achieved a relatively high AUC, they had a relatively low Κ and TSS performance. Thus, these SDMs were good at predicting pseudo-absences, but were not good at predicting presences. AUC has been a standard method to assess predictive distribution model accuracy. However, it is highly affected by a well-predicted absence rate. All SDMs except bagging-based SDMs exhibited significantly decreased Κ as π decreased to 0.2, whereas bagging-based SDMs exhibited very little Κ reduction as π decreased. Thus, bagging-based SDMs outperformed typical SDMs as the species datasets became more unbalanced.

Impact of the Ensemble Size
This section investigates the impact of the ensemble size for constructing SDM efficiently. The ensemble size refers to how many bootstrap trained sub-models were combined to build the final prediction model. Combining multiple sub-models does not always improve prediction performance; hence, it is important to find optimal combinations in terms of processing time and cost. Therefore, we measured sensitivity, specificity, AUC, Κ, and TSS for increasing the ensemble size from 2 to 40 for five species (Hynobius leechii, Cyanopica cyanus, Platalea minor, Hypsipetes amaurotis, and Hyla japonica). As above, we used a five-fold cross-validation for MV-EDNN, WMV-EDNN, and WSV-EDNN, and average evaluation scores to assess the impact of the ensemble size. Figure 7 shows that although there is some difference depending on the species, performance improved as the ensemble size increased up to a certain point, with relatively little further improvement beyond that, e.g., maximum Κ and TSS = 0.891 and 0.890, respectively, for Hynobius leechii when ensemble size = 27 ( Figure 7a). Thus, an appropriate ensemble size can provide a better performance than individual models in most cases.

Impact of the Ensemble Size
This section investigates the impact of the ensemble size for constructing SDM efficiently. The ensemble size refers to how many bootstrap trained sub-models were combined to build the final prediction model. Combining multiple sub-models does not always improve prediction performance; hence, it is important to find optimal combinations in terms of processing time and cost. Therefore, we measured sensitivity, specificity, AUC, Κ, and TSS for increasing the ensemble size from 2 to 40 for five species (Hynobius leechii, Cyanopica cyanus, Platalea minor, Hypsipetes amaurotis, and Hyla japonica). As above, we used a five-fold cross-validation for MV-EDNN, WMV-EDNN, and WSV-EDNN, and average evaluation scores to assess the impact of the ensemble size. Figure 7 shows that although there is some difference depending on the species, performance improved as the ensemble size increased up to a certain point, with relatively little further improvement beyond that, e.g., maximum Κ and TSS = 0.891 and 0.890, respectively, for Hynobius leechii when ensemble size = 27 ( Figure 7a). Thus, an appropriate ensemble size can provide a better performance than individual models in most cases.

Case Study
We selected the Platalea minor's distribution to verify the prediction outcomes. Platalea minor, also known as black-faced spoonbill, lives mainly in Korea, Hong Kong, and Taiwan and is designated as "endangered" in the IUCN red-list (see Table 1). The species has also been designated as a Natural Monument (No. 205-1) by the Cultural Heritage Administration of Korea and is classified as endangered species Ⅰ by the Ministry of Environment of Korea [66]. This species is the rarest of those considered here, with less than 2700 individuals observed in Hong Kong and Taiwan during the winter season. Platalea minor was mainly observed on the west coast of the Korean Peninsula from June to August, which is closely related to their breeding season [66,67], and suitable habitats include marine coastal zones, estuaries, tidal flats, and fishponds. The bird is as tactile feeder, walking slowly and stirring the water with its beak to catch its prey. Therefore, tidal flats and marine coastal zones provide suitable foraging areas, combining shallow and turbid water. Platalea minor collect sticks, etc., from nearby trees or pastures to build their nests, and the distribution of observation data reflected their foraging and breeding habitats well [66]. Figure 8 shows that the observations of Platalea minor in South Korea were more prevalent in and around the Incheon Bay (IC), Seosan Bay (SS), Gunsan Bay (GS), and Jeju Coastal (JJ) regions. The endangered species has a limited habitat and hence a relatively narrow range of observations. We used observations from IC, SS, and JJ as the training set and observations from GS as the test set, with RGEP pseudo-absence generation. Figure 9 shows the distribution prediction results for Platalea minor. WMV-EDNNs achieved the best performance (Sensitivity = 0.976, TSS = 0.958) compared with other SDMs (Sensitivity = 0.699 ± 0.108, TSS = 0.842 ± 0.094). Although the performance of RF was slightly poorer than that of WMV-EDNN, it exhibited excellent prediction performance (Sensitivity = 0.932, TSS = 0.931), considering it employed bootstrapping. Thus, when using observation data with spatial biases collected in a narrow range (e.g., observation data of Platalea minor), typical SDMs showed a relatively lower prediction performance than bootstrap-based approaches.

Case Study
We selected the Platalea minor's distribution to verify the prediction outcomes. Platalea minor, also known as black-faced spoonbill, lives mainly in Korea, Hong Kong, and Taiwan and is designated as "endangered" in the IUCN red-list (see Table 1). The species has also been designated as a Natural Monument (No. 205-1) by the Cultural Heritage Administration of Korea and is classified as endangered species I by the Ministry of Environment of Korea [66]. This species is the rarest of those considered here, with less than 2700 individuals observed in Hong Kong and Taiwan during the winter season. Platalea minor was mainly observed on the west coast of the Korean Peninsula from June to August, which is closely related to their breeding season [66,67], and suitable habitats include marine coastal zones, estuaries, tidal flats, and fishponds. The bird is as tactile feeder, walking slowly and stirring the water with its beak to catch its prey. Therefore, tidal flats and marine coastal zones provide suitable foraging areas, combining shallow and turbid water. Platalea minor collect sticks, etc., from nearby trees or pastures to build their nests, and the distribution of observation data reflected their foraging and breeding habitats well [66]. Figure 8 shows that the observations of Platalea minor in South Korea were more prevalent in and around the Incheon Bay (IC), Seosan Bay (SS), Gunsan Bay (GS), and Jeju Coastal (JJ) regions. The endangered species has a limited habitat and hence a relatively narrow range of observations. We used observations from IC, SS, and JJ as the training set and observations from GS as the test set, with RGEP pseudo-absence generation.  Figure 9 shows the distribution prediction results for Platalea minor. WMV-EDNNs achieved the best performance (Sensitivity = 0.976, TSS = 0.958) compared with other SDMs (Sensitivity = 0.699 ± 0.108, TSS = 0.842 ± 0.094). Although the performance of RF was slightly poorer than that of WMV-EDNN, it exhibited excellent prediction performance (Sensitivity = 0.932, TSS = 0.931), considering it employed bootstrapping. Thus, when using observation data with spatial biases collected in a narrow range (e.g., observation data of Platalea minor), typical SDMs showed a relatively lower prediction performance than bootstrap-based approaches.

Discussion
The key to maintaining biodiversity is to preserve the habitats of endang threatened species, or to find alternative habitats that are more likely to survive. A ing to the IUCN report, it is estimated that as many as one million species are pr to be on the verge of extinction over the next decades. Therefore, a practical counte ure with a high probability of success is urgently needed. In this study, we show well-trained DNNs and their ensemble have achieved better prediction performan typical machine learning methods such as GLM, GAM, CTA, SNN, FDA, MARS, R and MAXENT, which are widely used for species distribution prediction. Our pr model can be used to find alternative habitats for endangered species, and can be re as a long-term species conservation strategy.
In general, species' presence and absence data are needed to build an SDM. If real absence data are not available, one alternative to obtaining absence data is to g pseudo-absences. Several studies have suggested that pseudo-absence data sho confined to locations that are clearly not suitable for species habitats [56,68]. Howe large study areas, it takes a very long time to individually identify such sites f species, and it is much more difficult to verify the ecological aspect. Therefore, we that an effective pseudo-absence generation strategy is required to construct a p SDM. By implementing RG, RGEB, and RGEP, we have found that the RGEP me the best pseudo-absence generation method when true-absence data are not avail the modeling process, the bootstrapping approach is robust to changes in prevale this approach is worth considering if the acquisition of presence data is limited. we used crowdsourced datasets to obtain data necessary for our SDM construction can cause some biases in the observation data and prediction results. As far as we this is an inevitable limitation when using crowdsourced data. In future studies, w to develop standard procedures for SDM that include the reduction of observation the selection of the best environmental variables, and self-optimization.

Conclusions
Many ecological models have been devised for various purposes, including bi sity conservation, rare species protection, and habitat suitability assessment. Ho current SDMs suffer from critical limitations, such as data imbalance and spati Therefore, we proposed a bootstrap aggregating (bagging) deep neural network

Discussion
The key to maintaining biodiversity is to preserve the habitats of endangered or threatened species, or to find alternative habitats that are more likely to survive. According to the IUCN report, it is estimated that as many as one million species are predicted to be on the verge of extinction over the next decades. Therefore, a practical countermeasure with a high probability of success is urgently needed. In this study, we showed that well-trained DNNs and their ensemble have achieved better prediction performance than typical machine learning methods such as GLM, GAM, CTA, SNN, FDA, MARS, RF, SRE, and MAXENT, which are widely used for species distribution prediction. Our proposed model can be used to find alternative habitats for endangered species, and can be regarded as a long-term species conservation strategy.
In general, species' presence and absence data are needed to build an SDM. If reliable real absence data are not available, one alternative to obtaining absence data is to generate pseudo-absences. Several studies have suggested that pseudo-absence data should be confined to locations that are clearly not suitable for species habitats [56,68]. However, for large study areas, it takes a very long time to individually identify such sites for each species, and it is much more difficult to verify the ecological aspect. Therefore, we believe that an effective pseudo-absence generation strategy is required to construct a practical SDM. By implementing RG, RGEB, and RGEP, we have found that the RGEP method is the best pseudo-absence generation method when true-absence data are not available. In the modeling process, the bootstrapping approach is robust to changes in prevalence, so this approach is worth considering if the acquisition of presence data is limited. Finally, we used crowdsourced datasets to obtain data necessary for our SDM construction, which can cause some biases in the observation data and prediction results. As far as we know, this is an inevitable limitation when using crowdsourced data. In future studies, we plan to develop standard procedures for SDM that include the reduction of observational bias, the selection of the best environmental variables, and self-optimization.

Conclusions
Many ecological models have been devised for various purposes, including biodiversity conservation, rare species protection, and habitat suitability assessment. However, current SDMs suffer from critical limitations, such as data imbalance and spatial bias. Therefore, we proposed a bootstrap aggregating (bagging) deep neural network (DNN) ensemble species distribution model. Specifically, we collected sufficient observations from various citizen science databases, generated bootstraps to train DNNs, and finally combined the DNNs using MV, WMV, and WSV techniques to provide stable ensemble prediction models.
We compared the models with other SDMs to verify the proposed approach's effectiveness using five evaluation metrics. WMV-EDNNs achieved a stronger and more stable prediction performance than the other two ensemble models and existing SDMs for diverse scenarios.
Platalea minor species distribution was visualized using map overlays to show prediction results intuitively. Although Platalea minor observations had a spatially biased distribution in the dataset, WMV-EDNN models achieved a superior predictive performance compared with current SDMs. Thus, bagging-ensemble-based SDMs achieved robustness prediction performance, although the observation dataset was spatially biased and unbalanced.