Identifying Reliable Opportunistic Data for Species Distribution Modeling: A Benchmark Data Optimization Approach

Yu-Pin Lin 1,*, Wei-Chih Lin 2, Wan-Yu Lien 1, Johnathen Anthony 1 and Joy R. Petway 1
1 Department of Bioenvironmental Systems Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan; wanyulien@gmail.com (W.-Y.L.); eternalmoose2@gmail.com (J.A.); d05622007@ntu.edu.tw (J.R.P.)
2 Geographic Information Technology Co., 4F., No. 310, Sec. 4, Zhongxiao E. Rd., Taipei 10694, Taiwan; b97602046@ntu.edu.tw
* Correspondence: yplin@ntu.edu.tw; Tel.: +886-2-3366-3467
Environments 2017, 4, 81; doi:10.3390/environments4040081; www.mdpi.com/journal/environments

Introduction
Improving both the quality and quantity of species occurrence data is crucial for biological monitoring and species distribution modeling (SDM) in the investigation of biodiversity [1][2][3][4]. Although professionally collected data are the preferred data source for SDM, they are expensive to collect and are often in short supply. Data collected using proper crowdsourcing techniques, often termed "opportunistic data" [3][4][5][6][7][8][9][10][11][12] or unstructured volunteer data, can provide ecologists with a variety of biodiversity monitoring data. Consequently, volunteer-based citizen science monitoring systems have attracted a lot of attention. However, even professionally curated databases, which include portals for citizen scientists and increase the amount of structured data available for research, lack adequate coverage of species occurrence. Fortunately, opportunistic data are increasing exponentially as technology that is useful in wildlife monitoring, such as mobile phones and smartphone application software, becomes more widespread [1]. Volunteers can therefore contribute monitoring data to a variety of existing datasets. In the last decade, volunteer-based citizen science monitoring data (henceforth opportunistic data) have been collected by a number of platforms, e.g., eBird [13], BeeID [14], EpiCollect projects [15] and the EnjoyMoths project [3]. The opportunistic data collected through these platforms have also been used in many biological conservation studies, including studies of invasive species [16,17], habitat loss [6], conservation prioritization [18], wild species turnover [10], and wolf colonization [11].
Although the proponents of opportunistic monitoring techniques are quick to point out the benefits of this type of data [3,11], critics note that the data often lack structure and have a number of other limitations [19][20][21]. Since opportunistic data are usually collected by volunteers, most of whom lack formal survey training [22], misidentification and biases such as overrepresentation of certain areas are more prevalent in these datasets [3,23], even though many opportunistic records may be reliable. Therefore, although opportunistic data may supplement professionally collected data, they are not a substitute for them. For example, Kamp et al. [24] demonstrated that opportunistic data might not fulfill one of the most critical functions of a structured monitoring program, i.e., the ability to identify population fluctuations. Spatial biases in opportunistic data can also be problematic when low quality species survey data lead to biased species distribution estimates, which may result in unsuitable biodiversity conservation policies [25]. Compared to models based on monitoring data collected by experts, models based on data collected by untrained citizen scientists can show higher variability [3,8,23]. Such variability can arise from a number of sources, e.g., misidentification of species [3,23], and can result in under- or overestimates of species abundance [3,8]. In addition, opportunistic data typically consist of species presence locations, without information on species absences [8,26]. Munson et al. [27] also found that the eBird opportunistic data had more uncertainty than the professionally collected North American Breeding Bird Survey (BBS) data. Due to these and other issues, many still consider opportunistic data to be low quality and unreliable for research and conservation planning purposes [9].
Advocates of opportunistic data, however, contend that there are a number of techniques for handling reliability issues, and that as long as researchers are aware of the key limitations and use the data appropriately, opportunistic data can supplement professionally collected data and potentially help bridge the gap between science and action [12,28]. Furthermore, the sheer quantity and spatial extent of opportunistic data can provide researchers and policy makers with information on ecological trends that may otherwise go unnoticed due to the relative scarcity of professionally collected data [4,10]. Studies demonstrating that models built on opportunistic data and on professional data yield similar predictive results further justify the use of opportunistic data [27]. A study conducted by Bried and Siepielski [10] also indicated that their presence-only opportunistic datasets contained patterns identical to those of presence-absence systematic datasets.
Although opportunistic data can provide a number of advantages, opportunistic data accuracy varies with monitoring task difficulty [29]. It is therefore essential to assess and maximize the analytical value of specific opportunistic data. Researchers have accomplished this using a number of techniques that balance data quantity with data quality [30]. Since it is often difficult for researchers to objectively assess the merits of data collected by anonymous volunteers, opportunistic data quality is often evaluated in terms of its similarity to professionally collected benchmark datasets [29,31]. Furthermore, cross-validating opportunistic data quality using predicted species presence probabilities [32] enables assessment of record outlier veracity, subsequent flagging, and filtering of these records. Therefore, opportunistic data reliability is increased by direct or indirect data integration with professionally collected field survey data [4,8].
In recent years, a number of statistical and data filtering tools have been proposed which could be effective at removing biases while maintaining biological change signals [1] and addressing data quality issues such as measurement error, spatial clustering, detection, and identification [27]. Here we divide these approaches into two categories: collection-oriented and species-oriented techniques. Collection-oriented techniques rely on data collection standards to validate opportunistic data, i.e., who uploaded the data and how the data were collected. One popular collection-oriented technique, which has spurred numerous related ecological sampling methodologies, considers the amount of filtered data extracted per species per specific site or visited location [9,17,24,33]. Other examples of collection-oriented techniques are those that incorporate meta-data on the relative expertise of data collectors, e.g., Yu et al. [34]. Species-oriented techniques, on the other hand, rely on known species distributions to validate opportunistic data. A number of methods have been used in species-oriented techniques. The most common approaches evaluate the veracity of given opportunistic data based on how similar they are to current expectations. These approaches range from basic statistical approaches that identify outliers, probabilistic models, multi-component occupancy-detection models, hierarchical data filtering methods, and Multivariate Conditional Autoregressive (MVCAR) models to machine learning approaches [1,4,24,26,30,35].
In this study, we focus on opportunistic data collected in Taiwan from the EnjoyMoths project's Facebook (FB) page [36]. Previously, Lin et al. [3] applied natural language processing (NLP) to extract the names of species and places from text in EnjoyMoths FB page posts. We combined the resulting FB data with professionally collected data from the Global Biodiversity Information Facility (GBIF) using the proposed optimization procedure. This method extracts opportunistic data that correspond strongly with the environmental variables of professionally collected data. We then statistically tested the differences between the corresponding environmental variables and the resultant SDM habitat suitability index (HSI) values from GBIF only data, FB only data, and GBIF plus optimally selected FB data. Next, we applied a bootstrapping method and four SDM types to validate our data extraction method. We then performed data uncertainty analysis on our five datasets: (1) GBIF only data; (2) FB only data; (3) GBIF plus FB data (GBIF + FB); (4) GBIF plus optimally selected FB data (GBIF + FB_o); and (5) GBIF plus randomly selected FB data (GBIF + FB_r). In addition, we performed correlation analysis between the SDM "benchmark output" averages and the SDM output averages. Benchmark outputs were based on the GBIF dataset, whereas other outputs were based on the other datasets mentioned above. Based on our validation results, the proposed data filtering method is effective. The results indicate that this technique can extract complementary opportunistic data from existing datasets, thereby providing a more in-depth understanding of the status and trends of biodiversity.

Study Area and Focal Species
Taiwan is a subtropical island with an area of 36,000 km². For this study, we selected nine moth species from the EnjoyMoths project: Asota egens indica, Asota heliconia zebrine, Biston perclarus, Chrysaeglia magnifica, Histia flabellicornis ultima, Hyposidra talaca, Lebeda nobilis, Spodoptera litura, and Traminda aventiaria (see Supplementary 2 for more details). We derived species observations and location coordinates from posts on the EnjoyMoths FB group [36]. NLP then identified FB-posted observations by species name [3].
TaiBIF [37] is the GBIF Taiwan portal specifically designed to broaden the GBIF network and increase the availability of local biodiversity data. Therefore, we used the TaiBIF portal of GBIF as the benchmark dataset in this study. We selected species with more than 15 records in the GBIF TaiBIF dataset (see Supplementary 1 for more details), which yielded 37 records of A. egens indica; 103 records of A. heliconia zebrine; 29 records of B. perclarus; 39 records of C. magnifica; 58 records of H. flabellicornis ultima; 34 records of H. talaca; 21 records of L. nobilis; 33 records of S. litura; and 39 records of T. aventiaria (see Supplementary 2 for more details) (Figure 1). We used fourteen environmental variables as focal species SDM inputs, including the first to fourth principal components of monthly precipitation (mm), the first to third principal components of monthly temperature (°C), elevation, and the Normalized Difference Vegetation Index (NDVI). We applied environmental descriptor variables to create a bio-climatic map for assessing the representation of bio-climate zones against an independent bio-climatic map of Metzger et al. [38]. We found good agreement in bioclimatic and ecosystem patterns between the created and published bio-climatic maps [3].

Optimal Data Filtering Method
To increase the number of available samples, we developed an optimization procedure based on a simulated annealing (SA) algorithm to integrate opportunistic data with professionally collected GBIF data. Two key parameters in the SA procedure are the cooling rate and the number of iterations. This study used three cooling rates, 0.3, 0.4, and 0.5, to obtain optimal datasets from opportunistic data. Given $N_c$ opportunistic data and $N_p$ professionally collected data, the optimization procedure aims to choose $n$ opportunistic data that can increase the average likelihood of species occurrence. We defined the average likelihood as the geometric mean of the presence probabilities given by the maximum entropy presence-only species distribution modeling approach:

$$O = \left( \prod_{i=1}^{N_p + n} q_\lambda(x_i) \right)^{1/(N_p + n)}$$

where $q_\lambda(x_i)$ is the probability of target species presence at location $x_i$, which can be derived from the following equation based on a maximum entropy approach:

$$q_\lambda(x_i) = \frac{1}{G_\lambda} \exp\left( \sum_{j=1}^{g} \lambda_j z_j(x_i) \right)$$

where $g$ is the number of environmental variables; $\lambda_j$ is the coefficient corresponding to driving factor $z_j$ defined by the maximum entropy model; and $G_\lambda$ is a normalizing constant. The optimization steps are as follows:
Step 1. Select $n$ random samples from the $N_c$ opportunistic data.
Step 2. Calculate the objective function, $O$, which is equal to the geometric mean of $q_\lambda(x)$ based on the $N_p$ professional data and the $n$ opportunistic data.
Step 3. Implement an annealing schedule: generate a uniform random number, $r$, between 0 and 1. If $r < 0.5$, add a sample to the $n$ random samples from the rest of the opportunistic data; otherwise, remove a sample from the $n$ random samples at random. Calculate the objective function, $O$.
Step 4. Calculate $M = \exp[-\Delta O / T]$, where $\Delta O$ is the change in the objective function, a comparison between the current $O$ and the last $O$, and $T$ is the cooling rate (0-1).
Step 5. Generate a uniform random number (rand) in the range of 0-1. If rand < $M$, accept the new values; otherwise, discard the changes.
Step 6. Repeat Steps 3-5 until either the objective function value exceeds a given stopping criterion (e.g., $O$ > a default value) or a specified number of iterations (e.g., 100,000 runs) has been completed.
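Steps 1-6 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the presence probabilities are taken as precomputed and held fixed (the Maxent model is not refitted), $\Delta O$ is assumed to be the last $O$ minus the current $O$ so that improvements are always accepted, and the names `anneal_selection`, `q_prof`, `q_opp`, and `cooling_rate` are hypothetical.

```python
import math
import random


def geometric_mean(probs):
    """Geometric mean, computed in log space for numerical stability."""
    return math.exp(sum(math.log(p) for p in probs) / len(probs))


def anneal_selection(q_prof, q_opp, n_init, cooling_rate=0.5,
                     max_iter=10_000, seed=0):
    """Select opportunistic records following Steps 1-6 (illustrative sketch).

    q_prof, q_opp: hypothetical precomputed q_lambda(x) values in (0, 1]
    for professional and opportunistic records (q_prof must be non-empty).
    Returns (sorted indices of selected opportunistic records, objective O).
    """
    rng = random.Random(seed)
    selected = set(rng.sample(range(len(q_opp)), n_init))      # Step 1

    def objective(sel):                                         # Step 2
        return geometric_mean(list(q_prof) + [q_opp[i] for i in sel])

    current = objective(selected)
    for _ in range(max_iter):                                   # Step 6
        candidate = set(selected)
        # Step 3: add a record with probability 0.5, otherwise remove one.
        if rng.random() < 0.5:
            unused = [i for i in range(len(q_opp)) if i not in candidate]
            if not unused:
                continue
            candidate.add(rng.choice(unused))
        else:
            if len(candidate) <= 1:
                continue
            candidate.remove(rng.choice(sorted(candidate)))
        new = objective(candidate)
        # Step 4 (assumed sign): delta_o > 0 means the objective got worse.
        delta_o = current - new
        m = math.exp(-delta_o / cooling_rate) if delta_o > 0 else 1.0
        if rng.random() < m:                                    # Step 5
            selected, current = candidate, new
    return sorted(selected), current


# Toy usage with made-up probabilities.
q_prof_demo = [0.8, 0.7, 0.9]
q_opp_demo = [0.9, 0.1, 0.85, 0.2, 0.95]
print(anneal_selection(q_prof_demo, q_opp_demo, n_init=2, max_iter=500))
```

Because the acceptance probability stays close to 1 for small changes in $O$, a fixed cooling rate of this size lets the search move freely; a stopping threshold on $O$, as in Step 6, could be added as an early exit.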

Statistical Testing on Environmental Variables and HSI Similarity among Datasets
To assess the difference between the corresponding environmental variables among the five datasets ((1) GBIF; (2) FB; (3) GBIF + FB; (4) GBIF + FB_o; and (5) GBIF + FB_r), we applied the Kruskal-Wallis test. We applied the same test to the resultant SDM HSI of each dataset. When we found significant differences among datasets, we performed multiple comparisons to identify how pairs of the three datasets (GBIF, FB, and FB_o) differ in terms of variable importance and HSI values. In addition, we performed correlation analysis between the average SDM benchmark outputs and the average SDM outputs.
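To make the test concrete, the Kruskal-Wallis H statistic can be computed directly from pooled ranks. The sketch below is illustrative only: it omits the tie correction, the closed-form p-value holds only for exactly three groups (chi-square with two degrees of freedom), and the elevation values are invented for the example.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction)."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted, pos = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[pos:pos + len(g)])
        weighted += rank_sum ** 2 / len(g)
        pos += len(g)
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


def p_value_three_groups(h):
    """Chi-square survival function for 2 degrees of freedom: exp(-h / 2)."""
    return math.exp(-h / 2)


# Made-up elevation values (m) at records from three datasets.
gbif = [120, 340, 560, 410, 290]
fb = [30, 45, 80, 60, 150]
fb_o = [200, 310, 450, 380, 270]
h = kruskal_wallis_h([gbif, fb, fb_o])
print(h, p_value_three_groups(h))
```

In practice a statistics library would also apply the tie correction and handle any number of groups.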

Model Performance Evaluation and Data Bootstrapping for Uncertainty Analysis
In this study, we used four SDM types, Generalized Additive Model (GAM), Generalized Linear Model (GLM), Maximum Entropy Modeling (Maxent), and Support Vector Machine (SVM), to estimate habitat suitability distributions [39] of the focal species and to assess the robustness of our data extraction method. We used the area under the receiver operating characteristic curve (AUC) to evaluate the performance of the above SDMs, each trained by one of the five datasets: (1) GBIF; (2) FB; (3) GBIF + FB; (4) GBIF + FB_o; and (5) GBIF + FB_r. We assume that SDM performance is better when trained by the GBIF + FB_o dataset, regardless of the SDM type used. Model performance in concurrence with this assumption validates the proposed data filtering method. Additionally, to better understand the variability of each dataset, we applied a bootstrapping method to each of the five datasets and generated 1000 subsamples for each dataset, each consisting of 80% of the original data. We used each of the 80% subsample datasets to train the four SDM types. We then used the remaining 20% subsample datasets to test the model performances in terms of the AUC values. Boxplots illustrate the statistical distribution of AUC results. We also applied a two-sample Kolmogorov-Smirnov (K-S) test to compare differences in the 1000 AUC values between pairs of 80% subsample datasets and 20% subsample datasets of the five datasets.
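The resampling loop described above can be illustrated with a short sketch. It is not the study's code: the split is assumed to be a random 80/20 partition without replacement, AUC is computed against a fixed set of background points, and `fit_centroid` is a hypothetical toy scorer standing in for GAM, GLM, Maxent, or SVM.

```python
import random
from bisect import bisect_right


def auc(scores_presence, scores_background):
    """Probability that a presence outscores a background point (ties = 0.5)."""
    wins = 0.0
    for p in scores_presence:
        for b in scores_background:
            wins += 1.0 if p > b else 0.5 if p == b else 0.0
    return wins / (len(scores_presence) * len(scores_background))


def ks_statistic(a, b):
    """Two-sample K-S statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
               for v in a + b)


def fit_centroid(train):
    """Toy stand-in for an SDM: score is closeness to the training mean."""
    center = sum(train) / len(train)
    return lambda x: -abs(x - center)


def bootstrap_auc(presence, background, fit, n_rep=1000,
                  train_frac=0.8, seed=0):
    """Repeat a random split; train on 80%, report AUC on the held-out 20%."""
    rng = random.Random(seed)
    n_train = int(round(train_frac * len(presence)))
    results = []
    for _ in range(n_rep):
        shuffled = presence[:]
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]
        score = fit(train)
        results.append(auc([score(x) for x in test],
                           [score(x) for x in background]))
    return results


# Made-up one-dimensional environmental values.
presence_demo = [10.2, 11.0, 9.8, 10.5, 10.9, 9.5, 10.1, 11.3, 10.0, 10.7]
background_demo = [2.0, 25.0, 10.4, 18.0, 5.5, 30.0, 9.0, 14.0]
aucs = bootstrap_auc(presence_demo, background_demo, fit_centroid, n_rep=200)
print(min(aucs), max(aucs))
```

The resulting AUC lists for two datasets could then be compared with `ks_statistic`, mirroring the K-S comparison described in the text.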
We performed a principal component analysis (PCA) to evaluate the consensus between projections of 1000 realizations created by each SDM type for each data source. PCA represents the variation of independent dimensions [40], and is applicable to SDM outputs [41]. We treated the first principal component (PC1) axis as a consensus axis that reflects the general trend followed by the 1000 realizations [42,43]. We evaluated the variability between the 1000 realizations by calculating the proportion of variance explained by the PC1 axis. When all projections are fully consistent with each other, the PC1 axis explains 100% of the variation. In contrast, if the projections are completely inconsistent, the PC1 axis explains only 0.1% of the variation (=1/1000 × 100%, where 1000 is the number of realizations) [42]. That is, a higher proportion of explained variance represents a lower variability between realizations.
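The two limiting cases (100% and 1/R of the variance for R realizations) can be reproduced with a small sketch. The assumptions here are ours: PCA is run on the correlation matrix between realizations, the leading eigenvalue is found by power iteration, and no realization is constant across cells. The function name `pc1_explained_variance` is hypothetical.

```python
import math


def pc1_explained_variance(realizations, n_iter=500):
    """Share of total variance carried by the first principal component.

    realizations: list of R equally long lists, one projected HSI map each.
    Works on the R x R correlation matrix, whose trace equals R.
    """
    r = len(realizations)
    n = len(realizations[0])
    standardized = []
    for col in realizations:
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        standardized.append([(v - mean) / sd for v in col])
    corr = [[sum(a * b for a, b in zip(ci, cj)) / n for cj in standardized]
            for ci in standardized]
    # Power iteration from a slightly uneven start vector.
    vec = [1.0 + 0.01 * i for i in range(r)]
    for _ in range(n_iter):
        new = [sum(corr[i][j] * vec[j] for j in range(r)) for i in range(r)]
        norm = math.sqrt(sum(v * v for v in new))
        vec = [v / norm for v in new]
    # Rayleigh quotient of the unit vector gives the leading eigenvalue.
    eig = sum(vec[i] * sum(corr[i][j] * vec[j] for j in range(r))
              for i in range(r))
    return eig / r


# Three identical realizations versus two uncorrelated ones.
consistent = [[1.0, 2.0, 3.0, 4.0]] * 3
uncorrelated = [[1.0, -1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0]]
print(pc1_explained_variance(consistent), pc1_explained_variance(uncorrelated))
```

With R = 1000 mutually uncorrelated realizations, the same calculation gives 1/1000, i.e., the 0.1% lower bound quoted above.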

Optimized Selection of Opportunistic Data
Figures S1, S2 and Figure 2 display objective function values and average log likelihood during the optimization process under 0.3, 0.4, and 0.5 cooling rates, respectively. As shown in Figure 2, the average log likelihood of all species converged between iterations 1500 and 1800 under a 0.5 cooling rate.
The average log likelihood for the nine species increased by 185.7% (0.36 to 0.93) from initial states to optimal states using the proposed approach under a 0.5 cooling rate. Moreover, Table 1 shows the number of observations in GBIF, NLP-extracted FB, and FB_o, as well as the proportion of the number of observations in FB_o to the number of observations in FB, for each species under 0.3, 0.4, and 0.5 cooling rates. With the proposed optimal data filtering technique and a 0.5 cooling rate, the available sample data increased to totals of 91, 162, 72, 62, 69, 62, 41, 53 and 67 for Asota egens indica, Asota heliconia zebrine, Biston perclarus, Chrysaeglia magnifica, Histia flabellicornis ultima, Hyposidra talaca, Lebeda nobilis, Spodoptera litura, and Traminda aventiaria, respectively, that is, 2.46, 1.57, 2.48, 1.59, 1.19, 1.82, 1.95, 1.61 and 1.72 times the original data available in the GBIF database. The average log likelihood for the nine species increased from 0.29 to 0.87 under a 0.4 cooling rate, and from 0.23 to 0.71 under a 0.3 cooling rate (Figures S1 and S2).
Under a 0.5 cooling rate, the greatest proportion of data utilized from the volunteer data (58%) appears in the data filtering results of A. heliconia zebrine, while H. flabellicornis ultima shows the lowest proportion of data utilization from opportunistic data (16%). Table 2 maps observation locations under a 0.5 cooling rate and species distributions modeled on opportunistic data from FB, professionally collected data from GBIF, and a combination of the two. The selected opportunistic data tend to reach the highest point-densities in the north of Taiwan. Figures S3 and S4 represent optimal observation locations selected under 0.3 and 0.4 cooling rates as well as corresponding simulated species distributions (Supplementary 3).
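The reported multipliers follow from dividing each combined total by the corresponding GBIF record count. The short check below uses only figures quoted in the text; reading the difference between the two counts as the number of selected FB_o records is our assumption, since Table 1 is not reproduced here.

```python
# GBIF record counts and combined totals (0.5 cooling rate) quoted in the text.
gbif_counts = {
    "Asota egens indica": 37, "Asota heliconia zebrine": 103,
    "Biston perclarus": 29, "Chrysaeglia magnifica": 39,
    "Histia flabellicornis ultima": 58, "Hyposidra talaca": 34,
    "Lebeda nobilis": 21, "Spodoptera litura": 33,
    "Traminda aventiaria": 39,
}
combined_totals = {
    "Asota egens indica": 91, "Asota heliconia zebrine": 162,
    "Biston perclarus": 72, "Chrysaeglia magnifica": 62,
    "Histia flabellicornis ultima": 69, "Hyposidra talaca": 62,
    "Lebeda nobilis": 41, "Spodoptera litura": 53,
    "Traminda aventiaria": 67,
}

ratios = {}
added_records = {}
for species, n_gbif in gbif_counts.items():
    total = combined_totals[species]
    ratios[species] = round(total / n_gbif, 2)
    # Assumed interpretation: records added on top of the GBIF benchmark.
    added_records[species] = total - n_gbif
    print(f"{species}: x{ratios[species]} (+{added_records[species]} records)")
```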
Table 2. Observed species locations in FB (red, left), GBIF (blue, center), and GBIF + FB_o (red/blue, right), and habitat suitability distributions based on the above-mentioned datasets for each species.


Statistical Testing on Environmental Variables and HSI Similarity among Datasets
The average environmental variables corresponding to the three datasets, GBIF, FB, and FB_o, and the results of a Kruskal-Wallis (KW) test on the environmental variables of different datasets are shown in Table S1, as are the results of multiple comparison tests between the datasets for each environmental variable. Datasets displayed in paired combinations are those datasets that differed significantly. Significant differences in variables such as watershed, distance to city, and the third principal component of precipitation are apparent among datasets for all species. Forest, watershed, distance to city, and the first through third principal components of temperature all demonstrated significant differences for most species. Only one variable, distance to road, was not significantly different among datasets. Table S1 also displays the average HSI values of models based on the GBIF, FB, and FB_o datasets for each species, as well as the KW p-value results between datasets and multiple comparison results for each species. All species showed significant differences between HSIs among datasets. Multiple comparison tests showed that the average HSIs of FB and FB_o were significantly different for all species, with the average HSIs of FB_o greater than those of FB. Table 3 shows the Pearson correlation coefficients between species distributions based on the GBIF dataset and alternative combinations of various selected datasets. The SDM output averages based on the GBIF dataset and combinations of the GBIF dataset plus optimally selected or randomly selected opportunistic datasets are highly correlated.
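For reference, the Pearson coefficient between two gridded SDM outputs is the covariance of the cell values divided by the product of their standard deviations. The sketch below is a generic illustration with invented HSI values, not the data behind Table 3.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


# Made-up HSI values for the same six grid cells under two training datasets.
hsi_benchmark = [0.10, 0.35, 0.60, 0.80, 0.55, 0.20]
hsi_combined = [0.15, 0.30, 0.65, 0.85, 0.50, 0.25]
print(pearson(hsi_benchmark, hsi_combined))
```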

Statistical Testing on Environmental Variables and HSI Similarity among Datasets
The average environmental variables corresponding to the three datasets, GBIF, FB, and FB_o, and the results of a Kruskal-Wallis (KW) test on the environmental variables of different datasets are shown in Table S1, as are the results of multiple comparison tests between the datasets for each environmental variable.Datasets displayed in paired combinations are those datasets that differed significantly.Significant differences of variables such as watershed, distance to city, and the third principal component of precipitation, are apparent among datasets for all species.Forest, watershed, distance to city, and the first through third principal components of temperature all demonstrated significant differences for most species.Only one variable, distance to road, was not significantly different among datasets.Table S1 also displays the average HSI values of models based on the GBIF, FB, and FB_o datasets for each species, as well as the KW p-value results between datasets and multiple comparison results for each species.All species showed significant differences between HSIs among datasets.Through multiple comparison tests, the average HSIs of FB and FB_o were shown to be significantly different for all species though the average HSIs of FB_o were greater than those of FB.Table 3 shows the Pearson correlation coefficients between species distributions based on the GBIF dataset and alternative combinations of various selected datasets.The SDM output averages based on the GBIF dataset and combinations of the GBIF dataset plus optimal selected or random selected opportunistic datasets are highly correlated.

Note: Cooling rate is 0.5; Professionally collected data from Global Biodiversity Information Facility (GBIF); opportunistic data from Facebook (FB); opportunistic data from optimally selected Facebook dataset (FB_o).
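The Kruskal-Wallis comparison described in this section can be illustrated with a minimal, standard-library-only sketch. It is not the authors' code: the function names and the forest-cover values are hypothetical, and the p-value uses the chi-square approximation, which for exactly three groups (two degrees of freedom) reduces to exp(-H/2).

```python
import math
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_3(group_a, group_b, group_c):
    """Tie-corrected Kruskal-Wallis H and approximate p-value for three groups."""
    groups = [list(group_a), list(group_b), list(group_c)]
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    h, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        h += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * h - 3 * (n + 1)
    # Tie correction: divide H by 1 - sum(t^3 - t) / (n^3 - n).
    tie_counts = {}
    for v in pooled:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    correction = 1 - sum(c ** 3 - c for c in tie_counts.values()) / (n ** 3 - n)
    if correction > 0:
        h /= correction
    h = max(h, 0.0)  # guard against tiny negative rounding errors
    p = math.exp(-h / 2)  # chi-square survival function for df = 2
    return h, p


# Hypothetical forest-cover values at occurrence points of one species.
gbif = [0.82, 0.75, 0.91, 0.68, 0.88]
fb = [0.35, 0.52, 0.41, 0.60, 0.28]
fb_o = [0.71, 0.79, 0.66, 0.85, 0.74]
h_stat, p_value = kruskal_wallis_3(gbif, fb, fb_o)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
```

For real analyses, a library routine such as `scipy.stats.kruskal` would normally be preferred, followed by a multiple comparison procedure to identify which dataset pairs differ.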


Performance and Uncertainty Analysis
Table 4 shows the boxplots of 1000 AUC values of four model predictions (GAM, GLM, Maxent, and SVM) based on five datasets (GBIF, FB, GBIF + FB, GBIF + FB_o, and GBIF + FB_r) under a 0.5 cooling rate. Figures S5 and S6 show the boxplots of 1000 AUC values of the model predictions with cooling rates of 0.3 and 0.4, respectively. The GBIF dataset generally led to the highest AUC variability, i.e., standard deviations, and the lowest median AUC values. In addition, the median AUCs of GBIF + FB_o were the highest among those of the five datasets for Asota egens indica in all models; for Biston perclarus in the GLM, Maxent, and SVM models; for Chrysaeglia magnifica in all models; for Histia flabellicornis ultima in all models; for Lebeda nobilis in GLM, Maxent, and SVM; for Spodoptera litura in GAM and Maxent; and for Traminda aventiaria in all models. The median AUCs of GBIF + FB_o were higher than those of GBIF + FB_r in all cases. In other words, the SDMs based on the GBIF plus FB_o datasets generally outperformed those based on other datasets. The two-sample K-S test results show that, for most species and datasets, the resultant AUC values were significantly different. In addition, Table 5 shows the variation explained by the first PCA component of the 1000 species distributions derived from four SDMs based on five datasets for each species. The average explained variation under the 0.5 cooling rate for the GBIF, FB, GBIF + FB, GBIF + FB_o, and GBIF + FB_r datasets was 0.62, 0.75, 0.81, 0.74, and 0.69, respectively. We observed similar average explained variation findings for cooling rates of 0.3 and 0.4 (Tables S2 and S3 in Supplementary).
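As a rough illustration of how two sets of AUC values like those summarized above could be produced and compared, the sketch below computes AUC as a rank probability, draws bootstrap replicates, and measures the two-sample K-S distance between the resulting AUC distributions. It is a simplified assumption-laden example: the scores are made up, the function names are hypothetical, and the replicates resample model scores rather than refitting an SDM on each bootstrap subset as the study did.

```python
import random


def auc(pos_scores, neg_scores):
    """AUC as the probability that a presence outscores a background point.

    Ties count as one half.
    """
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def bootstrap_aucs(pos_scores, neg_scores, n_boot, rng):
    """AUC values from n_boot resamples (with replacement) of both score sets."""
    values = []
    for _ in range(n_boot):
        pos = rng.choices(pos_scores, k=len(pos_scores))
        neg = rng.choices(neg_scores, k=len(neg_scores))
        values.append(auc(pos, neg))
    return values


def ks_statistic(a, b):
    """Two-sample K-S statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


# Hypothetical suitability scores at presence and background points.
rng = random.Random(42)
presence_a = [0.9, 0.8, 0.75, 0.6, 0.85, 0.7]
presence_b = [0.6, 0.4, 0.55, 0.3, 0.65, 0.5]
background = [0.2, 0.35, 0.1, 0.5, 0.45, 0.3, 0.25, 0.6]
aucs_a = bootstrap_aucs(presence_a, background, 200, rng)
aucs_b = bootstrap_aucs(presence_b, background, 200, rng)
print("median AUC A:", sorted(aucs_a)[len(aucs_a) // 2])
print("median AUC B:", sorted(aucs_b)[len(aucs_b) // 2])
print("K-S distance:", ks_statistic(aucs_a, aucs_b))
```

The sketch stops at the K-S distance; a library routine such as `scipy.stats.ks_2samp` also returns the p-value needed to judge significance.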

Table 4. Boxplots of the AUC values derived from four models (GAM, GLM, Maxent, and SVM) of five datasets (GBIF, FB, GBIF + FB, GBIF + FB_o, and GBIF + FB_r) for nine species.

[...] strictly extracting opportunistic data. Another, more recent joint modeling method presented by Pacifici et al. [4] uses data from structured and unstructured surveys to directly inform SDMs by sharing parameters in jointly estimated likelihoods. Some approaches have even considered other criteria such as the relative competence of volunteers who upload opportunistic data [34,35]. Despite the wide range of available approaches, maximizing likelihood is the most common approach to obtain presence-only data [26]. This study used an SA approach to maximize the average likelihood of a maximum entropy presence-only species distribution model based on NLP-extracted opportunistic data collected by citizen scientists, and successfully identified reliable inputs with respect to the designated benchmark GBIF dataset. More specifically, in contrast to other studies [3,4,10,11], this study proposes and validates an optimal data filtering method that uses SA techniques to extract samples from opportunistic data by removing unrealistic entries as identified by GBIF-based SDMs.

While the validation results suggest that the proposed optimal filtering method successfully selected high-quality data from the opportunistic data by maximizing the likelihood function in the study cases, there are a number of caveats. For example, a basic premise of this optimization technique is that the benchmark data are relatively free of data biases, i.e., in this case the professionally collected GBIF data are free of specimen misidentification, spatial over- or underrepresentation, etc. In addition, there are a number of parameters and settings that can affect the performance and efficiency of SA. In this study, we used an exponential cooling schedule at three cooling rates, all of which Robini and Reissman [45] reported as the most successful. Theoretically, the series of iterations converges to a global optimum as the cooling rates tend to zero. In this study, by using the proposed filtering method at the 0.5 cooling rate, the average species presence probability increased by 2.2 to 10 times. Despite this, the optimally selected sample size at the 0.5 cooling rate is less than the sample sizes identified at the 0.3 and 0.4 cooling rates. The 0.5 cooling rate also produced the highest average rate of increase in the maximum likelihood value.

Uncertainty Analysis
As seen in the variation explained by the first PCA component of species HSI [42,43], uncertainty analysis revealed lower uncertainty in the HSI derived from the professionally collected GBIF data plus opportunistic data from the optimally selected FB dataset (GBIF + FB_o) than in the HSI derived from the GBIF data plus opportunistic data from the randomly selected FB dataset (GBIF + FB_r). Interestingly, the GBIF plus all opportunistic data from Facebook (GBIF + FB) yielded the lowest uncertainty. Given the inverse relationship between uncertainty and sample size, this could be attributable to the greater number of samples from the GBIF + FB dataset used to train the SDMs. As Buisson et al. [46] noted, assessing the uncertainty that data collection introduces into species distribution predictions is crucial, as is comparing the effects of varying dataset sizes [47].
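The uncertainty measure discussed above, the share of variation explained by the first PCA component across replicate HSI maps, can be sketched as the largest eigenvalue of the between-replicate covariance matrix divided by its trace. The example below is an illustrative assumption rather than the study's implementation: the function name and the toy maps are hypothetical, and power iteration is used only to keep the sketch free of external libraries.

```python
import math
import random


def first_pc_explained(maps, n_iter=200, seed=0):
    """Fraction of total variance captured by the first principal component.

    `maps` is a list of replicate maps, each a list of HSI values over the
    same grid cells. A value near 1 means the replicates agree closely.
    """
    k, n = len(maps), len(maps[0])
    centered = []
    for m in maps:
        mean = sum(m) / n
        centered.append([v - mean for v in m])
    cov = [[sum(a * b for a, b in zip(centered[i], centered[j])) / (n - 1)
            for j in range(k)] for i in range(k)]
    trace = sum(cov[i][i] for i in range(k))
    if trace == 0:
        raise ValueError("all maps are constant; variance is zero")

    def matvec(v):
        return [sum(cov[i][j] * v[j] for j in range(k)) for i in range(k)]

    # Power iteration from a random positive start vector.
    rng = random.Random(seed)
    v = [rng.random() + 0.5 for _ in range(k)]
    for _ in range(n_iter):
        w = matvec(v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    w = matvec(v)
    largest = sum(a * b for a, b in zip(v, w)) / sum(a * a for a in v)
    return largest / trace


# Hypothetical replicate maps over five grid cells.
replicates = [
    [0.10, 0.40, 0.80, 0.60, 0.20],
    [0.15, 0.35, 0.85, 0.55, 0.25],
    [0.05, 0.50, 0.70, 0.65, 0.30],
]
print("explained by PC1:", round(first_pc_explained(replicates), 3))
```

With 1000 replicate maps, a numerical library would be far more practical; for example, scikit-learn's `PCA` exposes `explained_variance_ratio_` for this purpose.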

Method Validation
Recently, numerous studies successfully used opportunistic data with various datasets in their biological mapping and species modeling [3,4,10,11,28]. In this study, we used three main validation criteria: (1) the strength of the relationship between datasets and environmental variables; (2) resultant SDM performance; and (3) uncertainty analysis. By using the proposed method to combine professionally collected GBIF data and opportunistic datasets, we obtained larger datasets that resulted in higher performing species distribution estimates with lower uncertainty than those estimates based solely on GBIF. In addition, although the GBIF + FB_o dataset contained more uncertainty than the GBIF + FB dataset, the GBIF + FB_o dataset better reflected species preferences. Our results suggest that the proposed approach improves biodiversity monitoring program data by identifying high-quality citizen science data.
Our results also clearly indicate that the proposed technique increases the number of available data by a factor of up to 2.47 times that of the original professionally collected GBIF dataset at a 0.5 cooling rate. Our approach can be used as a viable tool when combining multiple datasets of varying quality, i.e., a known high-quality dataset combined with datasets of unknown quality. Furthermore, our approach may be particularly useful when used in conjunction with other data integration modeling frameworks, such as that of Pacifici et al. [4], which used integrated data to handle data contamination issues. For example, the Pacifici et al. [4] correlation modeling method may enhance the randomly selected opportunistic data models in this study, since the correlation model can utilize information from disparate datasets and models when parameters between various datasets cannot be shared directly, e.g., because of differing measurement scales or drastic differences in data quality. On the other hand, the optimally selected opportunistic data technique presented here may improve the presence-only Pacifici et al. [4] shared model when the reliability of opportunistic datasets has been verified, i.e., the spatial structures and species-specific environmental variables match the high-quality datasets to a high degree.
Contrary to results obtained by using NLP-extracted FB data alone [3], the records in the optimally selected GBIF + FB_o dataset from this study tend to be located in areas that have higher GBIF-based SDM HSI. For some species, important predictive environmental variables found in the GBIF and FB_o datasets exhibit more similarity with one another than with the FB datasets. For instance, for Asota egens indica, Biston perclarus, Chrysaeglia magnifica, Histia flabellicornis ultima, and Spodoptera litura, the average predictive strengths of the forest cover variable in FB_o datasets are significantly higher than those found in FB datasets. This is presumably due to the positive effects of forest cover on habitats. In contrast, FB_o datasets are less correlated with the distinct watershed area variable than the FB datasets, suggesting less bias associated with easily accessible or frequently visited recreational areas. These results indicate that suitable environmental variables should play a role in the extraction of reliable opportunistic data [10]. Our method demonstrates this when considering the SDM-identified environmental drivers during screening for potentially reliable samples.
SDMs based on the GBIF + FB_o dataset outperformed those based on other datasets in most cases, which reflects two benefits of using the proposed data filtering technique. First, in most cases, i.e., species and SDM model type combinations, the proposed optimal data filtering technique may enable the selection of more biologically meaningful samples, as indicated by SDMs that perform better than those based on GBIF + FB_r. Second, higher performing models may also be a result of larger datasets. In this study, SDMs based on GBIF + FB_o performed better than GBIF-based SDMs. That is, the results indicate that the proposed approach increases SDM sample size and performance, and decreases model uncertainty. However, similar to the results of Munson et al. [27], the GBIF-only dataset-based SDMs and the GBIF + FB_r dataset-based SDMs had similar predictive power in some cases. In addition, the GBIF-based SDM output correlation analysis revealed a high correlation with both GBIF + FB_o and GBIF + FB_r dataset-based SDM outputs. Nonetheless, the differences in spatial structure identified in the Kruskal-Wallis analysis stage of our study, as well as differences in the distributions of the box-plotted AUC values, suggest that a data-extracting procedure is advisable. The AUC values clearly indicate lower median AUC values and higher AUC variability of models based on opportunistic data only versus models based on professionally collected data. The effectiveness of the proposed technique is also apparent when iteration durations are increased, and when the similarities between GBIF-based and FB_o-based model outputs are compared. The three main validation criteria used in this study (the strength of the relationship between datasets and environmental variables, resultant SDM performance, and uncertainty analysis) further support the appropriateness of the proposed filtering method in identifying high-quality data. The disadvantages of non-filtered opportunistic data have also been demonstrated and include: apparent spatial biases for uninformative environmental variables; inclusion of unrealistic observational data with exceedingly low HSI values; lower SDM performance when compared to GBIF + FB_o-based SDMs; and finally, higher SDM uncertainty when compared to equal-sized filtered datasets. If professionally collected benchmark data are relatively unbiased and are representative of the species of concern, the proposed technique can fill gaps in professionally collected datasets. Users should be cautious since the converse is also true: if any biases exist in the benchmark dataset, the proposed technique may only serve to amplify them. Despite this, the proposed optimal filtering technique has the potential to make meaningful contributions to biological conservation and policymaking since it analyzes large citizen science datasets, validates entries, and identifies unbiased records that improve SDM predictions [3].
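The output correlation analysis referred to in this section compares habitat suitability maps cell by cell. A minimal sketch of the Pearson coefficient is given below; the function name and the two flattened HSI maps are hypothetical and serve only to show the calculation.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have the same length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant input")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical HSI values over the same six grid cells.
hsi_gbif = [0.10, 0.35, 0.80, 0.60, 0.25, 0.90]
hsi_gbif_fbo = [0.15, 0.30, 0.85, 0.55, 0.20, 0.95]
print("r =", round(pearson(hsi_gbif, hsi_gbif_fbo), 3))
```

A high coefficient shows only that two maps rank cells similarly; it does not by itself show that the underlying occurrence records are equally reliable, which is why the study also considers AUC and uncertainty.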

Conclusions
Opportunistic data can provide ecologists with additional samples to compensate for data gaps that may exist in the relatively small number of professionally collected, high-quality structured samples available from other sources. Our approach efficiently selected high-quality, opportunistically sourced data using the proposed optimization technique with an automated NLP component, and combined these data with professionally collected GBIF data for modeling moth distributions. We used the Kruskal-Wallis test to analyze the properties of different datasets and the statistical differences between environmental variables and HSI values corresponding to benchmark data, opportunistic data, and filtered datasets. We also addressed the performance and uncertainty in SDM outputs based on different datasets by using a bootstrapping approach to generate random SDM data subsets. Our proposed data filtering method is a tool for filling current data gaps and improving biodiversity monitoring or biological conservation initiatives. By referencing reliable benchmark data, the proposed data extraction technique can garner valuable data from large unstructured datasets, thereby improving ecological data quality and quantity.

Step 2. Calculate the objective function, which is equal to the geometric mean of the predicted presence probabilities based on the professional data and the n opportunistic data.
Step 3. Implement an annealing schedule: generate a uniform random number between 0 and 1. If this number is less than 0.5, add a sample to the n random samples from the rest of the opportunistic data; otherwise, remove a sample from the n random samples at random. Calculate the objective function.
Step 4. Calculate M = exp[−Δ/T], where Δ is the change in the objective function (a comparison between the current and the previous values) and T is the cooling rate (0-1).
Step 5. Generate a uniform random number (rand) in the range of 0-1. If rand < M, accept the new values; otherwise, discard the changes.
Step 6. Repeat Steps 3-5 until either the objective function value falls beyond a given stop criterion (e.g., > a default value) or a specified number of iterations (e.g., 100,000 runs) has been completed.
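The steps above can be sketched in a few lines of Python. This is a hedged illustration, not the authors' implementation, and it rests on several assumptions: suitability scores are taken as fixed numbers in (0, 1] from a benchmark-based SDM; the starting selection (the step that precedes Step 2) is assumed to be a random half of the opportunistic records; Δ is taken as the previous objective minus the candidate objective so that improvements are always accepted; and the stop rule is simply an iteration cap. All function and parameter names are hypothetical.

```python
import math
import random


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def anneal_selection(prof_scores, opp_scores, temperature=0.5,
                     max_iter=10000, seed=0):
    """Select opportunistic records that raise the geometric-mean objective.

    Returns the best selection seen (sorted indices into opp_scores) and
    its objective value.
    """
    rng = random.Random(seed)
    n_opp = len(opp_scores)

    def objective(selection):
        chosen = [opp_scores[i] for i in selection]
        return geometric_mean(list(prof_scores) + chosen)

    # Assumed starting point: a random half of the opportunistic records.
    selected = set(rng.sample(range(n_opp), n_opp // 2))
    current = objective(selected)  # Step 2
    best, best_value = set(selected), current

    for _ in range(max_iter):  # Step 6: iteration cap as the stop rule
        candidate = set(selected)
        unselected = [i for i in range(n_opp) if i not in candidate]
        # Step 3: add a record when the draw is below 0.5, otherwise remove
        # one (falls back to removal when nothing is left to add).
        if rng.random() < 0.5 and unselected:
            candidate.add(rng.choice(unselected))
        elif candidate:
            candidate.remove(rng.choice(sorted(candidate)))
        else:
            continue
        new = objective(candidate)
        # Steps 4-5: delta <= 0 means the candidate is at least as good.
        delta = current - new
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            selected, current = candidate, new
            if current > best_value:
                best, best_value = set(selected), current
    return sorted(best), best_value


# Hypothetical scores: five benchmark records and ten opportunistic ones.
benchmark = [0.9, 0.85, 0.8, 0.95, 0.9]
opportunistic = [0.9, 0.8, 0.85, 0.7, 0.95, 0.05, 0.02, 0.1, 0.01, 0.03]
kept, value = anneal_selection(benchmark, opportunistic, temperature=0.01)
print("kept indices:", kept)
print("objective:", round(value, 3))
```

Because the scores here are fixed, the search tends to discard records whose suitability lies below the running geometric mean; in a setting where the SDM is refit after each change, the objective would need to be recomputed from the refitted model instead.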

Table 2. Observed species locations in FB red (left), GBIF blue (center), and GBIF + FB_o red/blue (right); and habitat suitability distributions based on the above-mentioned datasets for each species.

Table 1. The number of observations in GBIF, FB, FB_o, as well as the proportion of the number of observations in FB_o to the number of observations in FB.


Table 3. Average correlation between species distribution models based on GBIF dataset and alternative combination dataset. Note: Opportunistic dataset from Facebook (FB); Professionally collected data from Global Biodiversity Information Facility plus opportunistic data from Facebook (GBIF + FB); Professionally collected data from Global Biodiversity Information Facility plus opportunistic data from optimally selected Facebook dataset (GBIF + FB_o); Professionally collected data from Global Biodiversity Information Facility plus opportunistic data from randomly selected Facebook dataset (GBIF + FB_r). All correlations are significant at p value < 0.05.
