Muddy Boots Beget Wisdom: Implications for Rare or Endangered Plant Species Distribution Models

Abstract: Species distribution models (SDMs) are popular tools for predicting the geographic ranges of species. It is common practice to use georeferenced records obtained from online databases to generate these models. Using three species of Phaedranassa (Amaryllidaceae) from the Northern Andes, we compare the geographic ranges as predicted by SDMs based on online records (after standard data cleaning) with SDMs of these records confirmed through extensive field searches. We also review the identification of herbarium collections. The species' ranges generated with corroborated field records did not agree with the species' ranges based on the online data. Specifically, geographic ranges based on online data were significantly inflated and had significantly different and wider elevational extents compared to the ranges based on verified field records. Our results suggest that to generate accurate predictions of species' ranges, occurrence records need to be carefully evaluated with (1) appropriate filters (e.g., altitude range, ecosystem); (2) taxonomic monographs and/or specialist corroboration; and (3) validation through field searches. This study points out the implications of generating SDMs produced with unverified online records to guide species-specific conservation strategies since inaccurate range predictions can have important consequences when estimating species' extinction risks.


Introduction
There has been a rapid increase in the accessibility of species occurrence data via the World Wide Web [1][2][3][4]. International programs such as the Global Biodiversity Information Facility [5] and Species Link [6], among others, have gathered data from different sources (e.g., herbarium and museum collections), allowing users to rapidly obtain massive amounts of data. These data are often used to generate species distribution models. Species distribution models (SDMs) are estimates of the geographic range of species on the basis of relationships between the known occurrence of the species and underlying environmental factors such as precipitation, temperature, and seasonality, among others [7]. In general terms, SDMs depend on three main components: (1) species occurrence data; (2) environmental variables, and (3) statistical models. The accuracy of the SDMs' predictions will be influenced by the precision of these three components. The potential applications of SDMs include conservation planning for the protection of rare and endangered species, prediction of invasive species propagation, estimates of niche evolution, reserve selection and design, and predictions of species' distributions under different past and future climate change scenarios [8][9][10][11][12][13][14][15][16].
For the conservation of rare and endangered species, the usefulness of SDMs can be reduced by limitations inherent to these types of species. For instance, many rare and endangered species are known from few geographic records. It has been proposed that to obtain accurate models of a species' geographic distribution, a minimum of 20 records per species is required [17,18]. To overcome this limitation, averages of ensembles of small models are used to estimate priority areas for conservation [19,20]. Even the conservation of individual species can benefit from SDMs: new populations have been found in areas where SDMs suggested the presence of the species, even with models estimated with as few as five records [21,22].
Furthermore, for species occurrence data, there are many issues related to biological data acquired from large online databases that may limit their value in SDMs [23,24]. For instance, historical records from herbarium specimens often lack accurate geographical coordinates, information that is essential for building SDMs [25]. Geographic coordinates can be derived from the locality data available on records using maps, gazetteers, or software [26,27]. Unfortunately, in many cases the geographical information found on the specimen labels is vague, and it might be difficult to assess to what extent errors were made in the original record. It has been suggested that Geographic Information System (GIS) analysis [28] and the use of environmental filters could help to screen for errors in species occurrence data [15]. Another potentially important but greatly underappreciated source of error in collection records is taxonomic misidentification, which calls for validation by experts because misidentifications will clearly affect the ability of SDMs to accurately portray species' ranges [29][30][31].
The aim of this study is to evaluate the impact of different species occurrence data sources on SDM estimates. We focus on the potential problems that can arise from generating SDMs using unverified species records as available from large online databases since this is common practice. Toward this goal, we compared SDM range predictions for three species of the plant genus Phaedranassa from the Northern Andes. We generated SDMs with standard species occurrence data available online through the Global Biodiversity Information Facility (GBIF) and other sources and compared them to range predictions generated using records validated through directed field searches and re-evaluation of specimen taxonomic identity. By comparing the two sets of range estimates, we hope to highlight the importance of data quality and taxonomic verification in SDMs. Our second goal is to show the influence of the range predictions obtained from these two types of data for conservation assessment.

Study Species
Phaedranassa (Herb.) is a small genus of the Amaryllidaceae family. This genus is known by eleven species that, except for one species from Costa Rica, are limited to moist slopes and dry valleys in the Andean mountains of Colombia and Ecuador [32]. Out of the eleven Phaedranassa species, eight are native to Ecuador with seven of these being endemic to the country [33]. All the Ecuadorean endemic Phaedranassa species are classified as either "endangered" or "vulnerable to extinction" under the International Union for the Conservation of Nature (IUCN) criteria [34]. Only P. dubia has been collected in a natural reserve. Our study focuses on three species: P. cinerea, P. schizantha, and P. dubia.
Both P. cinerea and P. schizantha are endemic to Ecuador, whereas P. dubia has been reported in Colombia, too [32].

Collection of Occurrence Data
We gathered online records from the Global Biodiversity Information Facility (GBIF) [5], as is common practice to estimate SDMs. These data (hereafter referred to as "database" records) came mostly from individual herbarium collections at (1) the Missouri Botanical Garden [35], (2) the University of Aarhus (Denmark) [36], (3) the New York Botanical Garden [37], (4) the Royal Botanic Gardens, Kew [38], and (5) the University of Florida Herbarium [39]. We also included records from the virtual herbarium of the Herbario Nacional de Bogotá [40]. Database records were derived from the labels of herbarium specimens, and they include species name and locality. If geographical coordinates were available at GBIF, they were included with no modification. Once the database records were obtained, we took additional common-practice steps to ensure that our database set had the best possible quality. Specifically, we excluded any record for which the geographic coordinates were obviously incorrect (e.g., located in a large body of water) or located in a country other than Ecuador (for the two species endemic to Ecuador). Two records from before the year 1900 were excluded because of uncertainties. Sixty-six percent of the database records were collected after 1950. In order to increase the amount of usable data, the listed localities of records without coordinates were georeferenced using regional maps. Georeferenced records of threatened species (sensu IUCN) are not provided in some online databases to prevent over-collection. Thus, coordinates for specimens of endangered species of Phaedranassa located in the TROPICOS database [35] were not available through GBIF [5] and were obtained directly through personal communications with staff from the herbarium of Missouri Botanical Garden (MO). The location coordinates were improved by adjusting them using Google Earth as reference [41].
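The basic cleaning steps described above can be sketched as a simple record filter. This is only an illustration, not the workflow used in the study: the `Record` fields and the `ENDEMIC_TO_ECUADOR` set are hypothetical names chosen for the example, and the check for points falling in large bodies of water is omitted because it would require a land mask.

```python
from dataclasses import dataclass
from typing import Optional

# Species treated as endemic to Ecuador in the text (abbreviated names are an assumption).
ENDEMIC_TO_ECUADOR = {"P. cinerea", "P. schizantha"}


@dataclass
class Record:
    """Hypothetical minimal occurrence record."""
    species: str
    lat: Optional[float]
    lon: Optional[float]
    country: str
    year: Optional[int]


def keep_record(rec: Record) -> bool:
    """Return True if a record passes the basic filters described in the text."""
    # Records without coordinates are set aside here; in the study they were
    # georeferenced manually from the listed localities.
    if rec.lat is None or rec.lon is None:
        return False
    # Coordinates must be valid decimal degrees.
    if not (-90.0 <= rec.lat <= 90.0 and -180.0 <= rec.lon <= 180.0):
        return False
    # Endemic species should not be recorded outside Ecuador.
    if rec.species in ENDEMIC_TO_ECUADOR and rec.country != "Ecuador":
        return False
    # Records collected before 1900 were excluded because of uncertainties.
    if rec.year is not None and rec.year < 1900:
        return False
    return True
```

A filter of this kind only removes obvious problems; as the rest of the study shows, records that pass it can still carry georeferencing or identification errors.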
Our second database comprised a corrected version of the database records. The correction process involved the taxonomic revision of each collection's voucher and confirmation of the actual presence of the species at the field locations listed in the database records. The field searches, conducted by one of us (N.H.O.) between 1999 and 2009, were originally done as part of an extensive study of the population genetics of the genus [42]. All 60 Ecuadorean locations found in databases were visited at least twice to collect the species. During collections, the geographic position was recorded on site with a Garmin eTrex GPS unit. Taxonomic identification of the records was confirmed by examining each voucher listed in the latest taxonomic treatment of the family [32]. In addition, more recent specimens were inspected personally by the author at the Herbario de la Pontificia Universidad Católica del Ecuador (QCA), Herbario Nacional del Ecuador (QCNE), and MO herbaria and identified to the species level using the keys in Meerow's treatment [32]. The specimens from Colombia are digitally available online; we evaluated the species determination of those specimens from the digital images.

Model Building
Species distribution models (SDMs) for the three Phaedranassa spp. were generated through ensemble models [43] implemented in BIOMOD 2.0 [44], a platform available in R that combines different modeling predictions to derive consensus models. Ensemble models are increasing in popularity for species distribution modeling because they avoid the need to choose a single modeling technique. Four techniques were used to build the ensemble models: MAXENT [45], generalized linear models (GLM) [46], gradient boosting machine (GBM) [47], and multivariate adaptive regression splines (MARS) [48]. These four techniques have been considered to produce more satisfactory results than alternative techniques [49]. Default parameters were used for all four techniques. Ensemble models were produced using the proportional weighted mean of probabilities option, in which the predictive ability of each individual modeling method determines its proportional contribution. Specifically, we used the true skill statistic (TSS) of each individual model to assign its proportional contribution [50].
For this study we used WorldClim Bioclimatic variables [51,52] with a spatial resolution of 30 arc seconds (~1 km² at the Equator), as this is a common approach for SDMs. We removed correlated variables to avoid collinearity among the predictors, and generated models using the following variables: mean diurnal range (bio 2), isothermality (bio 3), temperature seasonality (bio 4), max temperature of warmest month (bio 5), precipitation of wettest month (bio 13), precipitation of driest month (bio 14), precipitation seasonality (bio 15), precipitation of warmest quarter (bio 18), and precipitation of coldest quarter (bio 19). Ensemble models are continuous probability maps that were transformed into a predicted binary map of potential presence versus absence of the species using a threshold approach. As we used presence-only data, we chose a threshold that minimizes the commission error, that is, predicting the presence of the species where it is not present. We allowed a commission error of 0.05 [49]. Modeling was conducted using 10,000 randomly generated pseudo-absences for each species.
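The idea of weighting each technique by its predictive ability can be illustrated for a single grid cell. The sketch below uses a plain proportional weighting (each weight is the model's TSS divided by the sum of all TSS values); the function name is hypothetical, and BIOMOD's own implementation may differ in its details.

```python
from typing import Sequence


def weighted_mean_ensemble(probabilities: Sequence[float],
                           tss_scores: Sequence[float]) -> float:
    """Combine per-model suitability values for one grid cell, weighting by TSS."""
    if len(probabilities) != len(tss_scores):
        raise ValueError("one TSS score is needed per model")
    total = sum(tss_scores)
    if total <= 0:
        raise ValueError("TSS scores must sum to a positive value")
    # Each model contributes in proportion to its share of the total TSS.
    weights = [t / total for t in tss_scores]
    return sum(w * p for w, p in zip(weights, probabilities))
```

With equal TSS values the result reduces to the simple mean of the four predictions; a model with a higher TSS pulls the consensus toward its own prediction.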

Model Evaluation
We evaluated the predictive performance of each model by using a combination of three commonly used statistics: the receiver operating characteristic (ROC), the true skill statistic (TSS) [53], and the Boyce index [54]. We also incorporated a jackknife validation for small sample sizes [55]. We chose to evaluate the models with more than one statistic because measuring model predictive ability has been controversial, and the use of several statistics has been recommended [56]. The ROC curve shows the relationship between the false positive rate and the true positive rate [7] and is usually summarized as the area under the curve (AUC). The AUC is widely used because it is a threshold-independent measure that gives the probability that a randomly selected presence site will receive a higher predicted probability than an absence site chosen at random [7]. The TSS is a threshold-dependent statistic that has been proven superior for comparing binary models because it is independent of prevalence [53]. Both ROC and TSS values were obtained within BIOMOD 2.0, using the random pseudo-absences as part of the process. Finally, the Boyce index is a threshold-independent index that has been proposed to assess model performance of presence-only modeling methods [54]. The Boyce index was calculated for each ensemble model using the ecospat library [57] with default parameters.
Ideally, the evaluation process should be conducted using a set of data independent of the one used for the modeling. However, this was not possible because of the low number of available presences: setting some of them aside for evaluation would have left too few for model training, which would reduce model performance. This situation is commonly found when modeling in tropical areas [4,50]. To overcome this situation, we implemented a third step in the evaluation based on the jackknife validation approach proposed by Pearson et al. [55]. This approach represents an alternative to cross-validation with an independent dataset when fewer than 25 occurrences are available. It tests whether occurrences are correctly predicted as presences by binary models more often than expected at random [55].
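To make the two BIOMOD statistics concrete, the sketch below computes the TSS from a confusion matrix (sensitivity plus specificity minus one) and the AUC directly from its probabilistic definition. The function names are hypothetical, and the pairwise AUC loop is written for clarity rather than speed.

```python
from typing import Sequence


def true_skill_statistic(tp: int, fn: int, tn: int, fp: int) -> float:
    """TSS = sensitivity + specificity - 1, from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # proportion of presences predicted present
    specificity = tn / (tn + fp)  # proportion of absences predicted absent
    return sensitivity + specificity - 1.0


def auc(presence_scores: Sequence[float], absence_scores: Sequence[float]) -> float:
    """Probability that a random presence outscores a random absence (ties count half)."""
    wins = 0.0
    for p in presence_scores:
        for a in absence_scores:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5
    return wins / (len(presence_scores) * len(absence_scores))
```

Note that with presence-only data the "absences" passed to either function are pseudo-absences, so both statistics measure discrimination between known presences and background sites rather than true absences.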

Conservation Assessment
A common application of SDMs is to predict the impact of climate change on the potential range size and survivorship of focal species [58,59]. With this potential application in mind, we further evaluated the effect of using unverified data versus verified data in generating SDM range predictions. We compared the total area and elevational range of the distributions for each species obtained with each of the two datasets. Elevation ranges were based on the 95% quantiles of elevations within the predicted ranges as obtained from the altitude layer available in the WorldClim database [51].
The estimated distributional area for each species was evaluated against the IUCN criterion "extent of occurrence" (i.e., the area containing all the known or projected sites of current occurrence) for threatened species [60]. A taxon qualifies as "critically endangered" when its extent of occurrence is <100 km², "endangered" when it is <5000 km², and "vulnerable" when it is <20,000 km². For this analysis, we used only P. cinerea and P. schizantha because the conservation status of endemic Ecuadorian species has been evaluated [34]. Phaedranassa dubia has also been reported from Colombia, and its global conservation status needs to be evaluated with information from both countries, which is not available at this point.
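The area thresholds quoted above translate into a simple lookup. The sketch below considers only the extent-of-occurrence thresholds stated in the text and ignores any other conditions of the IUCN criteria; the function name and the label for areas above the largest threshold are assumptions made for the example.

```python
def eoo_category(extent_km2: float) -> str:
    """Map an extent of occurrence (km^2) to the category given by the area thresholds."""
    if extent_km2 < 0:
        raise ValueError("extent of occurrence cannot be negative")
    if extent_km2 < 100:
        return "critically endangered"
    if extent_km2 < 5000:
        return "endangered"
    if extent_km2 < 20000:
        return "vulnerable"
    # Label chosen for this sketch; the area threshold alone is not met.
    return "not threatened by extent of occurrence"
```

Because the category depends only on which side of a threshold the predicted area falls, an inflated range prediction can move a species across one or more categories.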

Results
Our results are based on a total of 66 records from databases and 34 records confirmed in the field (Table 1). We found 13, 27, and 20 records in the databases available online for P. cinerea, P. dubia, and P. schizantha, respectively. Out of those records, we could confirm the physical presence of the species in situ for 10, 14, and 10 records for P. cinerea, P. dubia, and P. schizantha, respectively. Fifty percent of the coordinates in the databases differed from those gathered in the field. We considered a locality to be different if the coordinates were separated by at least one kilometer; most of the disagreements involved a larger separation. Taxonomic errors (misidentifications) were found in 14 records (21.21%) in the databases. Taxonomic issues were most frequent in P. dubia (37%).
The model performance as judged by TSS, AUC, and the Boyce index was high for the models (Table 1). Notably, the total geographic area predicted for each species was smaller when the model was generated using verified data (Figure 1; Table 1). For P. cinerea and P. dubia, the predicted range areas were over 10 times larger when based on the database versus verified records, and for P. schizantha, the predicted range area was 20 times larger when based on the database versus verified records (Figure 1; Table 1). Likewise, the SDMs generated from the online database records resulted in significantly different and generally wider altitudinal range predictions for each species compared to the altitudinal range predictions generated with the verified data (Table 1). Based on the projected extents of occurrence using the distribution models with online databases, P. cinerea does not qualify as threatened under IUCN criteria (Table 1). In contrast, when the field-verified data were used, P. schizantha (859 km²) and P. cinerea (2584 km²) qualify as "endangered." Considering only the geographic distribution of P. dubia in Ecuador (4396 km²), this species would also qualify as "endangered." Interestingly, the models did not predict a potential distribution of P. dubia in Colombia, and we did not find specimens of that species from Colombia.
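The one-kilometer rule for deciding whether a database coordinate and a field coordinate refer to different localities can be sketched with a great-circle distance. The haversine formula and the mean Earth radius used here are choices made for this illustration; the text does not state how the separation was measured.

```python
import math
from typing import Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_different_locality(database_xy: Tuple[float, float],
                          field_xy: Tuple[float, float],
                          min_separation_km: float = 1.0) -> bool:
    """True if two (lat, lon) points are at least min_separation_km apart."""
    return haversine_km(*database_xy, *field_xy) >= min_separation_km
```

At the 30 arc second resolution of the climate layers, a separation of one kilometer corresponds roughly to moving into a neighboring grid cell, which in steep Andean terrain can mean a substantially different climate.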

Discussion
Online occurrence data available in venues such as the GBIF can contribute to a better understanding of geographic and ecological patterns in species distributions. However, the use of unverified data can potentially lead to inaccurate results and conclusions that are based on the predicted distributions. To highlight the need for careful verification, we compared range predictions of SDMs using data harvested directly from online databases (with standard levels of data filtering) to those obtained with data that has undergone more scrutiny (i.e., taxonomic and field verification).
Our results point out the necessity to carefully review and verify records from databases prior to executing modeling studies, especially in studies of a single species or in cases where sample size is limited. In this study, we found that the distribution models varied greatly on the basis of the data used even after accounting for differences in sample size. These disagreements are likely attributable to several different types of errors present in the non-verified online data. For example, in one instance we found a collection record indicating that a specimen of P. cinerea had been collected from the Ecuadorian páramo (vegetation >4000 m), at higher altitudes than had previously been reported for this species. After further research (which involved contacting a member of the botanical team that originally collected the plant specimen), we found that there was an error in the record's label and that the specimen had been collected at a different location on the way to the páramo site. The label information on the specimen corresponded to the primary working area of the researchers and not the collection area of that particular plant. Along with this example, our results suggest that the geographic coordinates provided in the collection records were likely to be simple approximations of the actual collection locations with potentially large errors, which can be especially important for species that occur in topographically complex areas such as the Andes [15].
In this study, we dealt primarily with two types of errors: georeferencing errors and taxonomic misidentification. To minimize georeferencing errors in species occurrence data, filters using known features of the species range, such as altitude or ecosystem, can help to identify erroneous species records [4]. One potentially large source of error can be identified by reading the actual specimen labels. In this study, we found one record of a living specimen of P. schizantha whose geographic coordinates corresponded to the address of the botanical garden where it is cultivated rather than to the natural origin of the specimen. By chance, this erroneous record fell within the natural altitudinal range of the species. Thus, automatic altitude-based filters may not point out the error.
The effects of taxonomic misidentifications have received considerably less attention. In our experience, taxonomic misidentification was more frequent in one species, P. dubia, probably because this species has the largest geographic distribution and collectors have identified specimens using the most common species name. Furthermore, we did not find P. dubia in Colombia, because the available specimens are P. ventricosa. This has important conservation implications because, based on its geographic distribution, P. dubia should be listed as endangered. Checking for taxonomic errors requires an evaluation of the physical herbarium specimen (or at least a digital image) by a specialist. Thus, checking for taxonomic errors cannot be easily automated and may be difficult to accomplish for large-scale, multi-taxa research. Taxonomic errors can be reduced by using records reported in recent taxonomic monographs because the taxonomic identification of the specimens has been verified by experts. Based on our experience, we can expect that the level of error of SDMs will be higher for species without recent taxonomic revisions.
A notable result of our comparisons is that the areas of the range predictions for the three studied species produced with the database records were always larger than those obtained with verified records. Because we have searched extensively for the genus in the region for more than a decade, we can say that the geographic distribution predicted with online data does not represent reality. The consistent inflation of range areas is likely the result of the georeferencing and taxonomic errors described above, which, in general, will expand estimates of species' ranges [15]. The inflation of range predictions has very important conservation implications. The Ecuadorian endemic species P. cinerea and P. schizantha are currently listed as "vulnerable" B1ab(iii) [34]. If we were to believe the distributions as predicted from SDMs based on the online data, then both species would be removed from the threatened category. In contrast, the range predictions based on SDMs using the verified data indicate that P. cinerea and P. schizantha have a range extent of <5000 km² and, thus, may qualify as "endangered." In Ecuador, P. dubia would also be considered "endangered" for the same reasons, and because we have not found specimens of the species from Colombia, this category is probably correct.
Beyond geographic extent, altitudinal range is an important feature of species distributions because of the connection between elevation and temperature and the fact that species with narrow altitudinal ranges are generally predicted to be more sensitive to climate-driven extinction than altitudinal generalists [61,62]. For example, if we convert the differences we found in altitude ranges predicted using the database versus verified records into differences in thermal niche breadths (assuming an adiabatic lapse rate of ~5.5 °C colder per km of elevation gain), for our most restricted species, P. schizantha, there is a difference of 5.1 °C in the estimates for the lower altitudinal limit of the species and a difference of −4.3 °C in the upper altitudinal limit, such that the thermal range of the species based on the online records is >9 °C wider than predicted with the verified records. In other words, the differences in thermal niche limits and breadths predicted for the study species on the basis of the different data sources are greater than some of the worst-case warming scenarios predicted for the Andes over the next century [63]. This result shows the danger of extrapolating the results of SDMs produced with unverified online records to species' extinction risks linked to climate change.
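The conversion from elevation differences to thermal differences used in this paragraph is a single multiplication by the lapse rate. The sketch below reproduces the arithmetic for P. schizantha using only the values quoted in the text (5.5 °C per km, 5.1 °C, and −4.3 °C); the function names are hypothetical.

```python
LAPSE_RATE_C_PER_KM = 5.5  # approximate lapse rate assumed in the text


def temperature_offset_c(elevation_difference_m: float) -> float:
    """Temperature difference (°C) implied by an elevation difference (m)."""
    return LAPSE_RATE_C_PER_KM * elevation_difference_m / 1000.0


def extra_thermal_breadth_c(lower_limit_diff_c: float, upper_limit_diff_c: float) -> float:
    """Extra thermal breadth when the lower limit is warmer and the upper limit colder."""
    return lower_limit_diff_c - upper_limit_diff_c


# Values for P. schizantha quoted in the text: +5.1 °C at the lower limit
# and -4.3 °C at the upper limit of the predicted altitudinal range.
extra = extra_thermal_breadth_c(5.1, -4.3)
print(f"Thermal range is about {extra:.1f} °C wider with the online records")
```

The two offsets add up because the database-based range extends both downslope into warmer conditions and upslope into colder ones, which is why the combined difference exceeds 9 °C.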
One of the main problems for SDMs of tropical plant species, endangered species in particular, is the scarcity of openly available species occurrence records [4]. When conducting our study, we found additional specimen records in local herbaria in the country of origin that were not included in the online databases (these records were not incorporated in our analyses). An exemplary program that bypasses this problem is the partnership between the Missouri Botanical Garden herbarium and the Herbario Nacional (QCNE, Quito, Ecuador) through quick data and specimen sharing. Exchange programs between international institutions and local herbaria are greatly needed to provide a relatively inexpensive way to increase species representation in online databases.
Species distribution models are useful and powerful tools for conservation. Our results show that it is essential to address the limitations of the species occurrence data in order to avoid erroneous conclusions and posterior extrapolations. Data quality matters and records downloaded from online sources do not have the certainty and precision that comes with careful taxonomic revisions and fieldwork.

• Be critical of each occurrence record. For the conservation of rare and endangered species, each point should be checked.
• Check the taxonomic identification of all vouchers that share the same collector and collection number. Sometimes specialists curate one of those duplicate specimens but not all, and because GBIF combines data from multiple herbaria, the same collection can be identified as different species.
• Plot the coordinates on a map. It is an easy way to identify outliers.
• Use filters such as altitude or habitat when that kind of information is known for the species.
• Read the original specimen label. Important annotations might not be available in the online database.
• Besides online data, complement the number of records with information from recent taxonomic monographs, databases from permanent plots, and local herbaria. Collaborate with taxonomic specialists! Their understanding of the species is invaluable.
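Two of the checks in the list above lend themselves to simple screening before expert review. The sketch below flags collections whose duplicates carry different species names and records that fall outside a known elevation range; the tuple layout, function names, and example values are hypothetical, and neither check replaces inspection of the specimen and its label.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def conflicting_duplicates(
    records: Iterable[Tuple[str, str, str]]
) -> Dict[Tuple[str, str], Set[str]]:
    """Find collections (collector, number) that are identified as more than one species.

    Each record is a (collector, collection_number, species_name) tuple.
    """
    names: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for collector, number, species in records:
        names[(collector, number)].add(species)
    # Keep only collections with conflicting determinations.
    return {key: found for key, found in names.items() if len(found) > 1}


def outside_elevation_range(elevation_m: float, min_m: float, max_m: float) -> bool:
    """True if a record lies outside the elevation range known for the species."""
    return elevation_m < min_m or elevation_m > max_m
```

As the cultivated P. schizantha record discussed earlier shows, a record can pass an elevation filter and still be wrong, so flagged and unflagged records alike benefit from a look at the original label.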