1. Introduction
Spatial sampling at a scale sufficient to fully represent species distributions, or other wide-ranging patterns, is generally not achievable at present, resulting in varied realizations of distributions [1,2]. Nonetheless, the availability of georeferenced samples needed to model numerous species distributions has increased due to opportunistic sampling by community naturalists [3]. Opportunistic sampling expands the areas sampled, yet at the same time preferentially samples accessible locations, such as those near urban areas, at disproportionate rates relative to other locations [4]. Differential sampling is considered to be sampling bias due to overrepresentation of available yet suitable environmental conditions, with not all conditions being equally and randomly sampled [5]. However, conditions must be sampled for species distribution models to be representative, and sampled areas with correct yet uneven sampling from suitable locations are an advancement compared to unsampled areas [6,7].
Removing adjacent samples to reduce concentrated georeferenced records in clusters has not been demonstrated to be beneficial, due to information loss about conditions suitable for species [
2,
8,
9,
10]. To be clear, localized thinning to reduce clustered records is different from treating regional or continental differentials in sampling intensity, such as the high sampling intensity in Europe for most taxa relative to the rest of the world. Otherwise, samples from suitable conditions can supply information about missing environmental gradients in unsampled locations. For example, concentrated samples near cities or roads may differ in temperature relative to other locations, and urban or road samples also can represent a wide gradient of climate conditions [
11,
12]. The applied species distribution modeling algorithm (e.g., random forests) can extend predictions from conditions in well-sampled (e.g., urban) areas to undersampled (e.g., rural) areas because modeling algorithms perform predictive modeling rather than statistical testing [
13]. Therefore, sampling for predictive modeling can be more opportunistic than the simple random sampling with balanced representation assumed by statistical tests. If entire areas of species distributions are missing from samples, removing or reducing (i.e., thinning) samples from suitable locations to reduce sampling bias will not compensate for the lack of sampling. For small samples, removing samples will result in loss of information necessary to differentiate suitable from unsuitable areas [
8,
9,
10]. That is, the precaution may cause the outcome that is meant to be prevented.
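The information loss from thinning can be illustrated with a minimal sketch of grid-based thinning, assuming one record is kept per grid cell. The function name `thin_by_grid`, the cell size, and the toy coordinates are hypothetical and are not taken from this study; thinning tools differ in how they select which records to keep.

```python
import math


def thin_by_grid(records, cell_deg=0.5):
    """Keep the first record encountered in each grid cell.

    records: iterable of (lon, lat) tuples in decimal degrees.
    cell_deg: hypothetical cell size in degrees (an assumption, not from the study).
    """
    kept = {}
    for lon, lat in records:
        cell = (math.floor(lon / cell_deg), math.floor(lat / cell_deg))
        # Later records that fall in an occupied cell are dropped.
        kept.setdefault(cell, (lon, lat))
    return list(kept.values())


# Toy example: four clustered records near a city plus one isolated record.
records = [
    (-90.10, 38.60),
    (-90.12, 38.62),
    (-90.15, 38.61),
    (-90.11, 38.63),
    (-93.70, 41.20),
]
thinned = thin_by_grid(records)
print(len(records), "records before thinning;", len(thinned), "after")
```

In this toy case, three of the five records are discarded. For a species with few records, the discarded points may carry the only information about local variation in suitable conditions.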
Similarly, the potential for erroneous locations of records has given rise to a standard practice of removing records to avoid issues, typically without any evidence for errors, rather than trying to avoid loss of samples (‘coordinate cleaning’; [
2,
14,
15,
16,
17,
18]). Relatively few coordinates are flagged as potential issues by programs (typically <5% flagged records in the Global Biodiversity Information Facility database; GBIF; [
15,
19,
20]). Of these flagged records, the most frequent errors encompass duplicated or missing coordinates, fossils or unknown types of observations, and older records [
19,
21], which unquestionably should be addressed as part of workflows for processing data, to meet objectives. Records collected before the year 1993 and development of georeferencing technology may be prone to locational errors if records are georeferenced based on imprecise descriptions of collection sites [
22]. Locational errors may arise from mistakes in data entry of georeferenced coordinates, including placement at biodiversity institutions and administrative unit centers (i.e., centers of countries, cities, counties; [
15]). The coordinate cleaning programs flag urban locations, because these locations may identify cultivated individuals or may be locations that previously were wildlands [
15,
16]. Conversely, flagged outliers may represent the missing, unsampled species distributions [
18]. Ocean coordinates also are among the most frequently flagged records [
19,
21]. However, coordinates from georeferenced records for species are likely to be more accurate than the 1:110,000,000 land polygon used to flag ocean coordinates ([
23]; default of 110 million scale documented in Zizka et al. [
15]). Map scales are ratios of map distances to the corresponding distances on the ground, such that even the 10 million map scale printed on paper may be equivalent to 3 km resolution, albeit with finer resolutions along coastlines [
24]. After basic processing of data, true locational errors are probably rare for modern georeferenced records within specified locational uncertainty distances (e.g., 1 km), particularly if from automated georeferencing within current data collection platforms, and not likely to affect species distribution models, given sufficient sample sizes (i.e., at least 300 to 400 records; [
8,
18,
25,
26]).
Sampling bias and coordinate errors have been attributed to be important issues for species distribution models that need to be addressed by removing samples [
2,
17]. However, without any particular foundation, reduction in sample sizes as a preventative measure may result in lack of complete species distribution models (omission errors) or incorrect models outside of distributions (commission errors); therefore, data cleaning beyond basic corrections for unique coordinates of observations during specific years may obscure the samples [
27]. For this study, I assessed the influence of sampling bias and various errors commonly identified by coordinate cleaning packages on species distribution models by modeling 31 mammal species. Each species had at least 1000 georeferenced records before thinning (i.e., enough information from samples for relatively stable models; potential errors and bias were added to thinned records), which provided realized sampling, and also 1000 randomly generated samples from range maps, which provided objective measures of unbiased sampling with imperfect locations [
28]. Relatedly, I contrasted differences between models from thinned and unthinned (i.e., with sampling bias due to concentrated species observations) georeferenced records. The georeferenced records, of which 94% of records were sourced from iNaturalist [
3] in the GBIF database [
20], were a real, not virtual, source of potential errors. Basic processing limited observations to unique terrestrial coordinates with tagged uncertainty of less than 1 km, and few records (<5%) predated 2015. These are recent, high-precision georeferenced records, not museum records. All the randomly generated samples from range maps were sampled without bias yet were locationally uncertain because no species occurrences were in fact sampled. Without empirical observations, these pseudopresence samples are generally correct representations of the extent of occurrence but lack any locational precision for the area of occupancy. I performed the following assessments, holding all else in modeling equal, to examine how locational errors and sampling bias affect species distribution models (
Figure 1).
(1) Incorporation of obviously incorrect locations: I created centroid points from cities and administrative centers [
23], randomly selected 145 points, and then added the same erroneous points to 1000 range map samples for each of 31 species. This was a 13% error rate, which was at least double the <5% flagged records [
15,
19], with points that were not related to any species ranges. I added this type of error only to the range map samples because the range map samples do not contain concern about potential errors embedded in georeferenced records.
(2) Inclusion and exclusion of flagged records: For about 650 georeferenced records out of 126,000 georeferenced records flagged as potential errors (0.2% error rate of a few city and administrative centroids and primarily biodiversity institutions and outliers), I added flagged records to range map samples from each respective species, for an overall 2% error for 31,000 range map samples, varying by species. Conversely, I removed the flagged errors from the 126,000 georeferenced records.
(3) Inclusion and exclusion of sampling bias: I added the urban samples from georeferenced records to random samples from range maps and also removed the urban samples from georeferenced records, artificially splitting each species into a rural (‘country’) subspecies. The sampling bias rate was high, with more urban samples (39,800 records) than randomly generated points from range maps (31,000). I also modeled urban points to demonstrate the extent of climate conditions contained within urban areas.
(4) Addition of adjacent random and clustered samples: To create comprehensive species-specific locational errors and sampling bias of uneven sampling to georeferenced records, I added random points (20% mean error rate, varying by species) and clustered points (40% mean error rate) to georeferenced records up to 35 km distances from the records, which is the maximum distance uncertainty for almost all georeferenced records (e.g., 99%, depending on percentage of records from iNaturalist [
3]). These types of error are relatively comparable to the addition of urban samples from the georeferenced records to range maps. While these added points may seem like distortions, in fact adding random samples near to georeferenced records is conceptually similar to synthetic samples generated by varying nearest neighbors approaches, to increase sample sizes and balance classes [
29].
(5) Differentiation of coordinates on land and at sea: I used two different scales of land area to demonstrate errors associated with this type of coordinate removal, and provided flexible solutions.
Samples from range maps, unthinned georeferenced records (with unique coordinates), and thinned georeferenced records, were modeled using consistent methods aside from eight additions or removals of errors to range maps or thinned georeferenced records.
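Assessment (4) can be sketched as follows, assuming points are drawn uniformly within a 35 km disc around each record. The function names, the flat-earth conversion from kilometers to degrees, and the example coordinates are assumptions made for illustration and do not reproduce the workflow used in this study; clustered points would require a different placement rule.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0
KM_PER_DEG_LAT = 111.32  # approximate length of one degree of latitude


def random_points_near(lon, lat, n, max_km=35.0, rng=None):
    """Draw n points uniformly within max_km of a record (local planar approximation)."""
    rng = rng or random.Random()
    points = []
    for _ in range(n):
        # The square root keeps point density uniform over the disc area.
        r = max_km * math.sqrt(rng.random())
        theta = 2.0 * math.pi * rng.random()
        dx, dy = r * math.cos(theta), r * math.sin(theta)
        new_lat = lat + dy / KM_PER_DEG_LAT
        new_lon = lon + dx / (KM_PER_DEG_LAT * math.cos(math.radians(lat)))
        points.append((new_lon, new_lat))
    return points


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers, used here only to check the offsets."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical record location; a seeded generator makes the example repeatable.
points = random_points_near(-90.0, 40.0, n=200, rng=random.Random(1))
farthest = max(haversine_km(-90.0, 40.0, lon, lat) for lon, lat in points)
print(f"farthest added point: {farthest:.1f} km from the record")
```

The planar approximation is adequate at distances of tens of kilometers away from the poles; a projected coordinate system would be preferable for larger distances.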
4. Discussion
4.1. Muddying the Waters by Cleaning Coordinates
Potential biased sampling and locational errors are avoided by routine removal of samples. However, based on decomposing effects of errors added to samples from range maps and errors removed or added to georeferenced records, sampling and locations of georeferenced records did not produce biased species distribution models. Added errors that occurred within ranges were compatible with the other samples. Even for the poorest, most divergent model, arising from almost a 15% error rate added outside of the species range, the modeling algorithm of random forests could recognize disruptive errors that were added in locations beyond reason for the range, and the modeling algorithm incorporated those administrative centroid samples with uncertainty (predicted probabilities less than 0.75) into models. The georeferenced records did not contain enough errors to influence models, as apparent from models with flagged and urban records transferred to range map samples or from models with flagged and urban records compared to models without the records. Species distribution models proved extremely robust to error, particularly to added points within ranges. Errors and sampling bias do not matter if the climate amplitude of the species is sampled in the erroneous or biased points [
7].
The small amount of locational errors contained in georeferenced records will not influence the outcomes, given enough samples to establish suitable conditions. If errors are inside ranges and compatible with sampled climate conditions, they become samples that help the modeling algorithm differentiate suitable conditions from unsuitable conditions. If errors are outliers to the climate samples, the modeling algorithm will be able to use them to fill in ranges. If errors are not compatible with other samples, they will be discounted because the modeling algorithm maximizes classification accuracy. Even flagrant errors added without regard to species ranges did not perturb primary representations of species ranges. In this study, definite errors were almost 15% of samples from cities or administrative centers without any relation to georeferenced records and most likely beyond error rates in any untampered samples. For the poorest model of a species (0.79 niche overlap between the model with added locational error and model with no error), the modeling algorithm identified with high model certainty (predicted probabilities ≥ 0.75) that the range was limited to the Pacific Coast of the United States and communicated uncertainty about addition of samples scattered throughout the eastern United States. That is, the modeling algorithm was able to differentiate the corrupted samples because they were not compatible with the bulk of the samples, although the model predictions of presence were degraded by predictions outside of the range. Effects of errors depend on the species distributions, error rates relative to sample sizes, and how different the added errors are from the species samples.
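The error rates reported for the administrative centroid points (13% and almost 15%) are consistent with the same 145 points expressed against different denominators. This reading is an assumption made here for illustration, because the text does not state which denominator applies to each figure.

```python
added_errors = 145        # administrative centroid points added per species
range_map_samples = 1000  # random range map samples per species

# Share of erroneous points in the combined sample (errors / all points).
share_of_total = added_errors / (range_map_samples + added_errors)

# Errors expressed relative to the original samples (errors / original points).
relative_to_original = added_errors / range_map_samples

print(f"{share_of_total:.1%}")        # 12.7%
print(f"{relative_to_original:.1%}")  # 14.5%

# Flagged georeferenced records added across all range map samples.
flagged = 650
all_range_map_samples = 31_000
flagged_share = flagged / (all_range_map_samples + flagged)
print(f"{flagged_share:.1%}")         # 2.1%
```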
Rather than discounting outlying samples, the modeling algorithm may determine that outlying samples are compatible with the other georeferenced records, enhancing the species distribution model [
18]. Generally, the addition or removal of the few georeferenced records flagged as potential errors (about 650 records, or 0.2% of all records) made no difference to species distribution models. Nonetheless, removal of outlying points deteriorated models of a few species, resulting in reduced predicted presence within ranges. Particularly for wide-ranging species, outliers may represent critical information, not errors [
18]. Because unsampled areas are an issue for generating representative species distribution models, sampled outliers may be the most important records of the potential errors to keep in terms of species representation. Outliers may help increase the width of climate amplitudes similar to synthetic samples generated by nearest neighbors [
29].
In contrast to georeferenced records that did not contain enough errors to influence models, models were more likely to be biased when samples were removed, due to loss of information about where species occurred. Indeed, species distribution models became less representative after removal of outlying georeferenced records in undersampled areas. Similarly, predicted presence area for one of the modeled species from thinned records was greater than the range area of the species (114%). Although the range map may be too small and captured only 94% of georeferenced records, the unthinned records avoided this error of overprediction and predicted a presence area that was 84% of the range of the species. Even with more than 400 georeferenced records, thinned records may not provide enough information to allow the modeling algorithm to differentiate the climate of the presence records from the climate of the background samples. The unsampling procedures to remove samples decreased information content of input samples, resulting in unclear information delivered to the modeling algorithm. Deliberately withholding information by cleaning and other unsampling approaches can create modeling uncertainty, overpredictions, and underpredictions. Representative models can best be achieved by a greater number of records that increase information about climate conditions for presence and absence classes, even if the records contain sampling bias and locational errors without cleaning [
18,
26,
27,
40].
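The comparison of predicted presence area to range area (e.g., 114% versus 84%) can be sketched with binary grids, assuming equal-area cells and treating the range map as the reference. The function name `summarize_prediction` and the toy grids are hypothetical; as noted above, the range map itself may be too small, so cells counted as commission are not necessarily errors.

```python
def summarize_prediction(predicted, range_mask):
    """Compare a binary prediction grid to a binary range grid (equal-area cells assumed).

    Returns (area ratio, commission cells, omission cells).
    """
    pairs = [(p, r) for p_row, r_row in zip(predicted, range_mask)
             for p, r in zip(p_row, r_row)]
    predicted_cells = sum(p for p, _ in pairs)
    range_cells = sum(r for _, r in pairs)
    commission = sum(1 for p, r in pairs if p and not r)  # predicted outside the range
    omission = sum(1 for p, r in pairs if r and not p)    # range cells not predicted
    return predicted_cells / range_cells, commission, omission


# Toy grids: 1 = present, 0 = absent.
range_mask = [
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]
general_model = [  # broader prediction, as might arise from thinned records
    [1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]
precise_model = [  # narrower prediction, as might arise from unthinned records
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

for name, grid in [("general", general_model), ("precise", precise_model)]:
    ratio, commission, omission = summarize_prediction(grid, range_mask)
    print(f"{name}: {ratio:.0%} of range area, "
          f"{commission} commission cells, {omission} omission cells")
```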
Recognition of the trade-off in information contained by retaining samples and information lost by routine coordinate cleaning and record thinning is important. Given enough input information, the choice to thin samples or not to thin samples and clean coordinates or not clean coordinates resulted in model variation between precision and generality. If removal of samples does not affect representativeness of models, as indicated by visual inspection of predicted distribution compared to range maps, niche overlap, and area predicted as present relative to range maps, then exclusion of samples will result in slightly more general models than models from all samples. This is a small benefit relative to the risk of commission error outside of species ranges due to thinning of samples or omission error due to removal of samples from undersampled areas. Sufficient information content that includes potential errors produces representative models, with a probability of model improvement from greater sample sizes, because locational uncertainty and sampling bias are unimportant relative to sample size [
18,
25,
26,
27,
40,
41,
42].
Thinning samples and cleaning coordinates will not correct lack of information from unsampled areas and may reduce information needed by the modeling algorithm [
8]. Sampling bias may be reasonable to reduce if there is a surfeit of information provided by tens of thousands of samples of slightly different climate conditions to the modeling algorithm, and even necessary if this concentration of information overwhelms the information content from outlying samples, causing the modeling algorithm to discount these locations (specifically for continental sampling gradients between Europe and Asia for Eurasian species). Locational errors present in old records that are georeferenced by imprecise descriptions after collection may be helpful to identify and evaluate. Otherwise without evidence, reducing information through routine filtering and coordinate cleaning may be useless steps at best and detrimental to models if information is lost necessary to differentiate climate of presence samples from climate of the background (i.e., resulting in overpredictions outside of ranges) or more completely sample the distributions (i.e., resulting in underpredictions due to truncation of sampled conditions, including removal of outlying samples; [
18,
26,
27,
40]). To be clear, missing information about the climate of species distributions, rather than sampling bias and locational errors, results in poor, unrepresentative models [
6]. Therefore, removing samples with the intent to avoid potential errors and sampling bias can result in information loss and biased models, with added commission or omission error.
4.2. Locational Uncertainties and Concentrated Samples Rather than Error and Bias
In terms of sampling bias, the addition of urban and clustered samples changed the specificity of the species distribution models. Urban areas represent a wide range of climate, and modeling algorithms can use information from urban records to fill in unknown areas (i.e., models of climate from urban samples predicted ‘presence’ areas, or areas with similar climate greater by a factor of 25 than urban extents;
Figure 6; [
7]). Urban samples were representative of sampling bias, and remained extremely similar in comparison to unperturbed species distributions from range maps and georeferenced records. Despite spatial concentration, urban records captured meaningful ecological gradients. Nonetheless, due to concentration of samples, addition of urban records from georeferenced records to range map samples pushed species distribution models closer in similarity to the specificity of species distribution models from georeferenced records. Likewise, removal of urban records from georeferenced records pushed species distribution models closer in similarity to the generality of species distribution models from range maps. Random (20% error rate) and clustered (40% error rate) points added near georeferenced records had the same effects of diffuse and concentrated samples. The concentrated, clustered points pushed the predictions closer on the spectrum to the models from unthinned georeferenced records; likewise, the random points pushed the predictions closer on the spectrum to the models from thinned georeferenced records.
Assuming sufficient input information about climate within species ranges, thinned records will offer slightly different representations and characteristics than unthinned records, a trade-off between precision and generality quantified directly by area of predicted presence, which is a concrete measurement that demonstrates precision or generality in area of predictions. Greatest precision for model predictions required use of unthinned georeferenced records, for which the concentrated samples corroborated validity and the predictions were precise to sample locations, without generalizing to greater areas. Treating sampling bias involves removal of the concentrated samples by thinning samples, and results in more uncertain, generalized predictions, which are potentially erroneous by including commission error. Nevertheless, species distribution models from georeferenced records were extremely similar, specifically as judged by a mean niche overlap value of 1.0, with no niche overlap value less than 0.99 by species. This overlap in models is likely greater than model divergence generated by application of other modeling options (e.g., modeling algorithms) or the varied realizations of range maps developed by different expert opinions.
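This section does not restate how niche overlap was calculated. As a minimal sketch, assuming Schoener's D (one commonly used overlap statistic, chosen here for illustration and not confirmed as the metric of this study), two suitability surfaces are each normalized to sum to 1 and then compared cell by cell. The function name and toy values are hypothetical.

```python
def schoeners_d(suit_a, suit_b):
    """Schoener's D between two suitability surfaces given as flat sequences.

    D = 1 - 0.5 * sum(|a_i / sum(a) - b_i / sum(b)|); 1 is identical, 0 is no overlap.
    """
    if len(suit_a) != len(suit_b):
        raise ValueError("surfaces must cover the same cells")
    total_a, total_b = sum(suit_a), sum(suit_b)
    if total_a <= 0 or total_b <= 0:
        raise ValueError("each surface needs a positive total suitability")
    return 1.0 - 0.5 * sum(abs(a / total_a - b / total_b)
                           for a, b in zip(suit_a, suit_b))


# Toy predicted probabilities for the same six cells from two models.
unthinned = [0.9, 0.8, 0.7, 0.2, 0.1, 0.0]
thinned = [0.9, 0.8, 0.6, 0.3, 0.1, 0.0]
print(f"niche overlap (Schoener's D): {schoeners_d(unthinned, thinned):.2f}")
```

Because each surface is normalized before comparison, the statistic reflects the relative distribution of suitability across cells, so two models can overlap strongly while still differing in total area predicted as present.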
Characterizations of precision, sampling bias, and concentration on one end of a gradient and generality, locational errors, and dispersion at the other end of a gradient help to interpret the influences of potential locational errors and sampling bias. Species distribution models from concentrated georeferenced records are more locationally correct than species distribution models from thinned georeferenced records. The concentrated georeferenced records corroborated locational correctness of the georeferenced records and also limited generalizing model predictions to additional areas, resulting in decreased areas of potential locational error of predictions, relative to thinned georeferenced records. Referring to localized, clustered samples and sample concentration, rather than sampling bias, differentiates that predictive modeling is not hypothesis testing. Concentrated georeferenced records do have the greatest sampling bias, in the sense that correctly sampled locations have greater weight than lesser-known locations [
8]. Sampling bias in this regard is reduced by thinning georeferenced records and consequent model generalization to greater areas, but at the risk of known commission error, sometimes with potential overpredictions from species distribution models that exceed range areas. Unthinned georeferenced records create species distribution models characterized by precision, fit to samples of species occurrences, sampling bias or concentrated distributions relative to species distribution models from thinned records, which are characterized in terms of generality, reduced fit to samples, locational error or uncertainty, and dispersed distributions.
Addressing locational errors, which may result in commission error and model generalization, and sampling bias, which may result in omission error and model concentration, are conflicting objectives. All of the random samples from range maps were locational errors because the randomly generated samples have no evidence of correct observation, beyond generally being located within species ranges. Despite 100% locational errors, no sampling bias occurred as samples were completely random, and species distribution models were representative of species ranges, as developed by experts. In contrast, georeferenced records, particularly with a coordinate uncertainty of less than 1 km, are as locationally correct as possible, but with uneven sampling and unsampled areas. Given enough samples, georeferenced records allow generation of alternative representations of species ranges, the areas of occupancy, which are more precisely located and concentrated than general, continuously dispersed range maps.
4.3. Removal of Non-Terrestrial Points as Part of the Coordinate Cleaning Workflow
Intentional coordinate cleaning to simply remove duplicated coordinates and differentiate land areas is straightforward to add to workflows (
Figure 7). The major concerns, aside from coordinates in urban areas, are duplicated information, collection year, basis of record (i.e., observations, preserved specimens, fossils, unknown), and separation of land from water [
19]. Processing data to remove duplicated coordinates by species and fossils and unknown records is basic. Collection years likewise can be specified easily. Older records from museums probably contain errors but these require careful examination to understand the specific issues rather than automatic cleaning. Similarly taxonomic resolution also entails attention to preserve and reconcile at least primary synonyms, due to deprecation or unresolved and conflicting taxonomies, rather than simply removing samples of species with different scientific names in favor of one accepted name.
While terrestrial species with samples located in oceans, or water in general, seems to be the most straightforward concern related to coordinate cleaning to resolve, in actuality, this issue is complicated by the lack of resolution between water and land. The georeferenced records had a coordinate uncertainty less than 1 km, which is less than the coordinate uncertainty or resolution of map layers at 1:10,000,000 or 1:110,000,000 scales [
24]. Overall, then, the locations of the georeferenced coordinates are likely to be more accurate than the resolution of the land layers. In this study, adding a 2 km buffer to a 1:10,000,000 land layer retained most georeferenced records (99.8%), as opposed to losing 2% or 3% of all records with an unbuffered 1:10,000,000 or 1:110,000,000 land layer, respectively. While a loss of 2% to 3% of all records may not be critical, the proportion of lost records will be greater for species with ranges along coastlines.
Indeed, even if exclusion of aquatic records for terrestrial species is not a particular interest, as a practical matter, if the study extent, such as the North American continent, contains a boundary between land and ocean, then the coastline will act as the border to retain or remove records. Delineation of coastlines based on resolution will affect the number of records kept in the study extent. Addition of a buffer is necessary to avoid unintentional sample exclusion along coastlines, and the buffer could match the maximum distance uncertainty of the georeferenced records (i.e., about 35 km).
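The effect of buffering a land layer can be sketched with plain geometry, assuming projected coordinates in kilometers and a single land polygon. The function names, the square "land" polygon, and the record locations are hypothetical; an actual workflow would use a GIS library and real coastline data.

```python
import math


def point_in_polygon(x, y, polygon):
    """Ray-casting test for a point inside a polygon given as a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def distance_to_segment(x, y, x1, y1, x2, y2):
    """Shortest planar distance from a point to a line segment."""
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        return math.hypot(x - x1, y - y1)
    t = ((x - x1) * dx + (y - y1) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(x - (x1 + t * dx), y - (y1 + t * dy))


def on_land_with_buffer(x, y, polygon, buffer_km=2.0):
    """Keep a record if it is on land or within buffer_km of the coastline."""
    if point_in_polygon(x, y, polygon):
        return True
    n = len(polygon)
    return any(
        distance_to_segment(x, y, *polygon[i], *polygon[(i + 1) % n]) <= buffer_km
        for i in range(n)
    )


# Toy land mass: a 100 km square. Records: inland, 1.5 km offshore, 10 km offshore.
land = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0), (0.0, 100.0)]
records = [(50.0, 50.0), (101.5, 40.0), (110.0, 40.0)]

for buffer_km in (0.0, 2.0, 35.0):
    kept = [r for r in records if on_land_with_buffer(*r, land, buffer_km)]
    print(f"buffer {buffer_km:>4} km: {len(kept)} of {len(records)} records kept")
```

The buffer distances of 2 km and 35 km follow the values discussed in the text; the choice between them trades retention of coastal records against retention of records that may truly be offshore.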
It may be desirable to also remove georeferenced records (and predictions) located in inland water, but perhaps limited to relatively stable large lakes. With large lakes removed, lakes and rivers are 5% of the North American land surface [
43]. Lakes are not monolithic, and often have islands or inlets with trees, including in large lakes [
44]. Rivers can be extremely dynamic, with oxbow formation and other processes. Many rivers are not correctly delineated in land covers, particularly coarse ones (e.g., [
45]). Applying a coarse-scale map to account for <5% of land area would introduce unknown errors and be misleading, in that the water area would appear to be removed when in fact water remained and land was removed. The error may be magnified in future predictions if inland water levels change with shifting precipitation and increased evapotranspiration under warmer temperatures while sea level rises. If agricultural intensification occurs in Canada, water sources and hydrological networks may be modified through tile drainage to move water away from crops and terrestrial land, changing water locations [
46]. Making an adjustment to identify inland water beyond large lakes may be only an appearance of an adjustment, due to spatiotemporal dynamics. Moreover, unless resolution is very fine (e.g., 10 m), multiple cover classes, including intermixed land and water, may be contained within one pixel. Assignment of one class to pixels with heterogeneous classes is a limitation of coarse-resolution data sources, along with accuracy of classes [
47].
5. Conclusions
Species distribution models using georeferenced records or random samples from range maps are both robust to perturbation by concentrated samples and locational uncertainties, given enough samples for the modeling algorithm to differentiate suitable from unsuitable conditions. Locational uncertainties inside species ranges are compatible with sampled climate conditions. If locational uncertainties are outliers in relation to the climate samples, the modeling algorithm will be able to use them to fill in the climate amplitude of species ranges and differentiate suitable conditions from unsuitable conditions. If the flagged records are errors that are not compatible with other samples, they will be discarded or discounted by the modeling algorithm. Procedures to remove samples decrease information content of input samples, resulting in unclear information being delivered to the modeling algorithm and in resultant omission and commission errors. The risks of poor models due to unclear samples of suitable conditions after sample removal outweigh the benefits of limited generalization of models with thinned georeferenced records. Rather than removing samples unnecessarily to avoid potential errors, the major prerequisite for species distribution models is that researchers deliver clear information, by not cleaning samples beyond what is necessary, to allow the modeling algorithm to separate presence from absence classes.
In modeling, precision, sampling bias, and concentration characterize one end of the gradient and generality, locational errors, and dispersion characterize the other end of the gradient. To clarify, models with precise representation from concentrated samples (rather than sampling bias), with limited locational and commission error, represent one side of the modeling spectrum. Model generalization from thinned samples or samples from range maps, with limited sampling bias and omission error yet locational uncertainties (rather than errors), represents the other side of the spectrum. Randomly generated points from range maps are an extension of the concept of thinning. In fact, all of the random samples from within ranges may be locationally incorrect at fine resolutions but have no sampling bias and models are general representations of species ranges. Different realizations occur when depicting any landscape patterns, varying from models of comprehensive range coverage (generality, with locational/commission error) or models of the recorded observations (species area of occupancy, with omission error). Species distribution models require complete information available from samples, despite data cleaning and thinning becoming common procedures in the modeling of species distributions, which may result in different inferences that shape ecological interpretations and conservation outcomes.