Present address: U.S. Geological Survey Forest and Rangeland Ecosystem Science Center, 3200 SW Jefferson Way, Corvallis, OR 97331, USA

This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Recently, techniques available for identifying clusters of individuals or boundaries between clusters using genetic data from natural populations have expanded rapidly. Consequently, there is a need to evaluate these different techniques. We used spatially explicit simulation models to compare three spatial Bayesian clustering programs and two edge detection methods. Spatially structured populations were simulated where a continuous population was subdivided by barriers. We evaluated the ability of each method to correctly identify boundary locations while varying: (i) time after divergence, (ii) strength of isolation by distance, (iii) level of genetic diversity, and (iv) amount of gene flow across barriers. To further evaluate how effectively the methods detect genetic clusters in natural populations, we used previously published data on North American pumas and a European shrub. Our results show that with simulated and empirical data, the Bayesian spatial clustering algorithms outperformed direct edge detection methods. All methods incorrectly detected boundaries in the presence of strong patterns of isolation by distance. Based on this finding, we support the application of Bayesian spatial clustering algorithms for boundary detection in empirical datasets, with necessary tests for the influence of isolation by distance.

In spatial ecology, a boundary is a region of abrupt change in a map of biological variables. Boundaries are of interest because their locations reflect underlying biological, physiological or social processes [

Two general families of techniques are currently used to identify boundaries in population genetics [

Edge detection techniques directly identify areas where changes in variables occur from the analysis of allele frequency data [

Effective conservation and management strategies depend on our ability to correctly identify boundaries in genetic data. Although studies have compared the relative performance of Bayesian clustering methods ([

The main objective of this paper is to compare the performance of spatial Bayesian clustering methods with that of edge detection methods for identifying genetic boundaries and to provide recommendations for their use. Our comparison includes three published spatial Bayesian clustering approaches and two direct edge detection methods (

Bayesian clustering approaches simultaneously aim to identify the number of clusters and to assign probabilistically either individuals (versions without admixture models) or a fraction of their genome (with admixture models) to identified clusters such that Hardy-Weinberg and linkage disequilibria are minimized. Numerous models and software programs have been developed to achieve these objectives; although their goals are similar (

The first Bayesian clustering algorithm we consider is BAPS5 [

TESS [

GENELAND [

Monmonier’s algorithm [

Wombling [

In this study, we compared the performance of spatial Bayesian clustering algorithms with that of edge detection methods, including (i) the spatial model of BAPS5, (ii) the non-admixture model of TESS, (iii) the admixture model of TESS 2.1, (iv) GENELAND, (v) Alleles in Space, and (vi) WOMBSOFT (

Standardized simulated data sets were generated using a time-forward Monte-Carlo procedure that encapsulated and generalized core processes and parameters of evolving spatially-structured populations: (1) organisms inhabit a landscape, (2) each organism is born at a landscape location, (3) distances between birth and breeding sites are a function of dispersal ability, (4) progeny genomes are inherited from parents, and (5) alleles inherited from parents can mutate. Given our interest in simulating barriers to gene flow and exploring emergent patterns associated with genetic discontinuities, our simulations were implemented in a two-step process: (1) simulation of an equilibrium global spatial evolutionary process without barriers, followed by (2) imposition of barriers to organismal movement.
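The five core processes and the two-step design described above can be illustrated with a minimal Python sketch. It is not the simulation code used in this study: the one-individual-per-cell lattice, the uniform square dispersal window, the infinite-alleles mutation step, the single locus, and all names and default values (`simulate`, `dispersal`, `mu`, `barrier_from`) are assumptions chosen for brevity.

```python
import random

def simulate(size=20, generations=50, dispersal=2, mu=1e-3,
             barrier_from=None, seed=0):
    """Time-forward simulation of one diploid locus on a size x size lattice.

    Illustrative assumptions: one individual per cell, parents drawn
    uniformly within `dispersal` cells of the birth site, and every
    mutation creating a brand-new allele. From generation `barrier_from`
    onward, parents must come from the offspring's own quadrant.
    """
    rng = random.Random(seed)
    half = size // 2
    grid = [[(0, 0)] * size for _ in range(size)]  # monomorphic start
    next_allele = 1

    def quadrant(x, y):
        return (x >= half, y >= half)

    def pick_parent(x, y, barriers_on):
        # Rejection sampling: redraw until the candidate parent lies on
        # the lattice and, if barriers are active, in the same quadrant.
        while True:
            px = x + rng.randint(-dispersal, dispersal)
            py = y + rng.randint(-dispersal, dispersal)
            inside = 0 <= px < size and 0 <= py < size
            if inside and (not barriers_on
                           or quadrant(px, py) == quadrant(x, y)):
                return grid[py][px]

    for gen in range(generations):
        barriers_on = barrier_from is not None and gen >= barrier_from
        new_grid = []
        for y in range(size):
            row = []
            for x in range(size):
                genotype = []
                for _ in range(2):  # one allele from each of two parents
                    allele = rng.choice(pick_parent(x, y, barriers_on))
                    if rng.random() < mu:  # mutation to a new allele
                        allele = next_allele
                        next_allele += 1
                    genotype.append(allele)
                row.append(tuple(genotype))
            new_grid.append(row)
        grid = new_grid  # non-overlapping generations
    return grid
```

In this sketch, step (1) corresponds to the generations before `barrier_from`, and step (2) to the generations from `barrier_from` onward, when dispersal across quadrant edges is rejected.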

Our simulation procedure was similar to the lattice model implemented in Slatkin and Barton [_{0} = 0, _{0} = 1). However, over the course of many generations, quasi-equilibrium spatial patterns of genetic structure and diversity emerged that reflected specified population sizes and mutation rates. At quasi-equilibrium, spatial distributions of alleles stochastically change over generations, but emergent properties of the system such as number of alleles (

Once simulations achieved quasi-equilibrium states, we imposed barriers to gene flow such that the original 100 by 100 landscape was divided into four separate 50 by 50 landscape subsections (see

We examined five simulation parameter combinations that allowed us to evaluate the effects of average dispersal distances (^{−4}; (2) δ = 11, ^{−4}; (3) ^{−5}; (4) ^{−5}. An additional set of simulations was run with 3% barrier permeability (^{−4}. Twenty-five independent simulation replicates were generated for each parameter combination, and 20 unlinked codominant loci were tracked over the course of each simulation replicate. In the case of a population of 10,000 individuals and ^{−4}, population genetic theory predicts expected values of ^{−5} case, ^{−4} and ^{−5}, respectively. Simulations were run for more generations in the ^{−5} case because, at low mutation rates, populations will take longer to reach an equilibrium state [

As descriptors of pre- and post-barrier simulation states, we used Arlequin 3.11 [F_{ST} values among landscape subregions at each of the six time points described above for each parameter combination. Note that, because our simulations are spatially explicit, F_{ST} values should not be interpreted literally: individual landscape subsections are themselves spatially structured and therefore do not constitute true panmictic populations. Likewise, to describe the pre-barrier degree of spatial genetic structure for each parameter set, we followed the general recommendation of Rousset [^{−4} and

We applied the five methods to the combination of 5 (parameter combinations) × 6 (generation times) × 25 (replicate simulations per parameter combination) = 750 datasets. The values of the parameters used for the statistical analysis are described in

For each data set, we performed 10 independent runs for values of

For Bayesian clustering methods, we considered boundaries to be correctly inferred if the number of spatial clusters estimated was four and if individuals were all correctly assigned to the clusters from their respective landscape region, except for generation time 0. At generation 0, we expected no detectable boundary and for all individuals to be assigned to the same cluster (

For each parameter combination and at each time point, we calculated the percentage (out of the 25 repetitions) of cases where the boundaries (or no boundary at generation 0) had been correctly detected.
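The scoring rule for the Bayesian clustering methods can be sketched as follows. This is an illustrative reading of the criteria stated above, not the scripts used in the study; the function names `boundaries_correct` and `percent_correct` and the list-based inputs are hypothetical. Cluster labels are treated as arbitrary, so only a one-to-one match between inferred clusters and landscape regions is required.

```python
def boundaries_correct(assignments, regions, generation):
    """Return True if one run recovers the expected structure.

    assignments: inferred cluster label per individual
    regions:     true landscape region per individual (same order)
    """
    clusters = set(assignments)
    if generation == 0:
        # Before barriers exist, a single cluster is the correct answer.
        return len(clusters) == 1
    if len(clusters) != 4 or len(set(regions)) != 4:
        return False
    # Four clusters, four regions and exactly four distinct
    # (cluster, region) pairs imply a one-to-one correspondence,
    # i.e., every individual sits in the cluster of its own region.
    return len(set(zip(assignments, regions))) == 4

def percent_correct(replicates, generation):
    """Percentage of replicates (pairs of assignments, regions) scored correct."""
    hits = sum(boundaries_correct(a, r, generation) for a, r in replicates)
    return 100.0 * hits / len(replicates)
```

With 25 replicates per parameter combination, each correctly scored replicate contributes 4 percentage points to the value returned by `percent_correct`.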

For both edge detection methods, we visually checked outputs from each program to determine if boundaries were correctly detected (out of the 25 repetitions performed for each parameter set);

To evaluate the performance of the methods for the empirical datasets, we reanalyzed two previously published datasets and compared our results with those reported in the original papers.

The first dataset consisted of 540 pumas from the southwestern United States (Latitude: 31° to 42°, Longitude: −114° to −103°) genotyped for 16 microsatellite loci by McRae

The second dataset consisted of leaf samples of Rhododendron ferrugineum collected during the summer of 2004 across the entire European Alps (latitude: 44°48′ to 48°36′; longitude: 5°20′ to 15°40′). A 12′ latitude × 20′ longitude (^{2} (see [

Over all simulation parameter combinations (including permeable and non-permeable barriers), the mean percentage of cases where the boundaries were correctly identified was highest for GENELAND and lowest for WOMBSOFT (

At generation 0, all spatial variation should occur across gradients and is due to isolation by distance, and no sharp boundaries should be detected. Our simulations showed significant isolation by distance patterns (larger slopes) with

As expected, the time elapsed (number of generations) following imposition of barriers was highly correlated with F_{ST}

The lower ranking of BAPS5 relative to the two other Bayesian clustering methods resulted from poor performance under parameter combinations with

For the datasets with permeable barriers (

Although TESS and BAPS5 identified boundaries between clusters that were largely consistent with the original studies, results still varied between the two methods, with different numbers of clusters identified for both species. Using the admixture option in TESS increased the number of identified clusters. Results of GENELAND were in both cases almost identical to the results of BAPS5. The WOMBSOFT results were difficult to interpret, identifying large heterogeneous areas, whereas Monmonier’s algorithm never identified biologically interpretable boundaries, instead detecting individuals that were more genetically differentiated from their neighbors than expected (possible migrants).

All Bayesian approaches identified the strong boundary between northern and southern puma populations reported by the original authors (

TESS without admixture detected six spatial clusters among the

In this paper, we compared five methods with different underlying models in their efficiency and reliability for detecting genetic boundaries in both ideal (simulated) and realistic cases. Our simulations allowed us to compare the results from each method against barrier locations known with perfect certainty, and to evaluate effects of factors such as time since barrier creation, mutation rate, dispersal distance, and gene flow across the barrier on the ability of each method to correctly infer boundaries. Our empirical analyses allowed a more realistic test of each method’s performance, introducing complexities commonly found in real-world studies, including complicated and unknown population histories, irregular and ad-hoc sampling schemes (pumas), and genetic markers that do not conform to some of the methods’ assumptions (AFLPs). We do not claim to be exhaustive; rather, we present results from a set of cases carefully chosen to shed light on a rapidly expanding research area.

The most striking result of our simulations was that the spatial Bayesian clustering methods outperformed the direct edge detection methods. While TESS matched or outperformed the other spatial models for datasets with small dispersal distances (

Under our simulation parameters, TESS with admixture generally performed poorly, an unsurprising result given that the imposition of impermeable barriers meant that there was no ongoing migration between subpopulations, and thus admixture was not to be expected. This poor performance can also be explained by the fact that using the admixture option in TESS increases the number of parameters to estimate, reducing the reliability of parameter estimates in the absence of admixture. This was recently tested by François and Durand [

The success of GENELAND and the failure of the other methods on the simulated datasets with permeable barriers suggest that GENELAND is better suited to migration scenarios that may be common in empirical datasets. Movement across barriers, in combination with large dispersal distances, allows strong gene flow across entire landscapes and, as a consequence, produces weak spatial structure that is difficult to detect. The poor results of TESS for permeable barriers can further be explained by its lack of a correlated allele frequencies model; this model better matched our simulations because of the instantaneous population fission scenario we used.

Another general result from our simulations was the presence of time lags between barrier imposition and their reliable detection by all methods (

Our empirical analyses also give more support to the clustering methods than to the direct edge detection methods. Monmonier’s algorithm, using both raw and residual genetic distances for both datasets, only detected boundaries around particular individuals that were genetically highly distinct from others in their immediate vicinity. WOMBSOFT detected, for both empirical datasets, wide heterogeneous areas that were difficult to interpret as boundaries. Although the exact numbers of spatial clusters detected by the Bayesian spatial clustering methods differed for empirical datasets, all of them gave solutions that were consistent with those reported in the original papers. For both datasets, results from GENELAND and BAPS5 matched very closely. For the puma dataset, both methods detected three spatial clusters and the same boundaries between northern and southern clusters as reported by McRae

WOMBSOFT performed particularly poorly with our simulated datasets, and both edge detection algorithms performed poorly with the empirical datasets. While wombling proved its potential in ecology [

Neutral genetic markers do not respond directly to underlying environmental factors. Boundaries detected in data based on species occurrence likely reflect real factors to which the species are responding. For example, an individual sample of plant community data surrounded by an inferred boundary likely indicates a localized area where environmental conditions (e.g., soil type) have resulted in a distinct community of species [

Based on the percentage of correct inference for simulated data (

In natural populations, substructuring of individuals can be caused by influences of isolation by distance, gradients of landscape resistance, and true barriers [

Thanks to Aurélie Coulon, Mark Hewison, Olivier François, Pierre Faubet and Jerko Gunjača for preliminary discussion. Thanks to the INTRABIODIV Consortium for providing the Rhododendron dataset. Thanks to the Fulbright commission for funding SM (travel and stay) at Utah State University and making this work possible. Computational resources used to complete this work were graciously provided by the Utah State University Center for High Performance Computing. We also thank the two anonymous reviewers for their helpful comments.

Results obtained with simulated datasets at generation 2,500. Genetic boundaries were simulated using

(_{ST}

Results for puma dataset. (

Results for Rhododendron dataset. (

Main characteristics of software packages compared in this study.

Model | Spatial Bayesian clustering | Spatial Bayesian clustering | Spatial Bayesian clustering | Non-parametric | Non-parametric |

Analytical and Stochastic methods | Markov chain Monte Carlo | Markov chain Monte Carlo | |||

Spatial | Colored Voronoi tessellation based on discrete sampling site | Hidden Markov random field | Free colored Voronoi tessellation based on continuous Poisson point process | Geographic coordinates are included in the local weighted regression | Delaunay triangulation ( |

Clustering criteria | None | None | |||

Local edge detection criteria | None | None | None | Average rate of change based on individuals located within a kernel of a given bandwidth size | High rate of change between paired individuals based on Delaunay link |

Data | Co-dominant and Dominant | Co-dominant | Co-dominant and Dominant | Co-dominant, Dominant and categorical data | Co-dominant, Dominant, and Sequence Data |

Platforms | Windows, Unix/Linux, Mac OS X | Windows, Unix/Linux | R package: Windows, Unix/Linux, Mac OS X | R package: Windows, Unix/Linux, Mac OS X | Windows |

Reference | [ | [ | [ | [ | [ |

URL |

HWE: Hardy-Weinberg equilibrium

LE: linkage equilibrium

Glossary of technical terms used.

Hidden Markov Random Field | A hidden Markov random field model is a special case of a Hidden Markov Model (HMM). An HMM is a |

| |

Markov chain Monte Carlo | Markov chain Monte Carlo methods are a class of |

| |

Neighborhood graphs | Neighborhood graphs capture proximity between points by connecting nearby points with a graph edge. The many possible ways of determining which points are nearby lead to a variety of neighborhood graph types, such as Voronoi tessellation and Delaunay triangulation. |

| |

Voronoi tessellation | Given a set of |

| |

Delaunay triangulation | The Delaunay triangulation graph connects the adjacent geographical positions of the samples on a map, resulting in a network that connects all the samples. None of the points is inside the circumcircle of any triangle. |
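The empty-circumcircle property stated in the Delaunay triangulation entry can be checked directly on a small point set. The brute-force sketch below is for illustration only and is not how any of the software packages compared here builds its network; the function names are hypothetical, and the points are assumed to be distinct with no four of them lying on a common circle.

```python
from itertools import combinations

def orient(a, b, c):
    """Twice the signed area of triangle abc (> 0 if counter-clockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def in_circumcircle(a, b, c, p):
    """True if p lies strictly inside the circle through a, b and c."""
    if orient(a, b, c) < 0:  # the determinant test needs a, b, c counter-clockwise
        b, c = c, b
    ax, ay = a[0] - p[0], a[1] - p[1]
    bx, by = b[0] - p[0], b[1] - p[1]
    cx, cy = c[0] - p[0], c[1] - p[1]
    det = ((ax * ax + ay * ay) * (bx * cy - by * cx)
           - (bx * bx + by * by) * (ax * cy - ay * cx)
           + (cx * cx + cy * cy) * (ax * by - ay * bx))
    return det > 0

def delaunay_edges(points):
    """Edges (index pairs) of every triangle whose circumcircle is empty."""
    edges = set()
    for i, j, k in combinations(range(len(points)), 3):
        a, b, c = points[i], points[j], points[k]
        if orient(a, b, c) == 0:
            continue  # collinear points do not form a triangle
        others = (points[m] for m in range(len(points)) if m not in (i, j, k))
        if not any(in_circumcircle(a, b, c, p) for p in others):
            edges.update({(i, j), (i, k), (j, k)})
    return edges
```

For four sampling sites arranged as a flat kite, `delaunay_edges([(-3, 0), (0, -1), (3, 0), (0, 1)])` links the two close sites (indices 1 and 3) and omits the long diagonal between indices 0 and 2, because each triangle containing that diagonal has the remaining site inside its circumcircle.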

Input parameters used for Bayesian clustering methods, WOMBSOFT and Monmonier’s algorithm (AIS) in our application.

BAPS5 | 1–6 | 1–8 | 1–10 | |

Number of replications | 10 | 10 | 10 | |

| ||||

TESS | 1–6 | 1–7 | 1–7 | |

0–0.6 | 0.6–1 | 0.6–1 | ||

Number of Sweeps | 10,000 | 100,000 | 100,000 | |

Burn-in period (number of sweeps) | 2000 | 10,000 | 10,000 | |

Number of runs | 10 | 10 | 10 | |

Admixture parameter | Yes and no | Yes and no | ||

| ||||

GENELAND | 1–6 | 1–7 | 1–7 | |

Number of iterations | 50,000 | 100,000 | 100,000 | |

Thinning | 10 | 10 | 10 | |

Number of replications | 10 | 10 | 10 | |

Allele frequencies | Correlated | Correlated | Correlated | |

| ||||

WOMBSOFT | Resolution of the grid | 100 × 100 | 100 × 100 | 34 × 16 |

Bandwidth | 7 | 70 km | 30 km | |

Binomial threshold | 0.3 | 0.3 | 0.3 | |

Statistical significance of the binomial test | 0.05 | 0.01 | 0.05 | |

| ||||

Monmonier’s algorithm | Genetic distances were specified | Residual | Raw and residuals | Raw and residuals |

Number of barriers to be identified. | 4 | 1–7 | 1–7 |

Psi: the interaction parameter of TESS, which can be interpreted as the intensity with which two neighbors are assigned to the same cluster. The higher the value of psi, the more likely it is that the population consists of a single cluster with a high level of genetic continuity.

Admixture model was used although we know that our data have no admixture.
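The role of the interaction parameter psi described in the footnote above can be made concrete with the Potts-type form in which a hidden Markov random field prior on cluster labels is commonly written. The expression below is a sketch under that assumption, not a formula taken from this paper; the symbols $z_i$ (cluster label of individual $i$) and $i \sim j$ (individuals $i$ and $j$ are neighbors in the neighborhood graph) are introduced here for illustration.

```latex
\Pr(z_1, \dots, z_n \mid \psi) \;\propto\;
  \exp\!\Big( \psi \sum_{i \sim j} \mathbf{1}\{ z_i = z_j \} \Big)
```

Under this form, $\psi = 0$ gives every labeling the same prior weight, so that spatial position carries no information, whereas larger values of $\psi$ increasingly favor labelings in which neighbors share a cluster.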

Mean number of alleles, mean gene diversity and mean isolation by distance (IBD) slope observed in analyzed data sets of 200 individuals for each parameter combination. The mean IBD slope was calculated using replicate data sets from the pre-barrier stage (generation 0). Standard deviations for all values are indicated in brackets. Mutation rates are given by μ and average dispersal distances by δ. In parentheses, the percentage of individual tests for each parameter combination that gave significant slopes at the α = 0.05 level is shown.

24.9 [0.59] | 0.86 [0.0088] | 0.265 [0.02] (100%) | |

9.5 [0.68] | 0.62 [0.0073] | 0.202 [0.03] (100%) | |

17.8 [1.28] | 0.79 [0.0182] | 0.004 [0.003] (12%) | |

6.32 [0.36] | 0.49 [0.0098] | 0.002 [0.002] (24%) |

Calculated over generation time and over repetitions.

Calculated only for generation time 0 over repetitions.
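The IBD slope summarized in this table can be illustrated, under one common reading of Rousset's recommendation for two-dimensional habitats, as the ordinary least-squares slope of pairwise genetic distance on the natural logarithm of geographic distance. The sketch below is an assumption-laden illustration rather than the procedure used in the study: the function name `ibd_slope`, the inputs `coords` and `gen_dist`, and the choice of genetic distance estimator (left to the user) are all hypothetical, and the significance test is not shown.

```python
import math
from itertools import combinations

def ibd_slope(coords, gen_dist):
    """OLS slope of pairwise genetic distance on ln(geographic distance).

    coords:   list of (x, y) sampling locations, one per individual
    gen_dist: square matrix (list of lists) of pairwise genetic distances
    Pairs sampled at the same location are skipped because ln(0) is undefined.
    """
    xs, ys = [], []
    for i, j in combinations(range(len(coords)), 2):
        d = math.dist(coords[i], coords[j])
        if d > 0:
            xs.append(math.log(d))
            ys.append(gen_dist[i][j])
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx
```

Because pairwise distances are not independent observations, the significance of such a slope is usually assessed by permuting individuals among locations rather than by a standard regression test.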

Percent correct inferences observed in simulated data for each parameter combination (25 replicates in each case) and for each generation time. At

0 | 0 | 0 | 0 [5.4] | 0 [5.2] | 0 [2.0] | 0 [6.0] | |

100 | 0 | 0 | 0 [5.2] | 0 [4.8] | 0 [5.2] | 0 [6.0] | |

500 | 36 | 0 | 72 [4.3] | 4 [4.6] | 0 [5.2] | 0 [6.0] | |

1000 | 68 | 8 | 84 [4.2] | 52 [4.1] | 4 [5.1] | 0 [6.0] | |

3000 | 96 | 20 | 100 [4.0] | 100 [4.0] | 60 [4.5] | 0 [5.9] | |

5000 | 100 | 24 | 100 [4.0] | 92 [4.0] | 88 [4.1] | 0 [5.9] | |

0 | 0 | 0 | 0 [5.4] | 0 [4.9] | 4 [5.0] | 0 [6.0] | |

100 | 0 | 0 | 0 [5.5] | 0 [4.6] | 0 [5.0] | 0 [6.0] | |

500 | 4 | 0 | 36 [4.7] | 0 [4.6] | 0 [5.2] | 0 [5.9] | |

1000 | 8 | 0 | 68 [4.3] | 12 [4.3] | 0 [5.6] | 12 [5.6] | |

3000 | 40 | 28 | 100 [4.0] | 60 [4.1] | 48 [4.7] | 16 [5.2] | |

5000 | 68 | 40 | 100 [4.0] | 80 [4.0] | 68 [4.4] | 64 [4.5] | |

0 | 0 | 0 | 100 [1.0] | 0 [3.5] | 100 [1.0] | 100 [1.0] | |

100 | 0 | 0 | 0 [2.6] | 0 [3.9] | 100 [4.0] | 0 [1.0] | |

500 | 36 | 32 | 100 [4.0] | 4 [4.0] | 100 [4.0] | 100 [4.0] | |

1000 | 60 | 72 | 100 [4.0] | 32 [4.0] | 100 [4.0] | 100 [4.0] | |

3000 | 96 | 92 | 100 [4.0] | 92 [4.0] | 100 [4.0] | 100 [4.0] | |

5000 | 100 | 88 | 100 [4.0] | 96 [4.0] | 100 [4.0] | 100 [4.0] | |

0 | 0 | 0 | 100 [1.0] | 0 [3.6] | 100 [1.0] | 100 [1.0] | |

100 | 0 | 0 | 0 [1.4] | 0 [3.8] | 96 [4.0] | 0 [1.0] | |

500 | 0 | 0 | 56 [4.0] | 0 [4.0] | 100 [4.0] | 44 [4.0] | |

1000 | 0 | 24 | 88 [4.0] | 4 [4.0] | 100 [4.0] | 96 [4.0] | |

3000 | 40 | 92 | 100 [4.0] | 40 [4.1] | 100 [4.0] | 100 [4.0] | |

5000 | 64 | 100 | 100 [4.0] | 80 [4.0] | 100 [4.0] | 100 [4.0] | |

0 | 0 | 0 | 100 [1.0] | 0 [2.4] | 100 [1.0] | 100 [1.0] | |

100 | 0 | 0 | 0 [1.0] | 0 [2.3] | 100 [4.0] | 0 [1.0] | |

500 | 0 | 0 | 0 [1.9] | 0 [2.3] | 100 [4.0] | 0 [1.0] | |

1000 | 0 | 0 | 0 [1.9] | 0 [2.4] | 100 [4.0] | 0 [1.0] | |

3000 | 0 | 0 | 0 [2.0] | 0 [2.6] | 100 [4.0] | 0 [1.0] | |

5000 | 0 | 0 | 0 [2.0] | 0 [2.9] | 100 [4.0] | 0 [1.0] | |