1. Introduction
The speed of beneficial mutations on their way to fixation in natural populations is a fundamental topic in population genetics. Knowing how fast selection can change allele frequencies is essential for understanding evolution. The time for a beneficial allele to spread through a natural population was investigated early in the history of population genetics, using deterministic models [1]. Later, properties of the fixation time of individual advantageous mutations under positive directional selection and genetic drift in populations of finite size were derived by several authors [2, 3, 4].
However, the spread of two or more beneficial mutations that arise and interact during their fixation process has been investigated much less. For two interacting mutations, Otto and Barton [5] studied the case in which the second mutation is less beneficial than the first one. In contrast, Cuthbertson et al. [6] and Bossert and Pfaffelhuber [7] considered the scenario in which the second mutation is fitter than the first one. Furthermore, they assumed that the first and second mutants may recombine such that the recombinant type has the highest fitness and eventually fixes. Here we focus on this latter case.
We consider the mathematical analysis of the fixation time in conjunction with the theory of selective sweeps. Although beneficial mutations are a comparatively small fraction of all new mutations, some of them may reach fixation and are thus important in evolution. If the fitness effects of these beneficial mutations are sufficiently strong, they may cause selective sweeps, i.e., localized reductions of genetic variation along genomes [8]. Such localized patterns of reduced genetic variation have been convincingly described in a variety of organisms.
Detecting signatures of selective sweeps in genomes is a major goal of current population genetics, as it allows estimating the rate of beneficial mutations going to fixation and finding the genes involved in selection. The inference methods for detecting sweeps depend critically on whether the beneficial mutations are assumed to occur sequentially (such that at most one beneficial allele on a chromosome is on the way to fixation at a time) or whether beneficial alleles overlap with each other. Models of recurrent selective sweeps traditionally assume that in chromosomal regions of normal recombination rates, at most one beneficial allele is on the way to fixation [9, 10, 11].
Here we follow the scenario proposed by Bossert and Pfaffelhuber [7]. Thus we assume that, while a highly beneficial mutation spreads in a natural population, a second beneficial mutation arises before the first one has fixed. Furthermore, we envision that the first mutation is less fit than the second one and that recombination may occur between the two mutations. Under these conditions a haplotype may be formed that is fitter than the two individual mutations and may therefore eventually fix. To model this process for a population of finite size, Bossert and Pfaffelhuber [7] used stochastic differential equations and calculated the fixation time of a recombinant haplotype under the assumption that it fixes. In contrast, we use a deterministic approach based on ordinary differential equations (ODEs). Thus, in our analysis an explicit assumption about the fixation of a recombinant is not necessary.
We begin by formulating the differential equations for the basic allele frequency changes. Then we introduce nonnormalized variables (that are proportional to the allele frequencies) to find approximate solutions of this system of ODEs. Subsequently, we provide explicit formulas for the fixation time of the recombinant type. Finally, we apply our results to simulation data by Chevin et al. [12] to describe patterns of selective sweeps in the genome caused by the joint fixation of two mutations due to selection and recombination.
2. Model
We consider a two-locus model, with alleles A and a at locus 1 and B and b at locus 2, respectively. The upper case letters denote beneficial alleles with selection coefficient
at locus 1 and
at locus 2, whereas the alleles with lower case letters are assumed to be neutral (wildtype). This model has four haplotypes AB, Ab, aB, and ab, with frequencies given by the variables X1, X2, X3, and X4 (which add up to 1). Assuming additive selection, their relative fitnesses are
,
,
and 1, respectively, where
. Recombination between locus 1 and locus 2 occurs at rate
r. Since we are interested in closely linked loci, we assume that
for
i = 2, 3. In our deterministic setting (without genetic drift), the ODEs for the time change of the variables Xi are obtained by adding the change due to selection and the change due to recombination ([13], chapt. 2):
where t measures time in generations.
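As a rough illustration of how such a system can be explored numerically, the following sketch integrates the textbook continuous-time haploid two-locus equations, in which each haplotype frequency changes by selection, Xi(mi − m̄), plus or minus r times the linkage disequilibrium D = X1·X4 − X2·X3. This form is an assumption taken from standard theory and may differ in detail from Equation (1). The names `s_a`, `s_b`, `two_locus_rhs`, `rk4_step`, and `integrate`, as well as all parameter values, are hypothetical.

```python
def two_locus_rhs(x, s_a, s_b, r):
    """Time derivative of the haplotype frequencies x = (AB, Ab, aB, ab).

    Assumed textbook form: additive (Malthusian) selection plus recombination.
    """
    m = (s_a + s_b, s_a, s_b, 0.0)                     # fitness relative to ab
    m_bar = sum(mi * xi for mi, xi in zip(m, x))       # mean fitness
    d = x[0] * x[3] - x[1] * x[2]                      # linkage disequilibrium
    signs = (-1.0, 1.0, 1.0, -1.0)                     # recombination erodes d
    return tuple(xi * (mi - m_bar) + sg * r * d
                 for xi, mi, sg in zip(x, m, signs))


def rk4_step(x, dt, s_a, s_b, r):
    """Advance the frequencies by one classical Runge-Kutta step of size dt."""
    def shifted(k, h):
        return tuple(xi + h * ki for xi, ki in zip(x, k))

    k1 = two_locus_rhs(x, s_a, s_b, r)
    k2 = two_locus_rhs(shifted(k1, 0.5 * dt), s_a, s_b, r)
    k3 = two_locus_rhs(shifted(k2, 0.5 * dt), s_a, s_b, r)
    k4 = two_locus_rhs(shifted(k3, dt), s_a, s_b, r)
    return tuple(xi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + e)
                 for xi, a, b, c, e in zip(x, k1, k2, k3, k4))


def integrate(x0, s_a, s_b, r, t_end, dt=0.05):
    """Return the haplotype frequencies after t_end generations."""
    x = tuple(x0)
    for _ in range(int(round(t_end / dt))):
        x = rk4_step(x, dt, s_a, s_b, r)
    return x


# Hypothetical repulsion start: A at frequency 0.1 on ab, B new on ab (1/N).
n = 10_000
start = (0.0, 0.1, 1.0 / n, 0.9 - 1.0 / n)
final = integrate(start, s_a=0.05, s_b=0.1, r=0.001, t_end=1000.0)
```

With r > 0 the recombinant AB is created from Ab and aB and eventually dominates; with r = 0 it never appears, because in repulsion no AB haplotype is initially present.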
For
, which we assume throughout this paper, an exact solution of this system of ODEs is not known. An approximate solution can be obtained using nonnormalized variables
Yi (i = 1, …, 4) (see [14] for mutation-selection and [15] for recombination-selection equations). These are related to the original variables of Equation (1) as:
Xi = Yi / (Y1 + Y2 + Y3 + Y4), i = 1, …, 4. (2)
Using (2) it can be shown that the nonnormalized variables satisfy the following ODEs:
Note that rescaling all Yi by a constant leaves the Xi invariant. To fix this scaling, we use
for all
i = 1, …, 4. As outlined in Appendix A.1, the following approximate solutions of the ODEs (3) can be found, assuming that both mutations A and B arise on background ab (so that
:
These approximations were first established and tested by Yun Song (personal communication). Numerical analysis suggests that they are generally excellent for /10. In the following we refer to the scenario described by Equation (4) as the repulsion case. Similar equations can be obtained for the association case, for which we assume that B occurs on the same chromosome as A; i.e., and
3. Fixation Times
Using Equation (4), we can find the fixation time of two interfering mutations. We call an allele or haplotype fixed when it reaches frequency 1 − ε and denote this time T. Thus, T measures the time elapsed from some initial frequency at t = 0 until frequency 1 − ε is reached. The initial frequency may be the frequency of a newly arising mutation. In a haploid population of size N, which we consider here, this initial frequency is given by 1/N. In our case, however, we are interested in the fixation of the double mutant AB, whose initial frequency is 0, as AB arises during the fixation process due to recombination. Thus, the fixation time of AB is found by solving the equation
X1(T) = 1 − ε, (5)
where X1 is the frequency of AB and ε is a small number.
Next we express Equation (5) in terms of nonnormalized variables and obtain
Using (4) with
and
, this equation can be approximated by
Evaluating the terms on the right-hand side of Equation (6) requires that we find useful approximations of the integrals in Equation (4) because closed formulae for these integrals are not known (see Appendix A.2). In addition to the two aforementioned assumptions, we assume that
is sufficiently large such that the second mutation is eventually dominating the first one, i.e.,
, and population size is large
such that selection is strong
Under these assumptions we find for the integrals
defined in
Appendix A.2:
Inserting these formulas into Equation (6), this equation can be written in the following form:
We can neglect the first term on the right-hand side of Equation (8) for the following reasons: first, because we assumed that
and second since
is bounded by
the term
can be neglected compared to
. This leads to Equation (9):
Finally, we introduce population size
into this equation by writing
and
, where
is the number of
alleles at
Then solving the equation for
yields an explicit expression for the fixation time in the repulsion case:
Table 1 shows that this result agrees very well with the numerical solution of Equation (8).
The first five columns show the simulation data: the frequency of the first mutation when the second mutation is introduced; relative genetic diversity at a neutral locus between loci 1 and 2 near locus 1 (here ‘relative’ refers to expected diversity under neutrality); relative genetic diversity at a neutral locus in the middle between loci 1 and 2; relative genetic diversity at a neutral locus between loci 1 and 2 near locus 2; Tajima’s measure of the deviation of the level of variation from neutrality at the locus in the middle. is the fixation time from Equation (10) measured in generations, and is obtained by solving Equation (8) numerically. The parameter values are:
As expected, the formula for
is complex. However, in the interesting parameter range of small
values such that
, we may approximate Equation (10) as
This approximation works very well for small introduction frequencies
, i.e.,
and 0.024 for the parameter values given in
Table 1. The first term on the right-hand side of Equation (11) equals the fixation time of a new allele starting at frequency
and ending at
, driven by positive directional selection with selection coefficient
. This term also appears in the result of Bossert and Pfaffelhuber [7]. The denominator
can be explained as follows. To go to fixation, the successful recombinant
AB with fitness
has to compete against the—at the time—dominant
aB type with fitness
having a fitness advantage
Furthermore, unless
is very small, the second term in Equation (11) is negative such that
is smaller than
. This is not surprising, as we are dealing here with an equation describing continuous input of new
alleles due to recombination, similar to the case of fixation under continuous mutation pressure and positive directional selection ([16], Equation (8)). Thus, based on Equation (11), we obtain a relatively simple formula for the threshold of
above which recombination speeds up fixation time. For
, however, the second term in Equation (11) turns positive, such that the fixation time
becomes larger than
, meaning that the input of recombinants ceases.
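For orientation, the fixation time of a single allele under positive directional selection can be sketched as follows. The sketch uses the standard deterministic (logistic) result for a haploid population, dx/dt = s·x·(1 − x), and is not claimed to reproduce the exact form of the first term of Equation (11). The function names and the values of `n`, `eps`, and `s` are hypothetical.

```python
import math


def sweep_time(s, x_start, x_end):
    """Generations for a haploid allele with advantage s to go from x_start to x_end.

    Follows from the logistic equation dx/dt = s * x * (1 - x) (no drift).
    """
    odds_ratio = (x_end * (1.0 - x_start)) / (x_start * (1.0 - x_end))
    return math.log(odds_ratio) / s


def logistic_frequency(s, x_start, t):
    """Frequency after t generations of deterministic logistic growth."""
    growth = x_start * math.exp(s * t)
    return growth / (1.0 - x_start + growth)


# Hypothetical values: new mutant at 1/N, fixation declared at 1 - eps.
n, eps, s = 10_000, 0.01, 0.1
t_fix = sweep_time(s, 1.0 / n, 1.0 - eps)
```

For these illustrative values the sweep takes roughly 138 generations, and the time scales inversely with the selection coefficient, which is why the fitness advantage in the denominator controls the overall time scale.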
In the association case the fixation time can be derived in a similar way. However, an explicit expression for the fixation time as in Equation (10) cannot be obtained. Instead, we arrive at a transcendental equation that in general can only be solved numerically.
4. Genomic Footprint of Competing Mutations
In this section, we analyze simulation data from a study of genetic variation at neutral sites located between two selected loci [12]. The data were obtained using Monte Carlo simulations of a Wright-Fisher model ([13], chapt. 3) with two selected loci and three neutral loci. The three neutral sites are located between the selected loci as described in Table 1.
Table 1 also contains the parameter values used in the simulations. They meet the assumptions of our analysis, except for the population size. In the simulations
was used, while in our derivation
was suggested. To check whether this causes problems, we compared the analytical results for
T from Equation (10) with the numerical solutions of Equation (8) for
However, as
Table 1 (columns 6 and 7) shows, no discrepancies could be found.
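To make the simulation setting more concrete, the sketch below shows one generic way to iterate a haploid two-locus Wright-Fisher model with selection, recombination, and multinomial sampling. It is not the program used by Chevin et al. [12]: the neutral loci are omitted, and the function names, the seed, and all parameter values are hypothetical choices for illustration.

```python
import random


def wright_fisher_step(counts, s_a, s_b, r, rng):
    """One generation for haplotype counts [AB, Ab, aB, ab] in a haploid population."""
    n = sum(counts)
    w = (1.0 + s_a + s_b, 1.0 + s_a, 1.0 + s_b, 1.0)   # additive fitnesses
    weighted = [wi * c for wi, c in zip(w, counts)]
    total = sum(weighted)
    x = [v / total for v in weighted]                   # frequencies after selection
    d = x[0] * x[3] - x[1] * x[2]                       # linkage disequilibrium
    x = [x[0] - r * d, x[1] + r * d, x[2] + r * d, x[3] - r * d]
    draws = rng.choices(range(4), weights=x, k=n)       # multinomial sampling = drift
    return [draws.count(i) for i in range(4)]


def simulate(counts, s_a, s_b, r, generations, seed=1):
    """Iterate the step function and return the list of count vectors."""
    rng = random.Random(seed)
    history = [list(counts)]
    for _ in range(generations):
        history.append(wright_fisher_step(history[-1], s_a, s_b, r, rng))
    return history


# Hypothetical repulsion start in a population of 2000 haploids.
history = simulate([0, 200, 20, 1780], s_a=0.05, s_b=0.1, r=0.001, generations=300)
```

Averaging the first generation at which the AB count exceeds a chosen threshold over many seeds would give a stochastic counterpart of the deterministic fixation time, under the assumptions stated above.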
The first observation concerns T as a function of . Since was used in all simulations and in all cases is larger than the threshold (Equation (12)), we expect that T is smaller than and decreases with increasing . This is indeed the case. The most pronounced effect of on fixation time is observed for small values of For larger values of , however, fixation time is relatively constant. This observation is consistent with the formulas for T, especially Equation (11), which shows that, for given, fixation time depends logarithmically on . This formula also says that recombination is most important in speeding up fixation when the second mutation is introduced at low values. Here the interference between the two mutations is largest.
Next we discuss the simulation results of Chevin et al. [12] in light of our analysis. Since these simulations assume that the introduction of the second mutation into the population occurs in repulsion or association with A, depending on the frequency of A, we averaged the times to fixation for both cases. Average fixation time exhibits the same features as the fixation time for the repulsion case alone: a steep decay of the fixation time for small
and a rather constant level for larger introduction frequencies. This is because the fixation times in the association case are substantially shorter than in the repulsion scenario and show a narrow range from 104.8 generations for
to 97.0 generations for
This shows very clearly that the fixation time is largely determined by the repulsion phase.
Regarding variation at the neutral loci, Chevin et al. report that the two loci close to the selected ones show typical hitchhiking effects [8]; i.e., variation is reduced relative to the neutral standard level such that stronger selection acting at locus 2
leads to a greater reduction than at the neutral site near locus 1. Furthermore, variation at the neutral locus in the middle between the two selected loci is greater than that expected for selection at a single locus with selection coefficient
or
. Such an excess of neutral variation between two selected sites was also observed by Kim and Stephan [17]. It is essential for our analysis and will be further discussed below.
Increasing
generally leads to stronger hitchhiking effects such that levels of neutral variation decrease with
. This can be clearly observed at the neutral locus close to locus 1, whereas at the neutral locus close to the stronger selected site, this is hardly visible. At the neutral locus in the middle there is also a strong decay of variation with increasing levels of
. The effect of
on hitchhiking is likely due to the interference of the two mutations. The longer they compete with each other on their way to fixation, the weaker their hitchhiking effect. This has already been observed in other studies (e.g., [17]).
Finally, we discuss D, a statistic introduced by Tajima [18]. In Table 1 (column 5) only the D values for the neutral locus in the middle between loci 1 and 2 are shown. All D values at the other two loci are negative, as expected from the theory of genetic hitchhiking. A negative D is observed when alleles are at lower or higher frequencies than expected under the neutral theory. Interestingly, however, Chevin et al. [12] observed strongly positive D values for
and 0.077, whereas
D is around zero or negative for larger
Positive values of D indicate that alleles are at intermediate frequencies, such as predicted for balancing selection. In our case, however, this is probably not a valid hypothesis, at least concerning the standard models of balancing selection. A plausible hypothesis proposed by Bossert and Pfaffelhuber [7] is that positive D may be observed when a haplotype structure arises in the genome through recombination between different haplotypes consisting of multiple polymorphic loci. Haplotype structures exist in populations only if polymorphisms at individual loci tend to be at intermediate frequencies (such that the less frequent variants are not too rare). This may be the case for
and 0.077, but not for the larger
values, for which diversity is more heavily reduced (Table 1). An alternative, though related, hypothesis postulates that the two selected mutations (while in repulsion) reach nonnegligible frequencies at similar times such that recombination may produce haplotypes with the two favorable alleles in coupling [12].
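The link between allele frequencies and the sign of D can be illustrated with a small sketch that computes Tajima's D [18] from a sample of aligned sequences coded as 0/1 strings. The constants follow Tajima's standard definition; the function name and the two toy samples are hypothetical and are not taken from the simulations.

```python
import math


def tajimas_d(haplotypes):
    """Tajima's D for a list of equal-length 0/1 strings (one per sampled chromosome)."""
    n = len(haplotypes)
    # Count of allele "1" at every site; keep only segregating sites.
    site_counts = [col.count("1") for col in zip(*haplotypes)]
    site_counts = [k for k in site_counts if 0 < k < n]
    seg_sites = len(site_counts)
    if n < 4 or seg_sites == 0:
        raise ValueError("need at least 4 sequences and one segregating site")

    # Mean number of pairwise differences (pi).
    pairs = n * (n - 1) / 2.0
    pi = sum(k * (n - k) for k in site_counts) / pairs

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1.0) / (3.0 * (n - 1.0))
    b2 = 2.0 * (n * n + n + 3.0) / (9.0 * n * (n - 1.0))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2.0) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)


# Toy samples of 10 chromosomes and 4 sites each (hypothetical data).
two_haplotype_blocks = ["1111"] * 5 + ["0000"] * 5            # intermediate frequencies
rare_variants = ["1000", "0100", "0010", "0001"] + ["0000"] * 6  # singletons only
```

The sample made of two equally frequent haplotype blocks yields a positive D, whereas the sample containing only singletons yields a negative D, in line with the interpretation given in the text.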
If these hypotheses are correct, a genomic footprint of competing beneficial mutations may be detected by measuring Tajima’s D and/or linkage disequilibrium. In general, footprints associated with selective sweeps caused by the fixation of beneficial mutations can be found in genetic data if their characteristic pattern of variation, such as a dip of nucleotide diversity around a selected site or a haplotype structure revealed by linkage disequilibrium, persists for some time. For Wright-Fisher populations, such signatures may be detected for up to 0.1N generations after fixation of the driving mutations, where N is the population size [19, 20].
5. Discussion
For a highly beneficial mutation A at locus 1 spreading in a very large population, we have analyzed the scenario when a second beneficial mutant B arises before A has fixed. Under the assumptions that the fitness of B is greater than that of A and that A- and B-carrying chromosomes can recombine, recombinants AB may form and eventually fix. We present approximate formulas for the fixation time of AB under additive fitness of the mutations as a function of the frequency of A at the introduction of B. The latter parameter turns out to be useful for describing the interference between competing beneficial mutations.
Our analysis suggests that the effect of interference between beneficial mutations is most pronounced for small values of
In this parameter range, fixation time decreases substantially with
However, for larger values, fixation time is relatively constant (Table 1). This agrees with the formulas for T, especially Equation (11), which shows that T depends logarithmically on
. We also observed that fixation time is largely determined by the conditions of the repulsion case.
Similarly, the effect of interference on the genomic footprint of competing mutations can be clearly discerned. For small values of
and 0.077, a strongly positive
D was observed, whereas
D is around zero or negative for larger
(
Table 1). Positive values of D indicate that alleles are at intermediate frequencies. This may be observed when a haplotype structure arises in the genome through recombination between different allelic types consisting of multiple polymorphic loci [7, 12].
Finally, we address the question of whether we can expect to observe patterns of overlapping selective sweeps due to competing mutations in recombining genomic regions. Estimates of the average selection coefficient
and the rate
at which beneficial mutations arise and go to fixation (i.e., selective substitutions) are known for some species, including
Drosophila melanogaster. For instance, Jensen et al. [21] analyzed a dataset of genetic variation from the euchromatic part of the genome (roughly speaking, the recombining portion of the chromosomes) of a D. melanogaster population from Africa. They obtained the following estimates:
,
and
per generation per nucleotide site. Since under selection and genetic drift the mean fixation time (conditional on fixation) for a diploid species such as
D. melanogaster is
[
3], we find that the probability of a second substitution arising on a chromosome during the sojourn of the first one to fixation is
per nucleotide site. Multiplying
with the size of the euchromatic part of a chromosome (in D. melanogaster approximately 24 Mb, i.e., 24 million base pairs), we find that on average at about 20 sites of a chromosome strongly selected substitutions could arise and compete with the first mutation during its sojourn to fixation. However, only a part of these substitutions can cause interference. According to our results, a necessary condition for interference to occur is that the second mutation arises while the first mutation is at low frequency, say
The sojourn time of the first mutation (with selection coefficient
) in an interval
is approximately
generations, which includes drift (calculated from Equations (4.25) and (4.41) in [13]). These results suggest that beneficial mutations spend a relatively long time (in this case about 35% of their fixation time) at low frequencies (below 0.1) before they go on to fixation. In the Drosophila case, about 7 of the 20 selected mutations may therefore cause interference. This number, however, could be even higher if, in addition to genetic drift, other diversity-reducing forces such as background selection are incorporated [22].
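The back-of-the-envelope argument of this paragraph can be retraced with the rounded numbers quoted in the text. The sketch below only uses those rounded values (24 Mb, about 20 competing sites, about 35% of the fixation time spent below frequency 0.1); the variable names are hypothetical, and the per-site probability is back-calculated from the rounded figures rather than taken from the original estimates.

```python
# Rounded values quoted in the text.
euchromatin_bp = 24e6        # euchromatic part of a chromosome, 24 Mb
competing_sites = 20         # sites where a second substitution may arise
low_freq_fraction = 0.35     # share of the fixation time spent below frequency 0.1

# Per-site probability implied by the two rounded figures above.
implied_per_site_prob = competing_sites / euchromatin_bp

# Expected number of competing substitutions that arise while the first
# mutation is still at low frequency and may therefore interfere.
interfering_sites = low_freq_fraction * competing_sites
```

The implied per-site probability is on the order of 10^-6 (about 8 × 10^-7), and the product 0.35 × 20 gives the figure of about 7 interfering mutations.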
An example showing that some of these selected substitutions occur in close proximity in a recombining region of the genome is found at the polyhomeotic locus of a European population of D. melanogaster. Voigt et al. [23] report a case in which five selected substitutions (i.e., nearly fixed variants between Europe and Africa) are located in the 5-kb intergenic region between polyhomeotic proximal and the gene CG3835 within a segment of 2.28 kb. They showed that these five selected variants are involved in the adaptation of D. melanogaster to the lower temperatures in Europe compared to those of the ancestral species range in Africa. Variation is generally low in the whole polyhomeotic region, and Tajima’s D is strongly negative, as expected after a sweep. However, using a larger dataset than in her previous study, Susanne Voigt (personal communication) found evidence that the five beneficial substitutions that likely caused the sweep did not act independently in a sequential manner but were selected as a haplotype block. As a consequence, an elevated level of Tajima’s D in the fragment containing the five selected substitutions was not detected.
A more promising example in the context of interference between beneficial mutations may be the Agouti locus in deer mice. Here the precise mutations required for adaptation to the light-colored soil of the Nebraska Sand Hills have been identified [24]. The authors claim that, contrary to the aforementioned Drosophila case, the light Sand Hills phenotype is the result of independent selection on many mutations within the Agouti locus spanning about 120 kb. Thus, in this case, a genomic footprint of interference between beneficial mutations may be encountered in sequence data.