Prioritised Learning in Snowdrift-Type Games

Cooperation is a ubiquitous and beneficial behavioural trait despite being prone to exploitation by free-riders. Hence, cooperative populations are prone to invasion by selfish individuals. However, a population consisting only of free-riders typically does not survive. Thus, cooperators and free-riders often coexist in some proportion. An evolutionary version of the Snowdrift game has proved effective in analysing this phenomenon. But what if the system has already reached its stable state and is then perturbed by a change in environmental conditions? Individuals may then have to re-learn their effective strategies. To address this, we consider behavioural mistakes in strategic choice execution, which we refer to as incompetence. Parametrising the propensity to make such mistakes allows for a mathematical description of learning. We compare strategies by their relative strategic advantage, which relies on both fitness and learning factors. When strategies are learned at distinct rates, learning according to a prescribed order is optimal. Interestingly, the strategy with the lowest strategic advantage should be learnt first if we are to optimise fitness over the learning path. The differences between strategies are then balanced out in order to minimise the effect of behavioural uncertainty.


Introduction
In many situations, the ability to learn becomes important for survival: how fast and how effectively individuals explore their surroundings and learn to react to them determines their chances of survival. Some environments and situations might require specific skills or rapid adaptations, whereas others might require only general awareness. Learning and adaptation are remarkable features of life in natural environments that are prone to frequent changes. Learning can be considered as a continuous multidimensional process in which several skills are acquired in parallel. However, reducing the time step to a very small interval, we may focus on learning only one skill at a time. In this study, we introduce a model of multidimensional stepwise learning by assuming that the learning process can be subdivided into smaller time steps during which only one skill is learned.
Evolutionary games were proposed to answer questions about the effects of natural selection on populations. Here, the game happens at a higher level: it is not played directly by individuals but rather by strategies or skills competing for survival [1,2]. In such settings, learning the most effective strategy first becomes critical for individuals, especially under resource limitation [3]. Models of learning in evolutionary settings often assume that learning evolves in parallel with the evolution itself [4]. In this paper, we consider the case when the system is perturbed after reaching its stable equilibrium, so that finding the most efficient learning path back to the previous equilibrium becomes a challenge. We suggest looking at this problem from the evolutionary perspective, where an optimal learning path is dictated by the chance of survival. This is achieved through maximisation of fitness as a function of the learning path an individual takes.
Behavioural stochasticity is a popular object of study for game theorists. The concept of "trembling hands" [5] was first suggested as an approach to modelling players' mistakes, made with some small probability, during strategy execution. In addition, mutations can be interpreted as reactions to environmental changes (see, for instance, [6][7][8] for studies on mutations). In artificial intelligence and economics this idea is of particular interest, since individuals can be imperfect and incapable of executing their strategies (examples of such studies are [9][10][11]). However, we do not restrict our analysis to mistakes that are random noise but instead allow for biased mistakes and assume that skills can be learnt at different rates. For this, we utilise the notion of incompetence, meaning that individuals make mistakes with certain probabilities; it was first introduced in [12,13] and extended in [14]. Learning is then defined as a process of improving competence by reducing the probabilities of mistakes. Here, we propose the notion of prioritised learning, where the priority in the learning order is determined by the skills' relative advantages. That is, we are interested in a mechanism for discovering an optimal order according to which skills should be learned. This mechanism relies on the interplay between relative fitness and learning advantages, and prioritised learning aims to balance out these differences between the strategies.
We focus on the question of the order in which species should learn their skills to gain an evolutionary advantage. The process of re-learning the effective strategy can be considered as a continuous multidimensional problem; again, reducing the time step to a very small interval, we focus on one skill at a time. We consider the case of the coexistence of two strategies: to cooperate and to defect. Mathematically, this can be described by the Snowdrift (also known as Chicken or Hawk-Dove) game, where two strategies interact and both appear in the behavioural traits of individuals [15].
The question of how cooperation evolves in communities is a rich field of study for mathematicians and biologists. Enforcement, punishment and reciprocity are mechanisms that can sustain stable cooperation [16][17][18][19]. The Prisoner's Dilemma is a well-known example of the problem of cooperation [20,21]: it is a strict game where sustaining cooperation becomes a challenge. However, it can be relaxed if we assume that the benefit gained from cooperation can be received by both players and the costs shared (see [22] for a review). The resulting game has the form of the Snowdrift game, where a stable mixed equilibrium exists [23]. We point out that the focus of our study is not the evolution of cooperative mechanisms as such. Learning to cooperate is a fruitful field with many interesting results [24][25][26]. Here, however, we are more interested in understanding how the system should optimally evolve back to an equilibrium state once it has been disturbed. Specifically, we shall focus on a Snowdrift-type game in order to determine which of the strategies (cooperation or defection) should be learnt first.
When speaking of the optimality of a learning path, we expect that the benefit from a strategy has an impact on the overall population's fitness [27]. We note that the learning and replicator timescales are decoupled: in fact, we require that learning is much slower than reproduction. This reflects the fact that behavioural adaptation might take a longer time to happen. In the classic setting of the replicator dynamics, the dynamics' timescale equals the reproduction timescale; that is, every time step is the length of one generation. However, when modelling the behaviour of individuals, this reproduction timescale represents an interaction timescale, and a behavioural change might take several interactions to be achieved. Hence, we naturally assume that individuals learn more slowly than they interact. For example, in the incompetent version of the Snowdrift game, individuals go through multiple interactions which may change their behaviour from cooperating to defecting.
We show that the fitness measured over the learning path depends on the order of learning along the path and on the extent to which strategies are learnt. Our results demonstrate that it might be preferable to learn the skill prone to higher probabilities of mistakes first and leverage its learning advantage. This suggests that, under environmental or other changes, the strategies that are most disrupted must be adapted more quickly if they are to survive at all. If two skills are equivalent in their relative strategic advantages, we show that both skills are also equivalent with respect to the order of learning. Counter-intuitively, these relative advantages can still be identical even if the relative fitness advantages of the skills are significantly different, suggesting that evolution tries to balance out mistakes. We conjecture that not only the fitness of a skill has to be taken into account but also its degree of incompetence.
In the following section we set up the model and define the notions of relative fitness, learning and strategic advantages. Then, in Section 3 we define a fitness-over-learning objective function measuring how fitness of the population improves over the learning path taken. After that, we proceed to two cases: (a) the case when skills are identical in their strategic advantage and no prioritised learning is needed in Section 4, and (b) the case when skills are distinct in their advantages in Section 5. These two cases demonstrate the notion of hierarchy in the learning order and how mistakes affect the evolution.

Learning Setup
Typically, a population of species acquires a set of skills that its members are required to learn, either while young or while adapting to new environmental conditions. Let us suppose that this set consists of only two essential skills, both of which are needed for survival (or for stable coexistence in the population). Hence, we consider two specific strategies that need to be learnt: cooperation and defection. In the evolutionary context, defection can be interpreted as aggression to capture (rather than share) a resource such as food or territory. We utilise the notion of replicator dynamics in order to describe the evolution of interactions among individuals in the population [28,29]. Let p := p(t) ∈ [0, 1] be the frequency of cooperators at (evolutionary) time t. Then, the dynamics can be expressed as

ṗ = p (f_C − φ),   (1)

where f_C is the fitness of cooperators and φ is the mean fitness function defined as φ := p f_C + (1 − p) f_D, with f_D the fitness of defectors and 1 − p the fraction of defectors. For the purpose of this paper, we use a linear form of the fitness functions, as in [14], under which (1) simplifies to

ṗ = p (1 − p)(f_C − f_D).

We note that if both cooperation and defection are required to some extent, then both strategies might be expected to coexist at equilibrium. Such an equilibrium characterises the Snowdrift game, which is also referred to as an anti-coordination game: since each strategy is the best response to the opposite strategy, both strategies coexist at equilibrium, securing some stable level of cooperation. We construct a reward matrix R for such a game as

R = [ B − C/2   B − C ]
    [ B         0     ],   (2)

where B is the benefit and C is the cost of cooperation. In order to simplify our analysis, we apply the linear transformation from [30] and subtract the diagonal elements from the corresponding columns, as this does not affect the behaviour of the dynamics. We can then consider a reduced form of the matrix,

R = [ 0   a ]
    [ b   0 ],   (3)

where a = B − C and b = C/2.
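As a quick numerical illustration of the reduced dynamics, the following sketch integrates ṗ = p(1 − p)(f_C − f_D) by forward Euler steps; the function name, parameter values and step sizes are our illustrative choices, not the paper's.

```python
# Forward-Euler sketch of the reduced replicator dynamics
# p' = p (1 - p) (f_C - f_D), with f_C = a (1 - p) and f_D = b p
# for the reduced Snowdrift matrix [[0, a], [b, 0]].
def replicator_snowdrift(a, b, p0, dt=0.01, steps=20000):
    p = p0
    for _ in range(steps):
        p += dt * p * (1 - p) * (a * (1 - p) - b * p)
    return p

# With B = 3 and C = 2 we get a = B - C = 1 and b = C/2 = 1, so the
# cooperator frequency should settle at a / (a + b) = 0.5.
p_star = replicator_snowdrift(a=1.0, b=1.0, p0=0.9)
```

Starting from any interior frequency, the trajectory converges to the mixed equilibrium, in line with the anti-coordination structure of the game.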
As we want our game to be a Snowdrift game, we assume that B > C > 0, because then both strategies stably coexist at the equilibrium p̂ = (â, b̂), whose frequencies are given by

â = a/(a + b),   b̂ = b/(a + b).

Then, in the context of learning a set of skills, both skills will be required to be learned. The question we address is: which one should be learnt first? In order to model the learning of these skills in the game, we utilise the notion of incompetence. This concept was first introduced for classic games in 2012 [12] and extended to evolutionary games in 2018 [14]. Here, individuals choose a strategy to play but have non-zero probabilities of executing a different strategy. Such mistakes are a manifestation of the incompetence of players. Mathematically, it is described by the matrix of incompetence, Q, which evolves as individuals learn. This matrix consists of the conditional probabilities q_ij of executing strategy j given that strategy i was chosen, and has the form

Q = [ q_11   q_12 ]
    [ q_21   q_22 ].
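As a worked check (our derivation, using only the reduced payoffs above), the coexistence frequencies follow from equating the two fitnesses:

```latex
% Reduced Snowdrift game with payoff matrix
%   R = \begin{pmatrix} 0 & a \\ b & 0 \end{pmatrix}:
f_C(p) = a(1 - p), \qquad f_D(p) = b\,p .
% Coexistence requires f_C(\hat{p}) = f_D(\hat{p}):
a(1 - \hat{p}) = b\,\hat{p}
\;\Longrightarrow\;
\hat{a} = \hat{p} = \frac{a}{a + b}, \qquad
\hat{b} = 1 - \hat{p} = \frac{b}{a + b}.
% In terms of the benefit and cost (a = B - C,\; b = C/2):
\hat{a} = \frac{B - C}{B - C/2} = \frac{2(B - C)}{2B - C}.
```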
A schematic representation of an interaction under incompetence can be found in Figure 1A. As a measure of learning we use the parameters x and y, one for each strategy. Thus, each strategy can be learned at a different pace, which sets this work apart from the existing literature (see [12,31]). The matrix of incompetence parameters then has the form

X = [ x   0 ]
    [ 0   y ].   (4)

We assume that strategies can be learnt and, hence, that incompetence can decrease from some initial level of propensity to make mistakes. Thus, the learning process is described by the equation

Q(X) = (I − X) S + X,   (5)

where S represents the starting level of incompetence. If x = y = 0, then X = 0 and Q = S. Full competence corresponds to the case x = y = 1, where Q = I, the identity matrix. Each strategy has its own measure of incompetence level and its own time needed for it to be mastered. Then, the new incompetent game reward matrix is defined as

R(X) = Q(X) R Q(X)^T.   (6)

Given the new reward matrix, we require some basic assumptions on the parameters in (3)-(5) to be satisfied in order for the game to avoid bifurcations in the replicator dynamics. Specifically:

1. a, b > 0: this ensures that it is beneficial to learn both strategies.

2. α > â, β > b̂: this is necessary for the new incompetent game to have a stable mixed equilibrium point for any level of incompetence, ∀x, y ∈ [0, 1] [14].
If the parameters of the game do not meet these conditions, then there will be some values of the incompetence parameters for which the system (1) undergoes a bifurcation. This would lead to situations where one of the skills is not beneficial to learn because it is dominated; learning the other, beneficial skill would then be an obvious answer to the question of which skill to learn first. We are instead interested in the case when the optimal learning path involves both skills.

Figure 1. (A) An interaction under incompetence: individual 1 intends to cooperate and individual 2 intends to defect but, under incompetence, individual 1 will cooperate only with probability α and individual 2 will defect only with probability β. Hence, the outcome of the interaction depends on the probability distribution of mistakes. (B) A schematic representation of prioritised learning: we aim to define first an optimal direction in which the learning should start (the x and y directions represent strategies 1 and 2, respectively) and an n-step optimal sequence (x^x_k, y^x_k) if the x direction is optimal (or (x^y_k, y^y_k) if the y direction is optimal). (C) An example of a learning path in the x direction. We start at (x_0, y_0), then take the first half-step in the x direction to (x_1, y_0) and a second half-step in the y direction to (x_1, y_1). By design, we have to start at (0, 0) and finish at (1, 1).
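The incompetence machinery of this section can be sketched as follows; the helper names and the example matrices S and R are illustrative, and the blending formula Q(X) = (I − X)S + X encodes the stated boundary cases Q = S at x = y = 0 and Q = I at x = y = 1.

```python
# Sketch of the incompetence machinery: Q(X) = (I - X) S + X with
# X = diag(x, y), so Q = S at x = y = 0 and Q = I at x = y = 1.
def incompetence_matrix(S, x, y):
    (s11, s12), (s21, s22) = S
    return [[(1 - x) * s11 + x, (1 - x) * s12],
            [(1 - y) * s21,     (1 - y) * s22 + y]]

def incompetent_reward(R, S, x, y):
    """R(X) = Q(X) R Q(X)^T, the reward matrix of the incompetent game."""
    Q = incompetence_matrix(S, x, y)
    QR = [[sum(Q[i][k] * R[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    return [[sum(QR[i][k] * Q[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]

S = [[0.6, 0.4],  # a cooperator executes C with probability 0.6 at x = 0
     [0.3, 0.7]]  # a defector executes D with probability 0.7 at y = 0
R = [[0, 1], [1, 0]]                     # reduced matrix with a = b = 1
Q_start = incompetence_matrix(S, 0, 0)   # equals S: no learning yet
Q_done = incompetence_matrix(S, 1, 1)    # equals the identity: full competence
```

Note that the rows of Q remain stochastic for all x, y ∈ [0, 1], as conditional probability distributions must.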

Maximising Fitness over Learning
The optimality of a learning path can be determined in many ways. In the replicator dynamics with symmetric payoff matrices, interactions follow the evolutionary path that maximises the overall population's fitness [32]. We focus on the fitness function at the equilibrium state, which implies that the population attains a steady-state faster than incompetence parameters change. Technically speaking, we assume either very long timescales of learning or else very fast convergence to steady state, or both.
Hence, it is sufficient to consider the mean fitness function [29] of the Snowdrift game, which for the reduced matrix (3) has the form

φ(p) = (a + b) p (1 − p).

Then, in our new incompetent game, the mean fitness φ(X) = p̂ R(X) p̂^T (7) is a quadratic function of the learning parameters x and y. We note that, in fact, any nontrivial learning path will improve the mean fitness at the equilibrium. This follows from the fact that for any vector x = (x, y) with entries in (0, 1), x H_φ x^T < 0, where H_φ is the Hessian of φ(X). Since we would like to find an optimal learning path that maximises the fitness, it is convenient to consider a new re-scaled fitness function φ̃, whose parameters ã and b̃ play the roles of a and b in the incompetent game (8). An important note for the further analysis is the distinction between fitness and learning advantages. Given that the relative fitness of each strategy is positive, that is, a, b > 0, we say that the strategy with the higher relative fitness obtains a fitness advantage. In addition, we define a learning advantage of a strategy in a very particular way: this advantage arises when the strategy is accompanied by a higher probability of making a mistake in its execution. In such a case, reducing the corresponding level of incompetence offers a greater opportunity for improvement, thereby constituting a learning advantage.
Next, note that the parameters ã and b̃ introduced in (8) are closely connected with the above definitions of fitness and learning advantages. Indeed, their difference allows us to capture the relative tradeoffs between these two types of advantages. Hence, we shall say that the new parameter δ := ã − b̃ measures the relative strategic advantage of cooperators over defectors. That is, if δ > 0, cooperators have an advantage. We illustrate these concepts in Table 1. Next, we shall demonstrate how this notion affects the optimal learning path with respect to the population's mean fitness.

Table 1. Definitions of advantages of cooperators over defectors. For the definition of the advantages of defectors over cooperators, the inequality signs in the parameter comparisons should be reversed.

Advantage            Parameter Comparison   Discussion
Fitness advantage    a > b                  Cooperators have higher fitness and, hence, are more abundant.
Learning advantage   α < β                  Cooperators are more flexible in their strategy execution and can act as defectors.
Strategic advantage  δ > 0 (ã > b̃)          Combines both fitness and learning advantages: if cooperators are disadvantaged in fitness (or learning), an advantage in learning (or fitness) can compensate.
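The claim above, that any nontrivial learning path improves the mean fitness at the equilibrium, can be checked numerically. The sketch below reads φ(X) = p̂ R(X) p̂ᵀ with p̂ fixed at the competent-game equilibrium; the values B = 3, C = 2, α = 0.6, β = 0.7 are our illustrative assumptions.

```python
# Numerical sketch: mean fitness phi(X) = phat . R(X) . phat^T rises
# along the diagonal learning path x = y = t from t = 0 to t = 1.
def mean_fitness(R, S, phat, x, y):
    (s11, s12), (s21, s22) = S
    Q = [[(1 - x) * s11 + x, (1 - x) * s12],   # Q(X) = (I - X) S + X
         [(1 - y) * s21,     (1 - y) * s22 + y]]
    # v = phat Q is the distribution of *executed* strategies,
    # so phat Q R Q^T phat^T = v R v^T.
    v = [phat[0] * Q[0][0] + phat[1] * Q[1][0],
         phat[0] * Q[0][1] + phat[1] * Q[1][1]]
    return sum(v[i] * R[i][j] * v[j] for i in range(2) for j in range(2))

R = [[2.0, 1.0], [3.0, 0.0]]    # full Snowdrift matrix (2) for B = 3, C = 2
phat = (0.5, 0.5)               # equilibrium of the competent game
S = [[0.6, 0.4], [0.3, 0.7]]    # alpha = 0.6 > 0.5, beta = 0.7 > 0.5
phis = [mean_fitness(R, S, phat, t, t) for t in (0.0, 0.5, 1.0)]
# phis increases monotonically towards the fully competent value
```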
However, we need to take into account possible limitations on the learning pace. We assume that only one skill can be learnt to some degree at a time and that two skills cannot be learnt simultaneously. Thus, learning is achieved in a discrete manner (see Figure 1B). The question is then what the optimal learning order is and where the switching points lie. This is determined by defining the learning path as a curve C on the learning space L such that it starts at (0, 0) and ends at (1, 1).

Definition 1. Define the learning space L of an incompetence game as the domain of the incompetence matrices Q(X) from (5), given by the set of all 2 × 2 stochastic matrices.

Definition 2. Define a learning path for an incompetence game as a curve C in the learning space L that starts at (0, 0) and ends at (1, 1).

A learning path C can be a smooth curve or a stepwise path, depending on whether learning is continuous or discrete. We shall consider only stepwise learning paths. This is a natural restriction because, at a small timescale, only one skill can be learned at any given time. However, we will also study the case when the number of steps n → ∞, which approaches a smooth learning curve.

Definition 3.
A stepwise learning path of order n, C_n = {(x_k, y_k) : 1 ≤ k ≤ n}, is a stepwise curve in the learning space L connecting the n points {(x_k, y_k)}_{k=1}^n and satisfying conditions (a)-(d), where conditions (a) and (b) imply ∆x_k, ∆y_k > 0 for 1 ≤ k ≤ n − 1. The path segment from (x_k, y_k) to (x_{k+1}, y_{k+1}) may consist of the sequence (x_k, y_k), (x_{k+1}, y_k), (x_{k+1}, y_{k+1}), in which case we say that the x direction was taken. The alternative path segment is (x_k, y_k), (x_k, y_{k+1}), (x_{k+1}, y_{k+1}), indicating that the y direction was chosen first. Here, we focus on alternating stepwise paths where the first direction determines the remaining path. Consequently, an n-step learning path is described by the points {(x_k, y_k)}_{k=1}^n satisfying (a)-(d) above, resulting in two possible learning paths corresponding to the direction of the first step. Let Φ_x(C_n) denote the fitness-over-learning in the x direction for the learning path C_n ∈ L_n (9), and let Φ_y(C_n) denote the fitness-over-learning in the y direction (10). That is, for a given n ∈ N, we have two objective functions: Φ_x and Φ_y. Finding their maxima separately yields an optimal learning path with the optimal direction. In what follows, we define the optimal learning paths in the x and y directions and the overall optimal learning path that maximises the population's measure of fitness over learning.
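The two alternating stepwise paths of Definition 3 (cf. Figure 1C) can be enumerated as follows; the helper name and the equally spaced example points are our illustrative choices.

```python
def stepwise_path(points, first="x"):
    """Expand step points {(x_k, y_k)} into the vertices of an alternating
    stepwise learning path.
    first='x': (x_k, y_k) -> (x_{k+1}, y_k) -> (x_{k+1}, y_{k+1})
    first='y': (x_k, y_k) -> (x_k, y_{k+1}) -> (x_{k+1}, y_{k+1})"""
    path = [points[0]]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        mid = (x1, y0) if first == "x" else (x0, y1)
        path.extend([mid, (x1, y1)])
    return path

# n = 4 equally spaced step points from (0, 0) to (1, 1)
pts = [(k / 3, k / 3) for k in range(4)]
path_x = stepwise_path(pts, first="x")   # first half-step in x
path_y = stepwise_path(pts, first="y")   # first half-step in y
```

Both expanded paths start at (0, 0) and finish at (1, 1); only the interleaving of the half-steps differs.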
Definition 5. The optimal learning path C*_n ∈ L_n with respect to the population's mean fitness function φ(X) : L_n → R is the path attaining max{Φ_x(C*x_n), Φ_y(C*y_n)}, where C*x_n, C*y_n are the optimal paths in the x and y directions, respectively. The superscript (respectively, subscript) indicates the direction of the path. That is, if Φ*_x > Φ*_y, then the x direction is optimal; when Φ*_x = Φ*_y, both C*x_n and C*y_n are optimal and it does not matter which direction we take. Hence, we can define prioritised learning in these settings.

Definition 6.
We say that there exists prioritised learning for Φ among stepwise learning paths of order n if there exists C*_n such that one of the directions is preferable over the other, that is, Φ*_x ≠ Φ*_y. Given the structure of the fitness functions (7) and (9)-(10), we can explicitly derive Φ_x and Φ_y for the learning paths in the x and y directions. However, we will show that the optimal direction of learning can be determined simply by the sign of the relative strategic advantage, δ.

No Strategic Advantages
Throughout this section, we assume that no strategy has a relative strategic advantage (δ = 0). We first note that in this case both objective functions, Φ_x and Φ_y, exhibit a symmetry relation.

Theorem 1. If δ = 0, then there is no difference in the direction of optimal learning, that is, Φ*_x = Φ*_y.

Hence, if there is no relative strategic advantage in the game (δ = 0), the order of learning does not affect the fitness of the population. It is therefore sufficient to calculate only one path that maximises the fitness-over-learning of the population. In the following proposition we show that this learning path has a remarkably simple form.

Proposition 1.
If δ = 0, then the unique optimal stepwise learning path of order n in the x direction, C*x_n = {(x*x_k, y*x_k) : 1 ≤ k ≤ n} ∈ L_n, consists of equally distributed steps (11). See Mathematical Appendix A for the proofs of Theorem 1 and Proposition 1. Interestingly, the optimal solution for the x direction yields x*x_k, y*x_k such that y*x_k ≥ x*x_k for all k. That is, each step in the y direction, y_k, is at least as large as the corresponding step in the x direction, x_k.
To analyse how increasing the number of steps changes the objective function, we consider the rate of change of the fitness-over-learning function at the optimal solution (12). Substituting the optimal solution (11) into (12) yields an explicit expression for this rate of change. Due to symmetry, when δ = 0, it follows that ∆Φ*_x = ∆Φ*_y, and both are positive. Therefore, the smaller the learning steps, the greater the benefit; however, the marginal increases tend to 0 as n becomes large. Arguably, this illustrates the "law of diminishing returns" of stepwise learning.
Consider two example games with reward matrices R_1, R_2 and initial incompetence matrices S_1, S_2. In terms of fitnesses, these examples are different: the fixed points of the replicator equations are (0.5, 0.5) and (0.9, 0.1), respectively. Hence, the fitness of strategy 1 is higher in the second game. However, setting α_1 = β_1 = 0.5 and α_2 = 0.9, β_2 = 0.1, the relative strategic advantages in Examples 1 and 2 are both equal to 0 (δ_1 = δ_2 = 0); that is, S_1 and S_2 equalise the strategies. The high probability of mistakes for strategy 2 in R_2 signals that it is more disrupted by incompetence. This makes the optimal learning paths for these two games identical (see Figure 2A).
Next we shall consider the case when δ ≠ 0. In this case, one of the strategies has a relative strategic advantage, depending on the sign of δ. We show that the order of the learning path now matters and influences the value of the fitness-over-learning function; this is what we call prioritised learning.

Prioritised Learning
First, we recall the notion of prioritised learning used henceforth. By Definition 6, there exists prioritised learning for Φ among stepwise learning paths of order n if there exists C*_n such that one of the directions is preferable over the other along that path, that is, Φ*_x ≠ Φ*_y. We shall next characterise an optimal solution in the x direction.

Proposition 2. Let n ∈ N, n ≥ 2. If δ < 1/(2(n − 2)), then the optimal learning path of order n in the x direction exists and is given by (13).

We refer the reader to Mathematical Appendix A for more details. Next, we provide conditions under which the y direction defines the optimal learning path.

Proposition 3. Let n ∈ N, n ≥ 2. If δ > −1/(2(n − 2)), then the optimal learning path of order n in the y direction exists and is given by (14).

Hence, depending on the value of δ, the optimal learning path has different directions. The threshold for δ is equal to 1/(2(n − 2)), which for a sufficiently large number of steps is nearly 0. However, in the proof of Theorem 2 (see Mathematical Appendix A), we show that it is the sign of δ that determines the direction of learning.

Theorem 2.
The direction of the optimal learning path is determined by the sign of δ: for δ > 0 the y direction is optimal and for δ < 0 the x direction is optimal.
For δ = 0, the optimal learning path consists of n − 2 equally distributed steps, and the direction of the first step does not affect the fitness-over-learning. If δ ≠ 0, the first and last steps in each direction differ in size from the interior steps: the first step of the learning path aims to adjust the fitness and learning advantages between the two strategies, while in the interior of the learning space the path still takes n − 2 equally distributed steps. In Figure 3 we display the objective function Φ for different positive values of δ and different numbers of steps n. The images below the colormap highlight the changes in Φ* as the number of steps n varies. The difference in the values of Φ* with respect to n is marginal, hence we zoom into several values of δ. For δ < 1, Φ*_y increases in n, suggesting that taking more steps is beneficial. However, Φ*_y decreases in n for δ > 1, suggesting one-step learning of the skill. We show this in the next result, which follows immediately from Propositions 2 and 3 and Theorem 2.

Corollary 1.
For |δ| > 1, we obtain two cases: (i) If δ ≥ 1, then the optimal learning curve is a one-step function in the y direction. (ii) If δ ≤ −1, then the optimal learning curve is a one-step function in the x direction.
If |δ| > 1, then the optimal solution from (13) and (14) is only feasible for n = 2. However, if δ is positive and less than 1, the greatest fitness-over-learning is achieved for a smooth learning path along the line y = δ + x. This can be seen as a consequence of allowing n to approach infinity.

Corollary 2.
For 0 < δ < 1, the optimal stepwise learning path {(x*y_k, y*y_k)}_{k=1}^n in the y direction, as n → ∞, follows the relation y*y = δ + x*y. The same relation between x* and y* is obtained for the optimal solution in the x direction when δ < 0 and n → ∞: when we initiate learning in the x direction, we start with x*x = −δ at y*x = 0 and continue to follow the relationship y*x = δ + x*x, with y*x ∈ (0, 1 + δ). We demonstrate the effect of the relative strategic advantage on the optimal learning path by considering the game with the fitness matrix R_3 and the incompetence matrix S_3, referred to as Example 3. Strategy 2 obtains a learning advantage (β_3 = 0.1 < α_3 = 0.9). We give strategy 1 a fitness advantage by selecting three values for a: 5, 7 and 9, which result in δ_3 ≈ 0.74, 0.28 and 0, respectively. Smaller values of a then result in a larger first step (see Figure 2B).
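For 0 < δ < 1 the limiting optimal path in the y direction jumps to (0, δ) and then follows y = δ + x; the sampler below is an illustrative sketch of this limit (the discretisation and endpoint handling are our assumptions).

```python
def limiting_path(delta, m=101):
    """Sample the smooth limiting path for 0 < delta < 1: a first jump
    to (0, delta), then along y = delta + x until y reaches 1,
    finishing at (1, 1)."""
    xs = [k * (1 - delta) / (m - 1) for k in range(m)]
    pts = [(0.0, 0.0)] + [(x, x + delta) for x in xs] + [(1.0, 1.0)]
    return pts

pts = limiting_path(0.28)   # delta for Example 3 with a = 7
# every interior point satisfies y - x = delta
```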
The next natural question is the influence of the number of steps we make. Computing the exact form of Φ*_x and Φ*_y at the optimal solutions and taking their rates of change in n yields expressions that are positive for any n ≥ 2. For δ ≫ 0 we can observe a negative ∆Φ*_x, identifying that the x direction is no longer preferred. This indicates that the bigger the difference between the skills, the lower the potential fitness-over-learning that can be gained. In this sense, the skill with higher incompetence might reduce the fitness, as it requires some investment for the skills to be learnt.
Our learning scheme allows for an adjustment of individuals' behaviour in case of any disruption leading to behavioural mistakes. The number of steps required for such an adjustment can be as low as 2, allowing for a quick reaction to the system's uncertainty. Moreover, such an adjustment does not require the system to stop interactions for learning: individuals can continue interacting with their group-mates while their behavioural mistakes are reduced and fitness is maximised.

Conclusions
In this paper, we considered the evolutionary game where two skills coexist in a mixed equilibrium, and hence are both required. This is a key assumption as we aimed to answer the question: If both strategies are important, then how do we learn them in an optimal way? We introduced a fitness-over-learning function which measures the improvement in fitness of the population over the learning path that was taken. This function relies on both performance of the strategy and its rate of mistakes.
The naive suggestion would be that the skill most advantageous in terms of fitness has to be learnt first. However, in the optimal learning path it is the strategy with the lower relative strategic advantage that is learnt first. We conjecture that this adjusts the difference between the skills; once they are comparable, optimal learning suggests learning both skills at equal rates. These findings indicate that, once disrupted, selection tries to recover the most affected strategies first, even if their fitness is not the highest. Nonetheless, if the fitness difference is high enough to overcome the effect of incompetence, then optimal learning will demand that the better strategy is learned first. Another possible interpretation is to consider the mixed equilibrium as a mixed strategy used by players: by learning the less-advantageous strategy, individuals reach the nearest optimal mixed strategy.
Importantly, we parametrised the notion of strategic advantage of cooperation versus defection with a single quantity, δ, which captures the tradeoffs between fitness and the propensity to make execution errors in these two modes of behaviour. Interestingly, we showed that this quantity has a critical threshold absolute value of 1. Namely, if |δ| < 1, then our results imply that learning by many small steps is preferable to learning by fewer large steps. Arguably, this captures the belief that complex skills are best learned incrementally. However, if |δ| > 1, then Corollary 1 shows that coexistence is preserved by one of only two possible learning paths: (a) full learning first in the x direction, followed by full learning in the y direction; or (b) the other way around. This suggests that a sufficiently large strategic advantage of cooperation over defection (or the converse) eliminates the luxury of incremental learning.
In addition, the number of steps in the learning path maximising the fitness is not bounded. Indeed, taking many small learning steps improves the observed fitness. However, as demonstrated in Figure 3, there may exist a number of steps n* after which the increase in the objective function becomes insignificant. Hence, in applications we can determine a number of steps sufficient to achieve a target level of the fitness-over-learning function.
Overall, the learning scheme proposed in this paper can be used to correct behavioural uncertainty when the system has already reached its equilibrium but was disrupted. However, our formulation has its limitations. The main limitation is that we allow only one skill to be learnt at a time; the restrictiveness of this assumption, however, decreases with an increasing number of steps. The second limitation of our scheme is that the direction of the learning path can only be chosen at the very beginning and cannot be changed while individuals are learning. While it may be more natural to permit the learning direction to change at each step, this would also require more resources to be spent on learning, and the cost of learning would then need to be taken into account. Such extensions can be studied in future research.

Acknowledgments: The authors would like to thank Patrick McKinlay for his work on the preliminary results for this paper.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Mathematical Appendix
Here, we provide all the formal proofs for the results in the manuscript.

Lemma A1. Let φ̃(X) be the fitness function and let C*x_n ∈ L_n. Then Φ_x and Φ_y can be simplified in terms of the constant L := 2ãb̃ + (1/2)(ã + b̃).
Appendix A.1. Proof of Theorem 1

Proof. Let ã = b̃ and let {x*x_k, y*x_k}_{k=1}^n ∈ L_n be the optimal strategy in the x direction. By the symmetry of the simplified expressions in Lemma A1 under ã = b̃, we obtain Φ_x = Φ_y at the optimum. The same argument holds in the opposite direction, which completes the proof.
Let ν_H be an eigenvalue of H; then ψ is an eigenvalue of BB^T. Hence, the eigenvalues of H are given by ν_H = −c ± √ν_{BB^T}, and we obtain that H evaluated at the proposed optimal path is negative definite. Then, the path is the optimal solution in the y direction, which completes the proof.

Appendix A.3. Proof of Proposition 2
Proof. As before, we maximise Φ_x over the learning path; in what follows we omit the superscript x. Taking the partial derivatives for 2 ≤ k ≤ n − 1 and using properties (a)-(b), that is, ∆x_k ≠ 0 and ∆y_k ≠ 0 for all k = 1, ..., n − 1, we obtain the necessary conditions for the critical points. Together, for 3 ≤ k ≤ n − 1, these imply that ∆x*_k := c ∈ R for 2 ≤ k ≤ n − 1. Then, the interval [x*_2, 1] is divided into n − 2 equally sized segments, so that

c = (1 − x*_2)/(n − 2), and therefore x*_k = x*_2 + (k − 2)c for 2 ≤ k ≤ n.

Similarly, we obtain ∆²y*_k = 0 for all k = 1, ..., n − 3. Hence, ∆y*_k := d ∈ R for 1 ≤ k ≤ n − 2, implying that the interval [0, y*_{n−1}] is divided into n − 2 equally sized segments, such that d = y*_{n−1}/(n − 2). To find x*_2 and y*_{n−1}, we utilise relations (A7) and (A8). Combining both equations yields

x*_2 = −δ + (1 + δ)/(2n − 3) and y*_{n−1} = (1 + δ)(2n − 4)/(2n − 3),

which is only feasible for δ < 1/(2(n − 2)), to guarantee x*_2, y*_{n−1} ∈ (0, 1]; if δ < −1, it is only feasible for n = 2. Hence, the only path satisfying the necessary condition to be the optimal path in the x direction is the one given in (13). To support the claim that this learning path maximises the fitness-over-learning function, we consider the Hessian matrix of Φ; it satisfies |ν_B| < c and ν_H < 0. This completes the proof that the learning path is optimal for the x direction.