The Rescaled Pólya Urn and the Wright—Fisher Process with Mutation

Giacomo Aletti; Irene Crimaldi

doi:10.3390/math9222909

¹ Environmental Science and Policy Department, Università degli Studi di Milano, 20133 Milan, Italy

² IMT School for Advanced Studies Lucca, 55100 Lucca, Italy

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Mathematics 2021, 9(22), 2909; https://doi.org/10.3390/math9222909

This article belongs to the Special Issue Bayesian Predictive Inference and Related Asymptotics—Festschrift for Eugenio Regazzini's 75th Birthday


Abstract

In recent papers the authors introduce, study and apply a variant of the Eggenberger—Pólya urn, called the “rescaled” Pólya urn, which, for a suitable choice of the model parameters, exhibits a reinforcement mechanism mainly based on the last observations, a random persistent fluctuation of the predictive mean and the almost sure convergence of the empirical mean to a deterministic limit. In this work, motivated by some empirical evidence, we show that the multidimensional Wright—Fisher diffusion with mutation can be obtained as a suitable limit of the predictive means associated to a family of rescaled Pólya urns.

Keywords:

Pólya urn; predictive mean; urn model; Wright—Fisher diffusion

1. Introduction

The well-known standard Eggenberger—Pólya urn [1,2] works as follows. An urn initially contains

N_{0, i}

balls of color i, for

i = 1, \dots, k

, and at each time-step, a ball is drawn from the urn and then it is returned into the urn together with

α > 0

additional balls of the same color (here and in the following, the expression “number of balls” is not to be understood literally, but all the quantities are real numbers, not necessarily integers). Hence, denoting by

N_{n, i}

the number of balls of color i inside the urn at time-step n, we have

N_{n, i} = N_{n - 1, i} + α ξ_{n, i} \quad \text{for } n \geq 1,

where

ξ_{n, i} = 1

if the drawn ball at time-step n is of color i, and

ξ_{n, i} = 0

otherwise. The parameter

α

tunes the reinforcement mechanism: the greater the

α

, the greater the dependence of

N_{n, i}

on

\sum_{h = 1}^{n} ξ_{h, i}

.

In [3,4,5], the rescaled Pólya (RP) urn has been introduced, studied, generalized and applied. This model differs from the original one by the introduction of a parameter

β

such that

\begin{aligned} N_{n, i} &= b_{i} + B_{n, i} \quad \text{with} \\ B_{n + 1, i} &= β B_{n, i} + α ξ_{n + 1, i}, \quad n \geq 0 . \end{aligned}

Therefore, at time-step 0, the urn contains

b_{i} + B_{0, i} > 0

balls of color i and the parameters

α > 0

and

β \geq 0

regulate the reinforcement mechanism. More precisely, the term

β B_{n, i}

connects

N_{n + 1, i}

to the “configuration” at time-step n by means of the “scaling” parameter

β

, and the term

α ξ_{n + 1, i}

connects

N_{n + 1, i}

to the outcome of the drawing at time-step

n + 1

by means of the parameter

α

. The case

β = 1

corresponds to the standard Eggenberger—Pólya urn with an initial number

N_{0, i} = b_{i} + B_{0, i}

of balls of color i. When

β < 1

, the RP urn model shows the following three characteristics:

(i): A reinforcement mechanism mainly based on the last observations;
(ii): A random persistent fluctuation of the predictive mean $ψ_{n, i} = P (ξ_{n + 1, i} = 1 \mid ξ_{h, j}, 1 \leq h \leq n, 1 \leq j \leq k)$ ;
(iii): The almost sure convergence of the empirical mean $\sum_{n = 1}^{N} ξ_{n, i} / N$ to the deterministic limit $p_{i} = b_{i} / \sum_{j = 1}^{k} b_{j}$ , and a chi-squared goodness of fit result for the long-term probability distribution $\{p_{1}, \dots, p_{k}\}$ .

Regarding point (iii), the chi-squared statistic

χ^{2} = N \sum_{i = 1}^{k} \frac{{(O_{i} / N - p_{i})}^{2}}{p_{i}},

where N is the sample size and

O_{i} = \sum_{n = 1}^{N} ξ_{n, i}

the number of sampled observations equal to i, is asymptotically distributed as

χ^{2} (k - 1) λ

, with

λ > 1

. Therefore, the presence of correlation among observations attenuates the effect of N, which multiplies the chi-squared distance between the observed frequencies and the expected probabilities. This is a key feature for statistical applications in the framework of a “big sample”, where a small value of the chi-squared distance might be significant, and hence a correction related to the correlation between observations is required. In [3,5], a possible application in the context of clustered data was described, with independence between clusters and correlation due to a reinforcement mechanism inside each cluster.
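As a small illustration of how the sample size N enters the statistic above, the following sketch evaluates it for the same observed frequencies at two sample sizes. The function name `chi_squared_statistic` and the example counts are hypothetical, chosen only for illustration; estimating the constant λ is not covered here.

```python
def chi_squared_statistic(counts, p):
    """Return N * sum_i (O_i / N - p_i)^2 / p_i for observed counts O_i."""
    n_total = sum(counts)
    if n_total <= 0:
        raise ValueError("counts must contain at least one observation")
    return n_total * sum(
        (o / n_total - p_i) ** 2 / p_i for o, p_i in zip(counts, p)
    )


# Same observed frequencies (0.52, 0.48) against p = (0.5, 0.5),
# at two different (made-up) sample sizes.
small = chi_squared_statistic([52, 48], [0.5, 0.5])
large = chi_squared_statistic([5200, 4800], [0.5, 0.5])
```

The chi-squared distance is identical in both cases, so the statistic grows linearly with N. Under the RP urn it is compared with λ times a χ²(k - 1) variable, with λ > 1, rather than with χ²(k - 1) itself.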

In [4], the RP urn was applied as a good model for the evolution of the sentiment associated with Twitter posts. Precisely, we analyzed three data sets: (i) the “COVID-19 epidemic” data set covers the period from 21 February to 20 April 2020 and includes tweets in Italian about the COVID-19 epidemic; (ii) the “Migration debate” data set refers to the period from 23 January to 22 February 2019 and the collected posts are related to the Italian debate on migration; (iii) the “10 days of traffic” data set collects the entire traffic of posts in Italian in the period from 1 September to 10 September 2019. For every post, the relative sentiment, that is, the positive or negative connotation of the text, was computed using the polyglot Python module developed in [6], which provides a numerical value

v \in [- 1, 1]

for the sentiment of a post (for a survey on sentiment analysis, also known as opinion mining, we refer to [7] and references therein). We fixed a threshold T so that a tweet with

v > T

was classified as a tweet with a positive sentiment and one with

v < - T

was classified as a tweet with a negative sentiment. Tweets with a value

v \in [- T, T]

were discarded. We took the following different values for T:

T = 0

,

T = 0.35

and

T = 0.5

. We applied the RP urn model, ordering the tweets according to their creation time and taking each tweet with a positive/negative classification as an extraction in the urn model. More specifically, we applied the RP model with

k = 2

: the time series of the tweets represents the time series of the extractions from the urn, that is, the random variables

ξ_{n, 1}

. The event

{ξ_{n, 1} = 1}

means that tweet n exhibits a positive sentiment, while

{ξ_{n, 1} = 0}

means that tweet n exhibits a negative sentiment. For all the considered data sets, the estimated values of

β

were strictly smaller than 1, but very near to 1 (details about the parameters estimation can be found in [4]). Note that the RP urn dynamics with such a value for

β

cannot be approximated by the standard Pólya urn (

β = 1

), because one would lose the fluctuations of the predictive means and the possibility of touching the barriers

{0, 1}

. In this work, we show that the law of such an RP urn process can be approximated by a Wright—Fisher diffusion with mutation. More precisely, we prove that the multidimensional Wright—Fisher diffusion with mutation can be obtained as a suitable limit of the predictive means associated with a family of RP urns with

β \in [0, 1), β \to 1

. As an example, in Figure 1, for the data set “COVID-19 epidemic”, we show the plot of the process

{(ψ_{n, 1})}_{n}

, reconstructed from the data (details about the reconstruction process can be found in [4]) and rescaled in time as

t = n {(1 - β)}^{2}

, the plot of a simulated (by the Euler–Maruyama method) trajectory of the Wright—Fisher process, the plot of the approximation of this trajectory by means of the RP urn and the approximation of the data process by means of the standard Pólya urn.
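The thresholding rule described above, which turns sentiment scores into the binary extraction series, can be sketched as follows. The function names and the sample scores are hypothetical; the scores are assumed to be already ordered by creation time, and the sentiment computation itself is not reproduced.

```python
def classify_sentiment(v, threshold):
    """Map a sentiment score v in [-1, 1] to 1 (positive), 0 (negative) or None."""
    if v > threshold:
        return 1
    if v < -threshold:
        return 0
    return None  # v in [-T, T]: the tweet is discarded


def extraction_series(scores, threshold):
    """Keep the classified tweets, preserving the given (creation-time) order."""
    labels = (classify_sentiment(v, threshold) for v in scores)
    return [label for label in labels if label is not None]


# Made-up scores, for illustration only.
series = extraction_series([0.6, -0.1, -0.7, 0.4, 0.0], 0.35)
```

Each retained label plays the role of ξ_{n, 1} for the urn with k = 2.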

Figure 1. “COVID-19 epidemic” Twitter data set: the black line is the process

{(ψ_{n, 1})}_{n}

, reconstructed from the data and rescaled in time as

t = n {(1 - β)}^{2}

; the red line is a simulated trajectory of the Wright—Fisher process; the orange line is the approximation of this trajectory by means of the RP urn and the blue line is the approximation of the data process by means of the standard Pólya urn. The numbers 0,

0.35

and

0.5

refer to the values chosen for the threshold T. The corresponding estimated values for

1 - β

are: 0.000776 (

8 \times 10^{- 4}

), 0.00115 (

11 \times 10^{- 4}

) and 0.00130 (

13 \times 10^{- 4}

).

The Wright—Fisher (WF) class of diffusion processes models the evolution of the relative frequency of a genetic variant, or allele, in a large randomly mating population with a finite number k of genetic variants. When

k = 2

, the WF diffusion obeys the one-dimensional stochastic differential equation

d X_{t} = F (X_{t}) d t + \sqrt{X_{t} (1 - X_{t})} d W_{t}, X_{0} = x_{0}, t \in [0, T] .

(1)

The drift coefficient,

F : [0, 1] \to R

, can include a variety of evolutionary forces such as mutation and selection. For example,

F (x) = p_{1} - (p_{1} + p_{2}) x = p_{1} (1 - x) - p_{2} x

describes a process with recurrent mutation between the two alleles, governed by the mutation rates

p_{1} > 0

and

p_{2} > 0

. The drift vanishes when

x = p_{1} / (p_{1} + p_{2})

which is an attracting point for the dynamics. Equation (1) can be generalized to the case

k > 2

. The WF diffusion processes are widely employed in Bayesian statistics, as models for time-evolving priors [8,9,10,11] and as a discrete-time finite-population construction method of the two-parameter Poisson–Dirichlet diffusion [12]. They have been applied in genetics [13,14,15,16,17,18], in biophysics [19,20], in filtering theory [21,22] and in finance [23,24].
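For orientation, a minimal Euler–Maruyama sketch for Equation (1) with the mutation drift F(x) = p_1 - (p_1 + p_2) x is given below. The function name, the step count and the clipping of the state to [0, 1] are assumptions of this sketch and are not taken from the paper; the plain scheme can leave the unit interval near the barriers, which is one reason why simulation from (1) is delicate.

```python
import math
import random


def euler_maruyama_wf(x0, p1, p2, t_max, n_steps, seed=0):
    """Euler-Maruyama path of dX = (p1 - (p1 + p2) X) dt + sqrt(X (1 - X)) dW."""
    rng = random.Random(seed)
    dt = t_max / n_steps
    x = x0
    path = [x]
    for _ in range(n_steps):
        drift = p1 - (p1 + p2) * x
        noise = math.sqrt(x * (1.0 - x)) * rng.gauss(0.0, math.sqrt(dt))
        # Ad hoc clipping (an assumption of this sketch): keeps the state in
        # [0, 1] so that the square root stays well defined at the next step.
        x = min(1.0, max(0.0, x + drift * dt + noise))
        path.append(x)
    return path


path = euler_maruyama_wf(x0=0.5, p1=1.0, p2=1.0, t_max=1.0, n_steps=1000)
```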

The benefit coming from the proven limit result is twofold. First, the known properties of the WF process can give a description of the RP urn when the parameter

β

is strictly smaller than one, but very near to one. Second, the given result might furnish the theoretical base for a new simulation method of the WF process. Indeed, the simulation from Equation (1) is highly nontrivial because there is no known closed form expression for the transition function of the diffusion, even in the simple case with null drift [25].

The rest of the paper is organized as follows. In Section 2, we set up our notation and we formally define the RP urn model. Section 3 provides the main result of this work, that is, the convergence result of a suitable family of predictive means associated with RP urns with

β \to 1

. In Section 4, employing the boundary classification of the WF diffusion with mutation and connecting it to the parameters of the RP urn model, we introduce, for an RP urn with a value of

β

very near to 1, the notion of recessive subsets of colors and the notion of dominant color. These two concepts are related to the possibility of reaching the barriers 0 and 1 by the predictive means of the urn process. Finally, Section 5 summarizes the work and concludes it.

2. The Rescaled Pólya Urn

For a vector

x = {(x_{1}, \dots, x_{k})}^{⊤} \in R^{k}

, we set

| x | = \sum_{i = 1}^{k} | x_{i} |

and

{∥ x ∥}^{2} = x^{⊤} x = \sum_{i = 1}^{k} {| x_{i} |}^{2}

. Moreover we denote by

\mathbf{1}

and

\mathbf{0}

the vectors with all the components equal to 1 and equal to 0, respectively.

Let

α > 0

and

β \geq 0

. At time-step 0, the urn contains

b_{i} + B_{0, i} > 0

distinct balls of color i, with

i = 1, \dots, k

. We set

\mathbf{b} = {(b_{1}, \dots, b_{k})}^{⊤}

and

B_{0} = {(B_{0, 1}, \dots, B_{0, k})}^{⊤}

. We suppose

b = | \mathbf{b} | > 0

and we set

p = \frac{\mathbf{b}}{b}

. At each time-step

(n + 1) \geq 1

, a ball is drawn at random from the urn and we define the random vector

ξ_{n + 1} = {(ξ_{n + 1, 1}, \dots, ξ_{n + 1, k})}^{⊤}

as

ξ_{n + 1, i} = \begin{cases} 1 & \text{when the drawn ball at time-step } n + 1 \text{ is of color } i, \\ 0 & \text{otherwise.} \end{cases}

The number of balls inside the urn is updated as follows:

N_{n + 1} = \mathbf{b} + B_{n + 1} \quad \text{with} \quad B_{n + 1} = β B_{n} + α ξ_{n + 1},

(2)

which gives

B_{n} = β^{n} B_{0} + α β^{n} \sum_{h = 1}^{n} β^{- h} ξ_{h} .

(3)
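The update rule (2) and the closed form (3) can be checked against each other numerically. The sketch below simulates the draws and compares the recursively updated vector with (3), rewritten with the non-negative powers β^{n - h} (the form that also appears in (8)). The function names and parameter values are hypothetical, and colors are indexed from 0.

```python
import random


def rp_urn_path(b_vec, B0, alpha, beta, n_steps, seed=0):
    """Simulate draws xi_1..xi_n and the vectors B_0..B_n of an RP urn, Eq. (2)."""
    rng = random.Random(seed)
    k = len(b_vec)
    B = list(B0)
    draws, history = [], [list(B)]
    for _ in range(n_steps):
        weights = [b_vec[i] + B[i] for i in range(k)]  # N_n = b + B_n
        color = rng.choices(range(k), weights=weights)[0]
        xi = [1 if i == color else 0 for i in range(k)]
        B = [beta * B[i] + alpha * xi[i] for i in range(k)]
        draws.append(xi)
        history.append(list(B))
    return draws, history


def closed_form_B(B0, alpha, beta, draws):
    """Evaluate Eq. (3), written with beta^(n - h) to avoid negative powers."""
    n = len(draws)
    return [
        beta ** n * B0[i]
        + alpha * sum(beta ** (n - h) * draws[h - 1][i] for h in range(1, n + 1))
        for i in range(len(B0))
    ]


draws, history = rp_urn_path([1.0, 2.0], [0.5, 0.5], alpha=1.0, beta=0.9, n_steps=50)
check = closed_form_B([0.5, 0.5], 1.0, 0.9, draws)
```

The two computations agree up to floating-point error, whatever the realized sequence of draws.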

Similarly, from the equality

| B_{n + 1} | = β | B_{n} | + α,

we get, using

\sum_{h = 0}^{n - 1} x^{h} = (1 - x^{n}) / (1 - x)

,

| B_{n} | = β^{n} | B_{0} | + α \sum_{h = 1}^{n} β^{n - h} = β^{n} (| B_{0} | - \frac{α}{1 - β}) + \frac{α}{1 - β} .

(4)

Setting

r_{n}^{*} = | N_{n} | = b + | B_{n} |

, that is the total number of balls inside the urn at time-step n, we get the relations

r_{n + 1}^{*} = r_{n}^{*} + (β - 1) | B_{n} | + α

(5)

and

r_{n}^{*} = b + \frac{α}{1 - β} + β^{n} (| B_{0} | - \frac{α}{1 - β}) .

(6)

Denoting by

F_{0}

the trivial

σ

-field and setting

F_{n} = σ (ξ_{1}, \dots, ξ_{n})

for

n \geq 1

, the conditional probabilities

ψ_{n} = {(ψ_{n, 1}, \dots, ψ_{n, k})}^{⊤}

of the extraction process, also called predictive means, are

ψ_{n} = E [ξ_{n + 1} | F_{n}] = \frac{N_{n}}{| N_{n} |} = \frac{\mathbf{b} + B_{n}}{r_{n}^{*}}, \quad n \geq 0

(7)

and, from (3) and (4), we have

ψ_{n} = \frac{\mathbf{b} + β^{n} B_{0} + α \sum_{h = 1}^{n} β^{n - h} ξ_{h}}{b + \frac{α}{1 - β} + β^{n} (| B_{0} | - \frac{α}{1 - β})} .

(8)

The dependence of

ψ_{n}

on

ξ_{h}

is regulated by the factor

f (h, n) = α β^{n - h}

, with

1 \leq h \leq n, n \geq 0

. In the case of the standard Eggenberger—Pólya urn (i.e., the case

β = 1

), each observation

ξ_{h}

has the same “weight”

f (h, n) = α

. Instead, when

β < 1

the factor

f (h, n)

increases with h, and the main contribution is given by the most recent drawings. The case

β = 0

is an extreme case, for which

ψ_{n}

depends only on the last drawing

ξ_{n}

.
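The role of the weights f(h, n) = α β^{n - h} can be made concrete with a short sketch. The helper names below are hypothetical; `recent_share` requires 1 ≤ m ≤ n and returns the fraction of the total weight carried by the last m observations.

```python
def observation_weights(alpha, beta, n):
    """Weights f(h, n) = alpha * beta^(n - h) for h = 1, ..., n."""
    return [alpha * beta ** (n - h) for h in range(1, n + 1)]


def recent_share(beta, n, m):
    """Fraction of the total weight carried by the last m of n observations."""
    if not 1 <= m <= n:
        raise ValueError("m must satisfy 1 <= m <= n")
    w = observation_weights(1.0, beta, n)
    return sum(w[-m:]) / sum(w)


flat = observation_weights(2.0, 1.0, 4)       # beta = 1: all weights equal alpha
last_only = observation_weights(2.0, 0.0, 3)  # beta = 0: only the last draw counts
```

For β < 1 the share equals (1 - β^m) / (1 - β^n), so for β close to 1 a number of draws of order 1 / (1 - β) is needed before the recent observations carry a substantial part of the weight.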

By means of (7), together with (2) and (5), we get

ψ_{n + 1} - ψ_{n} = - \frac{(1 - β)}{r_{n + 1}^{*}} b (ψ_{n} - p) + \frac{α}{r_{n + 1}^{*}} (ξ_{n + 1} - ψ_{n}) .

(9)

Setting

Δ M_{n + 1} = ξ_{n + 1} - ψ_{n}

and letting

ϵ_{n} = b (1 - β) / r_{n + 1}^{*}

and

δ_{n} = α / r_{n + 1}^{*}

, from (9) we obtain

ψ_{n + 1} - ψ_{n} = - ϵ_{n} (ψ_{n} - p) + δ_{n} Δ M_{n + 1} .

(10)
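The step leading to (9) can be spelled out as follows. This is only a sketch of the algebra, using (7), (2) and (5); here \mathbf{b} denotes the vector (b_1, …, b_k)^⊤, b = |\mathbf{b}| its total mass and p = \mathbf{b} / b.

```latex
\begin{aligned}
r_{n+1}^{*}\, ψ_{n+1} &= \mathbf{b} + β B_{n} + α ξ_{n+1}
  = (1 - β)\, \mathbf{b} + β\, r_{n}^{*} ψ_{n} + α ξ_{n+1}
  && \text{since } B_{n} = r_{n}^{*} ψ_{n} - \mathbf{b}, \\
r_{n+1}^{*} &= (1 - β)\, b + β\, r_{n}^{*} + α
  && \text{since } | B_{n} | = r_{n}^{*} - b, \\
r_{n+1}^{*}\, (ψ_{n+1} - ψ_{n}) &= (1 - β)\, (\mathbf{b} - b\, ψ_{n}) + α\, (ξ_{n+1} - ψ_{n})
  = - (1 - β)\, b\, (ψ_{n} - p) + α\, (ξ_{n+1} - ψ_{n}) .
\end{aligned}
```

Dividing the last line by r_{n+1}^{*} gives (9).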

3. Main Result

Consider the RP urn with parameters

α > 0

,

β \in [0, 1)

,

b > 0

and

B_{0}

such that

| B_{0} | = r (β) = α / (1 - β)

. Consequently, the total number of balls in the urn along the time-steps is constantly equal to

r^{*} (β) = b + r (β)

and if we denote by

ψ^{(β)} = {(ψ_{n}^{(β)})}_{n}

the predictive means corresponding to the fixed value

β

, we have the dynamics

ψ_{n}^{(β)} - ψ_{n - 1}^{(β)} = - ϵ (β) (ψ_{n - 1}^{(β)} - p) + δ (β) Δ M_{n}^{(β)},

(11)

where

ϵ (β) = \frac{b {(1 - β)}^{2}}{α + b (1 - β)}, δ (β) = \frac{α (1 - β)}{α + b (1 - β)}

(12)

and

Δ M_{n}^{(β)} = ξ_{n}^{(β)} - ψ_{n - 1}^{(β)}

. Note that we have

ϵ (β) \sim c δ {(β)}^{2}

for

β \to 1

, with

c = b / α > 0

. Finally, we define

X^{(β)} = {(X_{t}^{(β)})}_{t \geq 0}

, where

X_{t}^{(β)} = ψ_{⌊ t / {(1 - β)}^{2} ⌋}^{(β)} \Leftrightarrow X_{t}^{(β)} = ψ_{n - 1}^{(β)}, t \in [(n - 1) {(1 - β)}^{2}, n {(1 - β)}^{2}) .

(13)
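The coefficients in (12) and the relation ϵ(β) ∼ c δ(β)^2 can be checked numerically with the sketch below. The function names and the parameter values are hypothetical. From (12) one finds that the ratio ϵ(β) / (c δ(β)^2) equals 1 + (b / α)(1 - β), which tends to 1 as β → 1.

```python
def epsilon(beta, alpha, b):
    """epsilon(beta) from Eq. (12)."""
    return b * (1.0 - beta) ** 2 / (alpha + b * (1.0 - beta))


def delta(beta, alpha, b):
    """delta(beta) from Eq. (12)."""
    return alpha * (1.0 - beta) / (alpha + b * (1.0 - beta))


def ratio_to_limit(beta, alpha, b):
    """epsilon / (c * delta^2) with c = b / alpha; tends to 1 as beta -> 1."""
    c = b / alpha
    return epsilon(beta, alpha, b) / (c * delta(beta, alpha, b) ** 2)


ratios = [ratio_to_limit(beta, alpha=2.0, b=3.0) for beta in (0.9, 0.99, 0.999)]
```

Note also that ϵ(β) + δ(β) = 1 - β, so each step of (11) is a convex combination of ψ_{n - 1}^{(β)}, p and ξ_{n}^{(β)}.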

The following result holds true:

Theorem 1.

Suppose that

X_{0}^{(β)}

weakly converges towards some random vector

X_{0}

when

β \to 1

. Then, for

β \to 1

, the family of stochastic processes

{X^{(β)}, β \in [0, 1)}

weakly converges towards the k-alleles Wright—Fisher diffusion

X = {(X_{t})}_{t \geq 0}

, with type-independent mutation kernel given by

p

and with dynamics

d X_{t} = - b \frac{X_{t} - p}{α} d t + Σ (X_{t}) d W_{t},

(14)

with

Σ (X_{t}) Σ {(X_{t})}^{⊤} = \mathrm{diag} (X_{t}) - X_{t} {X_{t}}^{⊤}

and

\mathbf{1}^{⊤} Σ (X_{t}) = \mathbf{0}^{⊤}

, that is,

Σ {(X_{t})}_{i j} = \begin{cases} 0 & \text{if } X_{t, i} X_{t, j} = 0 \text{ or } i < j, \\ \sqrt{X_{t, i} \frac{\sum_{l = i + 1}^{k} X_{t, l}}{\sum_{l = i}^{k} X_{t, l}}} & \text{if } i = j \text{ and } X_{t, i} X_{t, j} \neq 0, \\ - X_{t, i} \sqrt{\frac{X_{t, j}}{\sum_{l = j}^{k} X_{t, l} \sum_{l = j + 1}^{k} X_{t, l}}} & \text{if } i > j \text{ and } X_{t, i} X_{t, j} \neq 0 . \end{cases}

(15)
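The explicit factor (15) can be verified numerically. The sketch below builds Σ(x) for a point x of the simplex, using 0-based indices, so that it can be compared with diag(x) - x x^⊤ and its column sums checked. The function name `sigma_matrix` and the test point are hypothetical.

```python
import math


def sigma_matrix(x):
    """Lower-triangular Sigma(x) of Eq. (15) for a point x of the simplex."""
    k = len(x)
    # tail[i] = x[i] + ... + x[k - 1], with tail[k] = 0
    tail = [0.0] * (k + 1)
    for i in range(k - 1, -1, -1):
        tail[i] = tail[i + 1] + x[i]
    sigma = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            if x[i] * x[j] == 0.0:
                continue  # first case of Eq. (15)
            if i == j:
                sigma[i][i] = math.sqrt(x[i] * tail[i + 1] / tail[i])
            else:
                sigma[i][j] = -x[i] * math.sqrt(x[j] / (tail[j] * tail[j + 1]))
    return sigma


x = [0.2, 0.3, 0.5]
sigma = sigma_matrix(x)
```

For this point the first column is (0.4, -0.15, -0.25)^⊤ and the last diagonal entry vanishes, in line with the constraint that the columns of Σ sum to zero.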

Proof.

Fix a sequence

(β_{n})

, with

β_{n} \in [0, 1)

and

β_{n} \to 1

. The sequence of processes

{X^{(β_{n})}, n \in N}

is bounded, hence we have to prove the tightness of the sequence in the space

D^{k} [0, \infty)

of right-continuous functions with the usual Skorohod topology, and the characterization of the law of the unique limit process.

For any

f \in C_{b}^{2}

, define

\begin{aligned}
γ_{n}^{(β, f)} (x) &= {\hat{A}}^{(β)} f ((n - 1) {(1 - β)}^{2}) (x) \\
&= E \Big[ \frac{f (X_{n {(1 - β)}^{2}}^{(β)}) - f (X_{(n - 1) {(1 - β)}^{2}}^{(β)})}{{(1 - β)}^{2}} \,\Big|\, X_{(n - 1) {(1 - β)}^{2}}^{(β)} = x \Big] \\
&= E \Big[ \frac{f (ψ_{n}^{(β)}) - f (ψ_{n - 1}^{(β)})}{{(1 - β)}^{2}} \,\Big|\, ψ_{n - 1}^{(β)} = x \Big] \\
&\overset{\text{by (11)}}{=} \frac{1}{{(1 - β)}^{2}} \Big( E \Big[ f (x) + \sum_{i} \frac{\partial f}{\partial x_{i}} (x) \big( - ϵ (β) (x_{i} - p_{i}) + δ (β) Δ M_{n, i}^{(β)} \big) \\
&\qquad + \frac{1}{2} δ {(β)}^{2} \sum_{i j} \frac{\partial^{2} f}{\partial x_{i} \partial x_{j}} (x) Δ M_{n, i}^{(β)} Δ M_{n, j}^{(β)} + O ({(1 - β)}^{3}) \,\Big|\, F_{n - 1} \Big] - f (x) \Big) \\
&= - \frac{b}{α + b (1 - β)} \sum_{i} \frac{\partial f}{\partial x_{i}} (x) (x_{i} - p_{i}) + \frac{1}{2} \frac{α^{2}}{{(α + b (1 - β))}^{2}} \sum_{i j} \frac{\partial^{2} f}{\partial x_{i} \partial x_{j}} (x) (x_{i} 𝟙_{i = j} - x_{i} x_{j}) \\
&\qquad + O (1 - β) .
\end{aligned}

(16)

We note that, for any

f \in C_{b}^{2}

, the partial derivatives in (16) are uniformly bounded, as

x

belongs to the compact simplex

S = {x_{i} \geq 0, \sum_{i} x_{i} = 1}

. The family

{γ_{n}^{(β, f)} (x), n \in N, β < 1, x \in S}

is then uniformly integrable. Thus, as a consequence of [26] (Theorem 4) (or [27] (ch. 7.4.3, Theorem 4.3, p. 236)), we have that the sequence of processes

{X^{(β_{n})}, n \in N}

is tight in the space of right-continuous functions with the usual Skorohod topology. Since, for any n and t,

X_{t}^{(β_{n})} \in S

, then

\mathbf{1}^{⊤} Σ (X_{t}) = \mathbf{0}^{⊤}

. Moreover, the generator of the limit process is determined by the limit

\begin{aligned} A f (t) (x) &= \lim_{n \to \infty} γ_{⌊ t / {(1 - β_{n})}^{2} ⌋}^{(β_{n}, f)} (x) \\ &= - \frac{b}{α} \sum_{i} \frac{\partial f}{\partial x_{i}} (x) (x_{i} - p_{i}) + \frac{1}{2} \sum_{i j} \frac{\partial^{2} f}{\partial x_{i} \partial x_{j}} (x) (x_{i} 𝟙_{i = j} - x_{i} x_{j}) . \end{aligned}

Hence, the weak limit of the sequence of the bounded processes

X^{(β_{n})}

is the diffusion process

d X_{t} = - b \frac{X_{t} - p}{α} d t + Σ (X_{t}) d W_{t}, \quad Σ (X_{t}) Σ {(X_{t})}^{⊤} = \mathrm{diag} (X_{t}) - X_{t} {X_{t}}^{⊤} .

The expression (15) follows from [28] (Corollary 3). □
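Theorem 1 suggests approximating a WF trajectory by the time-rescaled predictive means of an RP urn with β close to 1. A minimal sketch of this idea follows. The function name, the parameter values and the starting point `psi0` are hypothetical, and the update is written in the convex form β ψ + ϵ(β) p + δ(β) ξ, which is equivalent to (11) because ϵ(β) + δ(β) = 1 - β. No claim is made here about the accuracy of the approximation for a given β.

```python
import random


def rp_predictive_means(p, alpha, b, beta, psi0, t_max, seed=0):
    """Path of psi_n^(beta) under Eq. (11), n = 0..floor(t_max / (1 - beta)^2)."""
    rng = random.Random(seed)
    k = len(p)
    one_minus = 1.0 - beta
    eps = b * one_minus ** 2 / (alpha + b * one_minus)  # epsilon(beta), Eq. (12)
    dlt = alpha * one_minus / (alpha + b * one_minus)   # delta(beta), Eq. (12)
    n_steps = int(t_max / one_minus ** 2)
    psi = list(psi0)
    path = [list(psi)]
    for _ in range(n_steps):
        # The next draw has color i with probability psi_{n-1, i}.
        color = rng.choices(range(k), weights=psi)[0]
        # Convex form of Eq. (11): beta * psi + eps * p + dlt * xi.
        psi = [
            beta * psi[i] + eps * p[i] + (dlt if i == color else 0.0)
            for i in range(k)
        ]
        path.append(list(psi))
    return path


path = rp_predictive_means([0.75, 0.25], alpha=1.0, b=1.0, beta=0.9,
                           psi0=[0.5, 0.5], t_max=1.0)
```

According to (13), the value of X_t^{(β)} is then `path[int(t / (1 - beta) ** 2)]`.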

Remark 1 (Limiting ergodic distribution).

Since the simplex has dimension

k - 1

with respect to the Lebesgue measure, it is convenient to change the notations. Let

T^{k - 1}

be the

k - 1

-dimensional simplex defined by

T^{k - 1} : = {y \in R^{k - 1} : y_{1} \geq 0, \dots, y_{k - 1} \geq 0, 1 - y_{1} - y_{2} - \dots - y_{k - 1} \geq 0},

where, with the old definition, we have

x_{i} = y_{i}, i < k

and

x_{k} : = 1 - y_{1} - y_{2} - \dots - y_{k - 1}

. Obviously, there is a one-to-one natural correspondence between

T^{k - 1}

and the simplex

{x \in R^{k} : x_{1} \geq 0, \dots, x_{k} \geq 0, \sum_{i} x_{i} = 1}

defined by

y = (y_{1}, \dots, y_{k - 1}) ⟷ (y_{1}, \dots, y_{k - 1}, 1 - y_{1} - y_{2} - \dots - y_{k - 1}) = (x_{1}, \dots, x_{k - 1}, x_{k}) = x .

The Markov diffusion process

X_{t}

in (14) may be redefined as

Y_{t} = (X_{t, 1}, \dots, X_{t, k - 1})

on

y \in T^{k - 1}

with the corresponding generator

L f (y) = - \frac{b}{α} \sum_{i = 1}^{k - 1} \frac{\partial f}{\partial y_{i}} (y) (y_{i} - p_{i}) + \frac{1}{2} \sum_{i, j = 1}^{k - 1} \frac{\partial^{2} f}{\partial y_{i} \partial y_{j}} (y) (y_{i} 𝟙_{i = j} - y_{i} y_{j}) .

(17)

The Kolmogorov forward equation for the density

p (y, t)

of the limiting process

Y_{t}

is

\begin{aligned} \frac{\partial}{\partial t} p (y, t) = \frac{1}{2} \Big( & \frac{2 b}{α} \sum_{i = 1}^{k - 1} \frac{\partial}{\partial y_{i}} (p (y, t) (y_{i} - p_{i})) \\ & + \sum_{i = 1}^{k - 1} \frac{\partial^{2}}{\partial y_{i}^{2}} (y_{i} (1 - y_{i}) p (y, t)) - 2 \sum_{1 \leq i < j \leq k - 1} \frac{\partial^{2}}{\partial y_{i} \partial y_{j}} (y_{i} y_{j} p (y, t)) \Big) . \end{aligned}

(18)

Therefore, it is not hard to show that the limit invariant ergodic distribution is

p (y) = \frac{1}{B (2 \frac{b}{α} p)} {(1 - y_{1} - \dots - y_{k - 1})}^{\frac{2 b (1 - p_{1} - \dots - p_{k - 1})}{α} - 1} \prod_{i = 1}^{k - 1} y_{i}^{\frac{2 b p_{i}}{α} - 1},

(19)

because it satisfies (18) (see also [29]). The above distribution is the Dirichlet distribution

Dir (2 \frac{b}{α} p)

as a function of

x = (y, 1 - y_{1} - \dots - y_{k - 1})

.
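A draw from the limiting Dirichlet law can be sketched with the standard construction by normalized gamma variables. The function names are hypothetical, and all components of p are assumed strictly positive, since the gamma sampler needs positive shape parameters.

```python
import random


def stationary_parameters(p, alpha, b):
    """Parameters 2 (b / alpha) p of the Dirichlet law in Remark 1."""
    return [2.0 * b / alpha * p_i for p_i in p]


def sample_stationary(p, alpha, b, rng):
    """One draw from Dir(2 (b / alpha) p) via normalized gamma variables."""
    gammas = [rng.gammavariate(a, 1.0) for a in stationary_parameters(p, alpha, b)]
    total = sum(gammas)
    return [g / total for g in gammas]


rng = random.Random(1)
params = stationary_parameters([0.75, 0.25], alpha=1.0, b=1.0)
x = sample_stationary([0.75, 0.25], alpha=1.0, b=1.0, rng=rng)
```

The mean of this Dirichlet law is p, while its spread is governed by 2 b / α: the larger α / b, the more mass sits near the boundary of the simplex.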

Remark 2 (Transition density of the limit process).

The transition density

p (y_{0}, y; t)

is defined by

P (Y_{t} \in S | Y_{0} = y_{0}) = \int_{S \cap T^{k - 1}} p (y_{0}, y; t) d y

and it can be represented in terms of series of orthogonal polynomials [30] as shown in [31]. Moreover, we refer to [9,32,33] for the explicit form of the reproducing kernel orthogonal polynomials.

4. Recessive and Dominant Colors in an RP Urn with $β$ near to 1

Let

\mathcal{J} = \{J_{1}, \dots, J_{k_{J}}\}

be a partition of

\{1, \dots, k\}

, in that

J_{l} \neq Ø

,

J_{i_{1}} \cap J_{i_{2}} = Ø for i_{1} \neq i_{2}

, and

\cup_{l = 1}^{k_{J}} J_{l} = \{1, \dots, k\}

. Here

k_{J}

denotes the cardinality of

\mathcal{J}

. Define the

k_{J}

-dimensional objects

{(ψ_{n}^{(β, J)})}_{n}

,

{(ξ_{n}^{(β, J)})}_{n}

and

p^{(J)}

as

\left. \begin{aligned} ψ_{n, i}^{(β, J)} &= \sum_{l \in J_{i}} ψ_{n, l}^{(β)} \\ ξ_{n, i}^{(β, J)} &= \sum_{l \in J_{i}} ξ_{n, l}^{(β)} \\ p_{i}^{(J)} &= \sum_{l \in J_{i}} p_{l} \end{aligned} \right\} \quad \text{for } i = 1, \dots, k_{J},

and

X_{t}^{(β, J)} = ψ_{⌊ t / {(1 - β)}^{2} ⌋}^{(β, J)}

. With these definitions, from (11), we immediately get that

{(ψ_{n}^{(β, J)})}_{n}

is a

k_{J}

-dimensional RP urn following the dynamics

ψ_{n}^{(β, J)} - ψ_{n - 1}^{(β, J)} = - ϵ (β) (ψ_{n - 1}^{(β, J)} - p^{(J)}) + δ (β) (ξ_{n}^{(β, J)} - ψ_{n - 1}^{(β, J)})

(20)

and that Theorem 1 holds for

X_{t}^{(β, J)}

. Consequently, the convergence to the Wright—Fisher diffusion still holds if we group together some components of the process. For instance, when we consider two groups of components, we have the following result:

Corollary 1.

Let

\mathcal{J} = \{J, J^{c}\}

with

J \neq Ø

,

J^{c} \neq Ø

. Under the hypothesis of Theorem 1, each component of the sequence of processes

X_{t}^{(β, J)}

converges, for

β \to 1

, to the one-dimensional diffusion process with values in

[0, 1]

that satisfies the SDE

d X_{t, i}^{(J)} = - b \frac{X_{t, i}^{(J)} - p_{i}^{(J)}}{α} d t + {(- 1)}^{i + 1} \sqrt{X_{t, i}^{(J)} (1 - X_{t, i}^{(J)})} d W_{t} .

In addition,

X_{t, 1}^{(J)} = \sum_{l \in J} X_{t, l}

and

X_{t, 2}^{(J)} = \sum_{l \in J^{c}} X_{t, l}

.
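The coefficients in Corollary 1 can be read off from the generator of the limit process in Theorem 1. As a sketch, write Z_t = \sum_{l \in J} X_{t, l} for the first grouped component; summing the drift of (14) and the entries of diag(X_t) - X_t X_t^⊤ over the indices in J gives:

```latex
\begin{aligned}
\sum_{l \in J} \Big( - \frac{b}{α} (X_{t, l} - p_{l}) \Big)
  &= - \frac{b}{α} \Big( Z_{t} - \sum_{l \in J} p_{l} \Big), \\
\sum_{l, m \in J} \big( X_{t, l} 𝟙_{l = m} - X_{t, l} X_{t, m} \big)
  &= Z_{t} - Z_{t}^{2} = Z_{t} (1 - Z_{t}) .
\end{aligned}
```

The first line is the drift and the second the squared diffusion coefficient of the one-dimensional SDE; the second grouped component equals 1 - Z_t, which accounts for the opposite sign of its noise term.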

Now, if we further specialize the grouping choice to

\mathcal{J} = (\{i\}, \{1, \dots, i - 1, i + 1, \dots, k\})

, we get:

Corollary 2.

Under the conditions of Theorem 1 the i-th component of the sequence of processes

X^{(β)}

converges, for

β \to 1

, to the one-dimensional diffusion

{(X_{t, i})}_{t \geq 0}

with values in

[0, 1]

satisfying the SDE

d X_{t, i} = - b \frac{X_{t, i} - p_{i}}{α} d t + \sqrt{X_{t, i} (1 - X_{t, i})} d W_{t} .

For instance, the above two results are useful in order to translate the well-known classification of the boundaries of the WF process with mutation [34] (p. 239, Example 8) (see also [35]) to the RP urn model when the parameter

β

is strictly smaller than 1, but very near to 1. Indeed, Corollary 1 implies that

Z_{t} = \sum_{l \in J} X_{t, l}

satisfies the SDE

\begin{aligned} d Z_{t} &= - b \frac{Z_{t} - \sum_{l \in J} p_{l}}{α} d t + \sqrt{Z_{t} (1 - Z_{t})} d W_{t} \\ &= \Big( - \frac{b}{α} \Big( 1 - \sum_{l \in J} p_{l} \Big) Z_{t} + \frac{b}{α} \sum_{l \in J} p_{l} (1 - Z_{t}) \Big) d t + \sqrt{Z_{t} (1 - Z_{t})} d W_{t} . \end{aligned}

Setting

a_{0} = \frac{b}{α} \sum_{l \in J} p_{l}

and

a_{1} = \frac{b}{α} - a_{0}

and noting that

\cap_{i \in J} {X_{t, i} = 0} = {Z_{t} = 0}

, we obtain:

(1): $a_{0} < 1 / 2$ , i.e., $\sum_{l \in J} p_{l} < \frac{α}{2 b}$ , if and only if $P (\exists t : \cap_{i \in J} {X_{t, i} = 0}) = 1$ ;
(2): $a_{0} \geq 1 / 2$ , i.e., $\sum_{l \in J} p_{l} \geq \frac{α}{2 b}$ , if and only if $P (\exists t : \cap_{i \in J} {X_{t, i} = 0}) = 0$ .

With the same spirit, Corollary 2 states that

Z_{t} = 1 - X_{t, i}

satisfies the SDE

\begin{aligned} d Z_{t} &= - b \frac{Z_{t} - \sum_{l \neq i} p_{l}}{α} d t + \sqrt{(1 - Z_{t}) Z_{t}} d W_{t} \\ &= \Big( - \frac{b}{α} p_{i} Z_{t} + \frac{b}{α} (1 - p_{i}) (1 - Z_{t}) \Big) d t + \sqrt{Z_{t} (1 - Z_{t})} d W_{t} . \end{aligned}

Setting

a_{0} = \frac{b}{α} (1 - p_{i})

and

a_{1} = \frac{b}{α} - a_{0}

, we get:

(3): $a_{0} < 1 / 2$ , i.e., $p_{i} > 1 - \frac{α}{2 b}$ , if and only if $P (\exists t : {X_{t, i} = 1}) = 1$ ;
(4): $a_{0} \geq 1 / 2$ , i.e., $p_{i} \leq 1 - \frac{α}{2 b}$ , if and only if $P (\exists t : {X_{t, i} = 1}) = 0$ .

Therefore, for an RP urn with

β < 1

, but very near to 1, we can give the following definition:

Definition 1.

We call recessive a non-empty subset

J ⊊ {1, \dots, k}

of colors such that

\sum_{l \in J} p_{l} < \frac{α}{2 b}

. We call dominant a color

i \in {1, \dots, k}

such that

\{1, \dots, k\} \setminus \{i\}

is recessive.
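Definition 1 translates directly into a check on the model parameters. The sketch below uses hypothetical function names and 0-based color indices, and assumes k ≥ 2.

```python
def is_recessive(J, p, alpha, b):
    """True if the non-empty proper subset J of colors satisfies Definition 1."""
    k = len(p)
    J = set(J)
    if not J or len(J) >= k or not J <= set(range(k)):
        raise ValueError("J must be a non-empty proper subset of the colors")
    return sum(p[l] for l in J) < alpha / (2.0 * b)


def dominant_colors(p, alpha, b):
    """Colors i whose complement {0, ..., k - 1} minus {i} is recessive."""
    k = len(p)
    return [i for i in range(k) if is_recessive(set(range(k)) - {i}, p, alpha, b)]


# Two colors with alpha / b = 1 and p = (0.75, 0.25).
dominant = dominant_colors([0.75, 0.25], alpha=1.0, b=1.0)
```

With these values the second color alone has p = 0.25 < α / (2 b) = 0.5, so it forms a recessive set and the first color is dominant.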

Obviously, every subset of a recessive set is recessive. Moreover, when

\frac{α}{b} > 2 (1 - {min}_{i} p_{i})

, every set

J ⊊ {1, \dots, k}

is recessive. The terms “recessive” and “dominant” are justified by the fact that, recalling properties (1)–(4) of the WF process, if a set of colors is recessive, then we can observe that at some times the corresponding predictive means of the urn process are very near to zero. On the contrary, when a color is dominant, we can observe that at some times the corresponding predictive mean of the urn process is very near to one. In Figure 2, we plot the process

(ψ_{n, 1})

related to the simulation of an RP urn with

k = 2

,

α / b = 1

and

p = 0.75

, where it is possible to observe the excursions near the barrier 1.

Figure 2. Simulation: plot of the process

(ψ_{n, 1})

related to the simulation of an RP urn with

k = 2

,

α / b = 1

and

p = 0.75

.

5. Conclusions

We have proven that the multidimensional WF diffusion with mutation can be obtained as the limit of the predictive means associated with a family of RP urns with

β < 1

,

β \to 1

. As a consequence, the known properties of the WF process can give a description of the RP urn when the parameter

β

is strictly smaller than 1, but very near to 1. For instance, starting from the known classification of the boundaries for the WF process and connecting it to the model parameters of the RP urn, we have obtained for an RP urn with a value of

β

very near to one, the notion of recessive subsets of colors and the notion of a dominant color. These two concepts are related to the possibility of reaching the barriers 0 and 1 by the predictive means of the urn process. Other classical problems, together with the corresponding known results for the WF process, can be found in [31]. These results can be used in order to give an approximated answer to the considered problems in the case of an RP urn with a value of

β

near 1.

Author Contributions

Both authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

Irene Crimaldi is partially supported by the Italian “Programma di Attività Integrata” (PAI), project “TOol for Fighting FakEs” (TOFFE) funded by the IMT School for Advanced Studies Lucca. This research received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme under grant agreement No 817257.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Acknowledgments

Both authors sincerely thank the organizers of the present special issue for their invitation to contribute and Fabio Saracco for having collected and shared with them the analyzed Twitter data sets. Giacomo Aletti is a member of the Italian group “Gruppo Nazionale per il Calcolo Scientifico” of the Italian Institute “Istituto Nazionale di Alta Matematica”. Irene Crimaldi is a member of the Italian group “Gruppo Nazionale per l’Analisi Matematica, la Probabilità e le loro Applicazioni” of the Italian Institute “Istituto Nazionale di Alta Matematica”.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Eggenberger, F.; Pólya, G. Über die Statistik verketteter Vorgänge. ZAMM-J. Appl. Math. Mech./Z. Angew. Math. Mech. 1923, 3, 279–289.
2. Mahmoud, H.M. Pólya Urn Models; Texts in Statistical Science Series; CRC Press: Boca Raton, FL, USA, 2009.
3. Aletti, G.; Crimaldi, I. The Rescaled Pólya Urn: Local reinforcement and chi-squared goodness of fit test. Adv. Appl. Probab. Available online: https://iris.imtlucca.it/handle/20.500.11771/19197#.YZNznboRVPZ (accessed on 1 November 2021).
4. Aletti, G.; Crimaldi, I.; Saracco, F. A model for the Twitter sentiment curve. PLoS ONE 2021, 16, e0249634.
5. Aletti, G.; Crimaldi, I. Generalized Rescaled Pólya urn and its statistical applications. arXiv 2021, arXiv:2010.06373.
6. Chen, Y.; Skiena, S. Building sentiment lexicons for all major languages. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Short Papers), Baltimore, MD, USA, 22–27 June 2014; pp. 383–389.
7. Chakraborty, K.; Bhattacharyya, S.; Bag, R. A Survey of Sentiment Analysis from Social Media Data. IEEE Trans. Comput. Soc. Syst. 2020, 7, 450–464.
8. Favaro, S.; Ruggiero, M.; Walker, S.G. On a Gibbs sampler based random process in Bayesian nonparametrics. Electron. J. Stat. 2009, 3, 1556–1566.
9. Griffiths, R.C.; Spanò, D. Diffusion processes and coalescent trees. In Probability and Mathematical Genetics, Papers in Honour of Sir John Kingman; Bingham, N.H., Goldie, C.M., Eds.; LMS Lecture Note Series; Cambridge University Press: Cambridge, UK, 2010; Volume 378, pp. 358–375.
10. Mena, R.; Ruggiero, M. Dynamic density estimation with diffusive Dirichlet mixtures. Bernoulli 2016, 22, 901–926.
11. Walker, S.G.; Hatjispyros, S.J.; Nicoleris, T. A Fleming-Viot process and Bayesian nonparametrics. Ann. Appl. Probab. 2007, 17, 67–80.
12. Costantini, C.; De Blasi, P.; Ethier, S.; Ruggiero, M.; Spanò, D. Wright-Fisher construction of the two-parameter Poisson-Dirichlet diffusion. Ann. Appl. Probab. 2017, 27, 1923–1950.
13. Bollback, J.P.; York, T.L.; Nielsen, R. Estimation of 2N_es from temporal allele frequency data. Genetics 2008, 179, 497–502.
14. Gutenkunst, R.N.; Hernandez, R.D.; Williamson, S.H.; Bustamante, C.D. Inferring the Joint Demographic History of Multiple Populations from Multidimensional SNP Frequency Data. PLoS Genet. 2009, 5, e1000695.
15. Malaspinas, A.S.; Malaspinas, O.; Evans, S.N.; Slatkin, M. Estimating allele age and selection coefficient from time-serial data. Genetics 2012, 192, 599–607.
16. Schraiber, J.; Griffiths, R.C.; Evans, S.N. Analysis and rejection sampling of Wright-Fisher diffusion bridges. Theor. Popul. Biol. 2013, 89, 64–74.
17. Williamson, S.H.; Hernandez, R.; Fledel-Alon, A.; Zhu, L.; Bustamante, C.D. Simultaneous inference of selection and population growth from patterns of variation in the human genome. Proc. Natl. Acad. Sci. USA 2005, 102, 7882–7887.
18. Zhao, L.; Lascoux, M.; Overall, A.D.J.; Waxman, D. The characteristic trajectory of a fixing allele: A consequence of fictitious selection that arises from conditioning. Genetics 2013, 195, 993–1006.
19. Dangerfield, C.; Kay, D.; Burrage, K. Stochastic models and simulation of ion channel dynamics. Procedia Comput. Sci. 2010, 1, 1587–1596.
20. Dangerfield, C.E.; Kay, D.; MacNamara, S.; Burrage, K. A boundary preserving numerical algorithm for the Wright–Fisher model with mutation. BIT Numer. Math. 2012, 5, 283–304.
21. Chaleyat-Maurel, M.; Genon-Catalot, V. Filtering the Wright–Fisher diffusion. ESAIM Probab. Stat. 2009, 13, 197–217.
22. Papaspiliopoulos, O.; Ruggiero, M. Optimal filtering and the dual process. Bernoulli 2014, 20, 1999–2019.
23. Delbaen, F.; Shirakawa, H. An interest rate model with upper and lower bounds. Asia-Pac. Financ. Mark. 2002, 9, 191–209.
24. Gourieroux, C.; Jasiak, J. Multivariate Jacobi process with application to smooth transitions. J. Econom. 2006, 131, 475–505.
25. Jenkins, P.A.; Spanò, D. Exact simulation of the Wright-Fisher diffusion. Ann. Appl. Probab. 2017, 27, 1478–1509.
26. Kushner, H.J. Approximation and Weak Convergence Methods for Random Processes, with Applications to Stochastic Systems Theory; MIT Press Series in Signal Processing, Optimization, and Control; MIT Press: Cambridge, MA, USA, 1984; Volume 6.
27. Kushner, H.J.; Yin, G.G. Stochastic Approximation and Recursive Algorithms and Applications, 2nd ed.; Applications of Mathematics; Springer: New York, NY, USA, 2003; Volume 35.
28. Tanabe, K.; Sagae, M. An Exact Cholesky Decomposition and the Generalized Inverse of the Variance-Covariance Matrix of the Multinomial Distribution, with Applications. J. R. Stat. Soc. Ser. B (Methodol.) 1992, 54, 211–219.
29. Wright, S. Evolution and the Genetics of Populations, Volume 2: Theory of Gene Frequencies; Evolution and the Genetics of Populations; University of Chicago Press: Chicago, IL, USA, 1984.
30. Dunkl, C.F.; Xu, Y. Orthogonal Polynomials of Several Variables, 2nd ed.; Encyclopedia of Mathematics and Its Applications; Cambridge University Press: Cambridge, UK, 2014; Volume 155.
Aletti, G.; Crimaldi, I. The rescaled Pólya urn and the Wright—Fisher process with mutation. arXiv 2021, arXiv:2110.01853. [Google Scholar]
Griffiths, R.C.; Spanò, D. Orthogonal polynomial kernels and canonical correlations for Dirichlet measures. Bernoulli 2013, 19, 548–598. [Google Scholar] [CrossRef]
Griffiths, R.C.; Spanò, D. Multivariate Jacobi and Laguerre polynomials, infinite-dimensional extensions, and their probabilistic connections with multivariate Hahn and Meixner polynomials. Bernoulli 2011, 17, 1095–1125. [Google Scholar] [CrossRef]
Karlin, S.; Taylor, H.M. A Second Course in Stochastic Processes; Subsidiary of Harcourt Brace Jovanovich; Academic Press, Inc.: New York, NY, USA; London, UK, 1981. [Google Scholar]
Huillet, T. On Wright–Fisher diffusion and its relatives. J. Stat. Mech. Theory Exp. 2007, 2007, P11006. [Google Scholar] [CrossRef]

Figure 1. “COVID-19 epidemic” Twitter data set: the black line is the process $(ψ_{n,1})_n$, reconstructed from the data and rescaled in time as $t = n(1 - β)^2$; the red line is a simulated trajectory of the Wright–Fisher process; the orange line is the approximation of this trajectory by means of the RP urn; and the blue line is the approximation of the data process by means of the standard Pólya urn. The numbers $0$, $0.35$ and $0.5$ refer to the values chosen for the threshold $T$. The corresponding estimated values for $1 - β$ are: 0.000776 ($8 \times 10^{-4}$), 0.00115 ($11 \times 10^{-4}$) and 0.00130 ($13 \times 10^{-4}$).

Figure 2. Simulation: plot of the process $(ψ_{n,1})$ related to the simulation of an RP urn with $k = 2$, $α / b = 1$ and $p = 0.75$.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
