A Novel Approach for Calculating Exact Forms of mRNA Distribution in Single-Cell Measurements

: Gene transcription is a stochastic process manifested by ﬂuctuations in mRNA copy numbers in individual isogenic cells. Together with mathematical models of stochastic transcription, the massive mRNA distribution data that can be used to quantify ﬂuctuations in mRNA levels can be ﬁtted by P m ( t ) , which is the probability of producing m mRNA molecules at time t in a single cell. Tremendous efforts have been made to derive analytical forms of P m ( t ) , which rely on solving inﬁnite arrays of the master equations of models. However, current approaches focus on the steady-state ( t → ∞ ) or require several parameters to be zero or inﬁnity. Here, we present an approach for calculating P m ( t ) with time, where all parameters are positive and ﬁnite. Our approach was successfully implemented for the classical two-state model and the widely used three-state model and may be further developed for different models with constant kinetic rates of transcription. Furthermore, the direct computations of P m ( t ) for the two-state model and three-state model showed that the different regulations of gene activation can generate discriminated dynamical bimodal features of mRNA distribution under the same kinetic rates and similar steady-state mRNA distribution.


Introduction
Gene transcription is a central biological process that transcribes genetic information stored in DNA into messenger RNA (mRNA) molecules.This process undergoes a series of complex interplays among transcription factors, RNA polymerases, chromatin modifying enzymes, and so on to control the complement of mRNA in the cell [1,2].mRNAs are often randomly produced because of low copy numbers of biochemical molecules participating in their random collisions.The stochasticity of gene transcription has been manifested by observed pulsatile bursts, with each episode of transcriptional activity interrupted by irregular gene-off periods, which in turn leads to mRNA level fluctuations in an isogenic cell population [3][4][5].
The transcriptional fluctuation is conventionally quantified by measuring mRNA distribution histogram data for the gene of interest at the single-cell level [3,6] (Figure 1), which can be performed by counting the number of mRNA molecules in individual cells using single-molecule RNA fluorescence in situ hybridization (smFISH) for selected genes with single-transcript sensitivity [7,8] or exploiting the single-cell RNA sequencing (scRNA-seq) technology for abundances of genes of the whole transcriptome [4,9].The histogram data provide a statistics base for approximations of mass function P m (t) that denotes the probability of producing m copies of mRNA molecules in one cell at time t, and the transcription noise defined as the variance of mRNA copy numbers divided by mean squared [7,[10][11][12].The histogram and mass function P m (t) depict a panoramic view of mRNA distribution, whereas the noise quantifies the deviation of mRNA numbers in individual cells from the mean.When combined, these concepts may also signify a phenotypic trend in the cell population [1,[13][14][15].Most mRNA distributions observed in single-cell RNA-sequencing (seq) data are of three types [6,16]: Decaying distribution (Figure 1a), as shown by the tetO promoter in mammalian cells [10], for which P m decreases in m; unimodal distribution (Figure 1b), as shown by the Mbln2 gene in mouse fibroblast cells [4], for which P m peaks uniquely exist at m > 0; bimodal distribution (Figure 1c), as shown by E. coli genes [17], for which P m has two peaks: there is a peak at m = 0 and the another peak at m > 1.A decaying or unimodal distribution with low noise may suggest a phenotypic homogeneity, while a bimodal distribution with high noise supports a binary process that steers cells into sub-populations with distinct cell identities to implement the cellular "bet-hedging" strategy in response to environmental stresses [6,[18][19][20].When combined with mathematical models, the computation and analysis of mass function P m (t) facilitate the understanding of how the different gene regulation mechanisms and system parameters influence mRNA distribution profiles for both prokaryotic and eukaryotic genes [7,[21][22][23].In stochastic transcription models, the calculation for the exact forms of P m (t) is transformed to solve the corresponding chemical master equations that involve continuous variable time t and discrete random variable m of mRNA copy number [24][25][26].By introducing the generating function V(z, t) with two variables, z and t, and multiplying the master equations by (1 + z) m , the master equations can be transformed into a system of partial differential equations of V(z, t) [2,24].Thereafter, if V(z, t) can be solved from the partial differential equations, then P m (t) is obtained by the reverse calculation Tremendous efforts have been made to understand the steady-state cases of mathematical models for which the system of partial differential equations of V(z, t) is reduced to a system of ordinary differential equations as t → ∞.The expressions of lim t→∞ V(z, t) and lim t→∞ P m (t) can be solved in the form of hypergeometric functions or singular integrals [2, 10,16], which have facilitated good fits to the three distribution modes in Figure 1.When time t is finite, the scenario becomes more interesting as inducible genes in E. coli, yeast, and mammalian cells can exhibit all the decaying, bimodal, and unimodal distributions over time [8,17,21,27].However, the dynamic transition patterns among dif-ferent distribution modes and the duration and significance of each dynamic distribution mode are tightly regulated by different regulation mechanisms [23] and system parameter regions [24].
In this study, our aim was to calculate the exact expressions for the dynamic mRNA distribution P m (t) of mathematical models.In this case, we can still transform partial differential equations of V(z, t) into a system of ordinary differential equations through the method of characteristics [24,28].However, a standard method to obtain V(z, t) is lacking owing to the presence of variable coefficients in the transformed ordinary differential equations.This dilemma limits the calculation of V(z, t) only within a few of the limiting parameter regions, such as zero values of the mRNA degradation rate [29] and gene inactivation rate [30], or fast switching between two promoter states such that the two states can be mathematically approximated as a single state [23,31].
In this study, we sought to develop a new approach to solve V(z, t) for larger system parameter regions of mathematical models where the parameters are positive and finite.We can transform the calculation of V(z, t) by solving a classical higher-order ordinary differential equation.This breakthrough allows us to express V(z, t) as a closed form of hypergeometric functions, which facilitates the computation of P m (t) at each time point through the reverse formula (1) by numerical simulation [32].We illustrate the approach for the widely used three-state model and the corresponding master equations in Section 2. We provide a detailed derivation of V(z, t) in Section 3. Several simplifications of V(z, t) are discussed in Section 4, for which the exact forms of P m (t) within special parameter regions can be derived.Finally, we provide a brief discussion of our results and their potential biophysical implications in Section 5.

Mathematical Models and Master Equations
One of the most classical models in the study of gene transcription is the so-called twostate model that has been utilized to capture the stochastic bursty transcription [4-6,10,33].In the two-state model the genes are suggested to transition randomly between off (inactive) and on (active) states.The mRNA molecules are newly produced only when genes are in the on state and are degraded following certain process.Each biochemical process in the two-state model (2) is determined by a single rate-limiting steps.
A central problem in studying stochastic gene transcription is understanding the regulation of gene random switching between off and on states under varying external conditions [3,22,[33][34][35].The technique of transcriptional real-time imaging allows the measurements of off and on periods for each gene within the whole timescale, which has generated massive distribution data for the duration periods of off and on states.This revealed the universal exponentially distributed on period for both prokaryotic and eukaryotic genes [1,36,37], which supported the control of a single rate-limiting step in gene inactivation in the two-state model (2).However, modeling a single rate-limiting step directing gene on may not be sufficient to capture the ordered multiple biochemical reactions that mediate the recruitment of transcriptional activators, chromatin modification enzymes, RNA polymerases and so on [1,5].Many other models has been developed based on the two-state model (2) to capture the multiple sequential steps in the gene activation [2, 31,35,36,38].
We illustrate our approach using the simplest framework of multi-step gene activation for which the gene is activated by two ordered rate-limiting steps.This two-step activation has been utilized to capture the data of non-exponential distribution with a unique peak for gene off period, as observed for tetA gene and P lac/ara promoter in E. coli cells under certain temperatures and stresses [35,39,40], and 16 mouse fibroblast genes under Trichostatin A (TSA) treatments [36].By introducing two-step gene activation, the two-state model (2) is generalized to the three-state model [9,36,41,42].As shown in Figure 2, each process in the three-state model is described by a single rate-limiting step, with the duration for each state being independent and exponentially distributed with constant rates.The model consists of random rotation between two sequential gene off states (states 1 and 2) and one on state (state 3) with activation rates λ 1 , λ 2 and inactivation rate γ, in coupling with the simple birth and death processes of mRNAs with synthesis rate v and degradation rate δ.The analytical expressions of the steady-state mRNA distribution and its mean and noise for the three-state model were derived [2, 41,42].However, it still lacks the exact forms for the dynamical mRNA distribution P m (t).Let P m,i (t), i = 1, 2, 3 denote the probabilities that the gene resides at state i with m mRNA copy numbers being produced at time t.Then Following the standard procedure, the calculation of P m,i (t) can be transformed to solve the chemical master equations with inactive genes and zero transcripts at the initial time [41]: P 0,1 (0) = 1, P 0,2 (0) = P 0,3 (0) = 0, and P m,i = 0 for m ≥ 1 and i = 1, 2, 3.
The system (3)-( 6) cannot be solved by iteration as the evolution of P m,i (t) is related to P m+1,i (t), P m−1,i (t), and itself.The traditional method introduces generating functions which transform the system (3)-( 6) to partial differential equations [2]: System ( 8)-( 11) possesses a unique solution according to the theory of partial differential equations [28].However, the exact expression of the solution remains elusive.

Expressing Generating Function V (z, t)
When time t → ∞, the expression of lim t→∞ V(z, t) can be expressed by a single generalized hypergeometric function 2 F 2 [2], denoted by 2 F 2 (x 1 , x 2 ; y 1 , y with When the time is finite, the next result shows that V(z, t) takes more complicated expression compared to the steady-state case.
Theorem 1.The generating function V(z, t) given by ( 7) can be expressed in a closed form Here A, B and C are given by Proof.By the method of characteristics [24,28], we let z 0 be a parameter, and Then Equations ( 8)-( 10) can be transformed into the initial value problems To transform this system further, we define By using the chain rule, we find u i (t) = δxw i (x) for i = 1, 2, 3, and transform ( 15)-( 18) to a more tractable problem Next, we proceed by extracting third-order linear equations for w 3 and w = w 1 + w 2 + w 3 from (20)- (22).Let a, b, c and d be real constants, and y = y(x) be a smooth function of x.We introduce the linear operator We claim that w 3 and w are separately the unique solutions for the following initial value problems and The verification of the initial conditions is trivial.To confirm L λ1 +1, λ2 +1 δl +1, δs +1 (w 3 ) = 0, we first multiply Equation ( 22) by x λ2 and then take derivatives.It gives A substitution of Equation ( 21) cancels out the first two terms in the right hand side and simplifies the identity to In turn, we multiply ( 27) by x λ1 and then take derivatives.We get Substitution of Equation ( 20) into (28) cancels out the last two terms in the left hand side and simplifies the identity to According to the definition (12), we have From ( 29) and ( 30) it follows immediately that To confirm L λ1 , λ2 δl , δs (w) = 0, we first derive w (x) = w 3 (x) by adding Equations ( 20)-( 22), and then use L λ1 +1, λ2 +1 δl +1, δs +1 (w 3 ) = 0 to find L λ1 , λ2 δl , δs This implies that L λ1 , λ2 δl , δs (w) is a constant.Be applying initial values ( 25)-( 26) and the relation w (x) = w 3 (x), we find To proceed, it is crucial to solve the initial value problem (26).Once the solution w(x) is obtained, V(z, t) follows immediately by substituting x = vz and z 0 = ze −δt into w(x).

Probability Mass Function P m (t)
The complicated expression V(z, t) given by ( 13) does not permit a direct utilization of (1) to derive P m (t).However, when the parameters fall within some special regions, the hypergeometric function 2 F 2 can be dramatically simplified, which may allow the calculation of P m (t) using (1).

Inactive Gene with γ = 2δ and λ 1 + λ 2 = δ
Denoted by T off and T on the durations of gene off and on states, respectively.When the parameters fall within the region γ = 2δ and λ 1 + λ 2 = δ, there is which suggests the inactive gene with T off > T on .
Theorem 2. If γ = 2δ and λ 1 + λ 2 = δ, then generating function V(z, t) can be expressed as and the probability mass function P m (t) can be expressed as We remark that if λ 1 = λ 2 , the expression (57) still hold if we take the derivative λ 1 → λ 2 .

Two-State Model with λ 1 → ∞
When λ 1 is sufficiently large, the transition from gene off 1 to off 2 states in the threestate model (Figure 2) can be neglected, and the model collapses into the classical two-state model (2), as illustrated for yeast FLO11 gene [37].In this subsection, we shall derive the exact form for the generating function V(z, t) of the two-state model (2).If γ = δ is further assumed, reverse Formula (1) can be used to derive P m (t).

Discussions and Conclusions
In this study, we developed an approach for calculating V(z, t) and P m (t) for large parameter regions.We illustrated the approach for the three-state model (Figure 2) and successfully obtained the dynamic expressions of V(z, t) in a closed-form of hypergeometric functions (Theorem 1).In addition to the numerical simulation for computing P m (t) using (1), we can still simplify the expression of V(z, t) within several special parameter regions to derive the integral forms of P m (t) by elementary functions.The simplicity of those forms may facilitate us to prove in mathematical detail the dynamic profile transitions of mRNA distribution [24].We discussed two special parameter regions in Theorems 2 and 3.The first parameter region presents inactive genes with a longer duration of gene off than that of gene on, which has been confirmed for thousands of eukaryotic genes [22,36] and prokaryotic genes under low induction [17,45].The second parameter region presents the classical two-state model with arbitrary durations of the gene off state.This case has been discussed in other studies that exhibited rich dynamic P m (t) profiles [24,32].Moreover, we noticed that V(z, t) readily gives P 0 (t) = V(−1, t), which quantifies the probability of zero mRNA being produced.The index P 0 (t) quantifies the typical "none-or-all" phenomenon and signifies the cell fate decision in an isogenic cell population [17,18,45].Our work may facilitate the computation of P 0 (t) to describe HIV latency [18], transcription dynamics of the E. coli promoter P lac/are [17], and lysogen stability from the fate-determining cI gene [45].
We computed V(z, t) and P m (t) using the obtained analytical expressions.In Figure 3a, we generated three dynamical curves of P 0 (t) = V(−1, t) using V(z, t) expressions ( 13) and (75) under three parameter groups, including the average off duration T off (Equation ( 55)), which takes T off = 25 min with λ 2 = 2λ 1 for P lac/ara promoter in E. coli cells under acidic shift [40]; T off = 70 min with λ 2 = 7λ 1 for mouse fibroblast genes under TSA treatments [36]; and T off = 50 min with λ 1 λ 2 (two-state model) for yeast FLO11 gene under the addition of ethanol [37].These dynamic curves exactly match the corresponding curves generated by the stochastic simulation algorithm (SSA), suggesting the correctness of our V(z, t) expressions.In Figure 3b,c, we further generated distribution profiles of P m (t) using Formula (1) by differentiating expressions of V(z, t) in Theorems 2 and 3, and confirmed their correctness using SSA.We emphasize here that the expression of V(z, t) for the two-state model was obtained in [32] by another method that shows a different sign "-" compared with our expression (75).As an independent check using SSA (Figure 3a) verifies the correctness of our expression (75), the expression of V(z, t) obtained in [32] needs to be corrected.13) and (75) under three parameter groups-δ = 0.014, γ = 0.167 (min −1 ) and T off = 25 min with λ 2 = 2λ 1 for E. coli P lac/ara promoter [17,40]; δ = 0.0109, γ = 0.167 (min −1 ) and T off = 70 min with λ 2 = 7λ 1 for mouse fibroblast genes [22,36]; δ = 0.063, γ = 0.05 (min −1 ) and T off = 50 min with λ 1 λ 2 for yeast FLO11 gene [11,37] We further compared P m (t) dynamic profiles separately generated by different models.We first found that the steady-state mRNA distribution data of mouse Mbln2 gene [4] and E. coli P lac/ara promoter [17] (Figure 1b,c) can be well fitted by both the two-state model and three-state model under the same rescaled kinetic rates γ/δ, v/δ and T off δ with λ 1 = nλ 2 for different n ≥ 1 (Figure 4).When n increases, the three-state model gradually reduces to the two-state model (n → ∞).We then generated P m (t) profiles at different time points using both models equipped with the estimated kinetic rates and δ = 0.0109 min −1 for Mbln2 gene [22] and δ = 0.014 min −1 for P lac/ara promoter [17].As shown in Figure 4a, there exhibits an ordered distribution transition pattern among decaying, bimomal, and unimodal as time develops.However, the two-state model generates the longest duration of 120 min for the intermediate bimodality compared with the decreasing bimodal duration with respect to n for the three-state model.This observation suggests that the increase in gene activation step number may facilitate the suppression of heterogeneous bimodal distribution while enhancing the formation of homogeneous unimodal distribution.Such bimodal suppression was also observed in the transcription of promoter P lac/ara for which the dynamic mRNA distribution transition from decaying to bimodality occurs as early as 40 min for the two-state model and as late as 100 min for the three-state model with λ 1 = λ 2 (Figure 4b).
The approach presented here may be further implemented in other models with constant kinetic rates of transcription, especially for the multiscale model [31] and crosstalking pathway model [11], which also includes three functional gene states.Both models are different from the framework shown in Figure 2 as the multiscale model describes a change in the gene on state upon production of mRNA to explicitly describe RNA polymerase dynamics; the cross-talking pathways model assumes two parallel rate-limiting gene activation steps to simulate competitive binding of transcription factors at the cognate DNA sites.By introducing the generating functions (7) and variables ( 14) and ( 19), the master equations [11,31] for the multiscale model or cross-talking pathway model can be transformed into canonical forms of third-order hypergeometric differential equations, as in (24) and (26).Then, using the method of undetermined coefficients (34), the analytical expressions for generating functions V(z, t) for both models may be derived.Furthermore, the protocol of our approach may be developed for models with more than three gene states, where gene activation is regulated by multiple parallel [46,47] or sequential [2,48] rate-limiting steps, or their combinations.However, the complexity of the calculation for P m (t) is expected to increase dramatically with respect to the gene state number.
The models mentioned above may provide a universal framework to illustrate stochastic gene activation for both prokaryotic and eukaryotic genes.However, the assumption of an exponential decay of mRNA molecules in these models may fit the straight log-scale lines of degrading mRNA levels against time for E. coli genes [33], but may fail to capture much more complicated mRNA degradation processes for eukaryotic genes [49].A detailed model of eukaryotic mRNA degradation models in which the mRNA molecule decays either through the pathway of 5 → 3 direction or the pathway of 3 → 5 direction [50]: the 5 → 3 pathway consists of multiple rate-limiting steps representing the ordered PolyA-deadenylation, decapping, and the involvement of many intracellular factors; the 3 → 5 pathway is directed by a single rate-limiting step to degrade mRNA from the unprotected 3 ends by exosomes.Future work may include the calculation of P m (t) for models of stochastic mRNA degradation and illustrate how transcript degradation influences transcription dynamics.Author Contributions: J.C. performed research; F.J. designed and wrote the manuscript.All authors have read and agreed to the published version of the manuscript.

Figure 1 .
Figure 1.Three modes of mRNA distribution.(a) Decaying distribution for which the mass function P m gradually decreases in mRNA copy number m = 0, 1, • • • .(b) Unimodal distribution that P m has a unique peak at m > 0. (c) Bimodal distribution that P m takes two peaks: one at m = 0 and the other at m > 0. mRNA distribution data (blue bars) and their fits by P m (red lines) of the two-state model (2) are shown for the (a) tetO promoter in CHO cells [10], (b) Mbln2 gene in mouse fibroblast cells [4], and (c) promoter P lac/ara in E. coli cells [17].

Figure 4 .
Figure 4. Discriminated dynamic bimodal features of mRNA distribution generated by the two-state model and three-state model.(a) The steady-state mRNA distribution data of mouse Mbln2 gene [4] (blue bars) can be well fitted by both models with (T off δ, γ/δ, v/δ) = (0.71, 4.74, 38.8), and λ 1 = nλ 2 with finite n for the three-state model and n → ∞ for the two-state model.Dynamic distribution profiles with δ = 0.0109 min −1 [22] display an intermediate dynamic bimodal distribution with the longest duration of 120 min for the two-state model, but decreasing durations with respect to n for the three-state model.(b) The steady-state mRNA distribution data of E. coli P lac/ara promoter [17] (blue bars) can be well fitted by both models with (T off δ, γ/δ, v/δ) = (2.53,0.22, 13) and λ 1 = nλ 2 .Dynamic distribution profiles with δ = 0.014 min −1 [17] display a dynamic distribution transition from decaying to bimodality, with the transition time as early as 40 min for the two-state model and as late as 100 min for the three-state model with λ 1 = λ 2 .