SiCaSMA: An Alternative Stochastic Description via Concatenation of Markov Processes for a Class of Catalytic Systems

The Chemical Master Equation is a standard approach to model biochemical reaction networks. It consists of a system of linear differential equations, in which each state corresponds to a possible configuration of the reaction system, and the solution describes a time-dependent probability distribution over all configurations. The Stochastic Simulation Algorithm (SSA) is a method to simulate sample paths from this stochastic process. Both approaches are only applicable for small systems, characterized by few reactions and small numbers of molecules. For larger systems, the CME is computationally intractable due to a large number of possible configurations, and the SSA suffers from large reaction propensities. In our study, we focus on catalytic reaction systems, in which substrates are converted by catalytic molecules. We present an alternative description of these systems, called SiCaSMA, in which the full system is subdivided into smaller subsystems with one catalyst molecule each. These single catalyst subsystems can be analyzed individually, and their solutions are concatenated to give the solution of the full system. We show the validity of our approach by applying it to two test-bed reaction systems, a reversible switch of a molecule and methyltransferase-mediated DNA methylation.


Introduction
Different approaches exist to model (bio)chemical reaction networks. On one end of the spectrum, quantum-theoretical approaches provide insights into the dynamics of few-atom systems using nearly no approximations of the underlying physics. Force-field driven particle simulations approximate the positions and velocities of atoms or molecules deterministically and are therefore capable of simulating significantly larger systems. By assuming a well-stirred, isothermal and isobaric system, one can eliminate the spatial component completely, thus further increasing the feasible system size.
Without the spatial dependency, the only time-dependent system state is characterized by the number of molecules of each chemical species at a given time. An established method to describe the behavior of such a system is the Chemical Master Equation (CME). It is a system of coupled linear ordinary differential equations whose solution describes a time-dependent probability distribution over the set of possible states of the system (see, e.g., Higham [1], Wilkinson [2], Schnoerr et al. [3] for reviews on stochastic modeling approaches for chemical reactions).
The CME suffers from the curse of dimensionality, i.e., computational complexity and memory requirement grow exponentially with the number of chemical species. Thus, exact solutions of the CME are rare and only possible for very small and simple systems. Moreover, depending on the system at hand, it can be tedious to even find a proper state definition and to list all possible states.

1.
It is not necessary to define the state transition graph of the entire system. This can be a real advantage, since the state transition graph can become quite large. This holds in particular for catalytic systems in which the substrate exists in many conformations. A prominent example is post-translational modification of a protein, e.g., phosphorylation at different sites. The nodes of the transition graph correspond here to all possible configurations of substrate phosphorylation states and catalyst binding states. Due to the combinatorial complexity, this number grows rapidly even for small molecule numbers. Moreover, it is for most of those systems not possible to exploit the underlying structure of the graph. 2.
Intractable state transition graphs are replaced by a concatenation of much simpler graphs, resulting in lower dimensional differential equations for the CME approach. Instead of solving the conventional CME defined on the full state transition graph, one can solve the CME on these simpler graphs and concatenate the obtained solutions. This effectively enables the solution of the CME for complex systems, where it was formerly necessary to resort to simulation methods.

3.
The implementation of the SSA is considerably simplified.
We show on different example systems, including a simple conversion reaction and a model for methyltransferase-mediated DNA methylation, that SiCaSMA is equivalent to the standard CME description of the full system.

Materials and Methods
The CME is a set of linear differential equations that describe the evolution of probability distribution P(X, t) over all possible system states X and time t: Here, j ∈ {1, ..., M} is the reaction index and ν j the state change vector of reaction j. Reaction propensities are denoted a j (X). Conditions in square brackets ensure that the summation is over feasible reactions only, where ν j = ν + j + ν − j is decomposed into positive and negative parts. Equation (1) can be cast into the forṁ with system matrix S and solution P(X, t) = e St P(X, t = 0).
Hence, solving the CME requires the calculation of the exponential of the product of the system matrix S and time t.
The SSA simulates sample paths from the CME. For each reaction step, this includes the realization of two random variables, an exponentially distributed waiting timet for the next reaction to happen, and a reaction index j to determine the type of reaction, as depicted in Algorithm 1.

5:
Draw reaction index j from discrete distribution J ∼ a j (X(t)) /a sum (X(t))

An Intuitive Example: Linear Conversion Process
As an intuitive example to illustrate the idea behind our approach, we consider the reversible conversion of molecules between two different forms with simple first order kinetics ( Figure 1A), i.e., reactions of the form This reaction system constitutes a special and very simple case of our system class. It represents, for example, reversible binding of a catalyst to a substrate, without modification of the substrate. The forms A and B represent free enzyme and enzyme substrate complex, respectively. When substrate is in excess, reaction rates only linearly depend on the amount of catalyst molecules and are of order zero with respect to the substrate, as assumed here. Especially for this simple system, it is intuitively clear that we can either describe the full system or consider each molecule independently and obtain equivalent solutions. We define the state X(t) of the underlying Markov process to be the number of molecules in state A. Thus, for n molecules, the range of X(t) is given by The propensities and state change vectors of the two reactions are given by The state transition graph of this reaction system for n = 2 is shown in Figure 1B (left). This system can alternatively be modeled by concatenating n single molecule systems with state spaces Y i (t) ∈ {0, 1}, i = 1, . . . , n ( Figure 1B (right)). In this case, the concatenation is equivalent to a summation Y(t) = ∑ n i=1 Y i (t). Next, we show that X(t) and Y(t) describe equivalent stochastic processes. We therefore solve the CME corresponding to each system analytically and compare the resulting probability distributions P(X, t) and P(Y, t). Regarding the original system, P(X, t) corresponds to the solution of the respective CME, which can be derived from the state transition graph ( Figure 1B) and readṡ Without loss of generality and for the sake of brevity, we choose k 1 = k −1 = 1 for the following calculations. Then, the solution of this set of linear differential equations is given by and for X(t = 0) = 2 Applying SiCaSMA, for the equivalent process, we solve each subsystem to obtain P(Y i , t), i = 1, 2 in a first step. The respective CME readṡ with solution This gives for Since , the probability distribution P(Y, t) is formally defined as a convolution of the probability distributions P(Y i , t). For n = 2, this results in P(Y, t) = P(Y 1 , t) * P(Y 2 , t) and can also be formulated in a verbose fashion P(X, t) and P(Y, t) are equivalent, as can be derived from the Equations (11) and (12) in comparison to the solution of the full system (Equation (8)) and seen in the courses in Figure 1C. We consider a reversible conversion reaction, in which the system state X is defined as the number of molecules in the first (blue) state. State change vectors and propensities are denoted ν 1 , ν −1 and a 1 (X), a −1 (X), respectively. (B) State transition graph of the full system (left) and both single catalyst subsystems (right). (C) Visual representation of the solution P(X, t) of the CME for the full system (left) and of the combined solution P(Y, t) of the two single catalyst subsystems (right).

Applying SiCaSMA to a Model for DNMT1-Mediated DNA Methylation
While the just-described system is rather simple in nature, it provides an intuitive application of our approach to general catalytic systems, which are assumed to be of the following form: The chemical species can be subdivided into two categories. C denotes the catalyst species and S the substrate species, respectively. Catalyst-driven conversions of the substrate, such as conformational changes or post-transcriptional modifications, are treated as individual species in this framework. The key assumption for our approach is that the catalyst molecules function independently of each other. In our system, substrate molecules S are modified by catalysts C. Our specific system comprises reversible complex formation of catalyst and substrate molecules and catalyst-mediated modifications of the substrate. If multiple modifications are possible, this is implemented as a counter, and denoted by S i for a single substrate molecule or CS i in complex form, respectively, While all catalyst and substrate molecules are simulated simultaneously in the classical approaches, we replace this procedure with consecutive simulations including all substrate molecules and only one catalyst molecule at a time. Both approaches differ in one major aspect: In the classical approach, each catalyst molecule faces the same substrate state at a given time, and if one catalyst molecule changes this substrate state, this change is directly visible for all other catalysts. Using SiCaSMA, however, a catalyst molecule initially faces the substrate state that results from all modifications of the previously simulated catalyst molecules. The equivalence of both procedures is therefore non-trivial. Moreover, unlike in the previous example, the subsystems cannot be simulated in parallel, since they indirectly depend on each other through the initial condition in the substrate state.
On this basis, we apply our method to a model for epigenetic regulation via DNA methylation, as described in Adam et al. [18]. In this system, the protein DNA methyltransferase 1 (DNMT1) can methylate DNA molecules at different sites and therefore serves as biocatalyst. The DNA strands serve as substrate molecules. Each DNA molecule contains 44 methylation sites in the original model. Analogously to Adam et al. [18], we model the binding reaction of DNMT1 and DNA with a propensity that increases linearly with the number of unbound DNMT1 molecules and is independent of the number of DNA molecules. This assumption comes from the fact that a DNA strand contains many (unspecific) sites at which DNMT1 can bind. Furthermore, each DNA molecule can accommodate arbitrarily many DNMT1 proteins, such that the DNMT1 species is the limiting factor here.
For an analytic CME solution, we have to simplify this model in order to obtain a tractable number of states. We do this by neglecting different DNMT1 conformations and consider a system consisting of n D = 1 DNA molecule with n S = 1 methylation site and n P = 2 DNMT1 molecules. This keeps the system size tractable and enables a clear illustration. The reaction scheme is shown in Figure 2A. DNMT1 can bind to and dissolve from the DNA. In the bound state, it can methylate the DNA processively. The state of the system is described by a two-dimensional vector, in which the first and second component correspond to the number of enzymes bound to the DNA molecule and the number of methylated sites of this DNA molecule, respectively. The first entry, therefore, serves as catalyst species C, while the second one represents the substrate species S. Methylation reactions are modeled as irreversible processes in our system. We obtain the following reaction propensities and state transition vectors: Here, the factor (1 − X 2/n S ) in a m (X, n S ) indicates the fraction of yet unmethylated sites. The state transition graph of the CME for the full system is depicted in Figure 2B (left). It comprises six possible states; hence, the CME is a six-dimensional linear differential equation system, which is given bẏ For the initial state X = (0, 0) T and k 1 = k −1 = k m = 1, the solution of Equation (14) is given by The stochastic process, which results from applying SiCaSMA to the system, is illustrated in Figure 2B (right). Here, the system is modeled by simulating the two DNMT1 molecules one after another while accumulating the methylation state. For each single catalyst subsystem indicated by p = 1, 2, we obtain a CME that comprises four states, The general solution of this equation is given by Using the matrix exponent e tS is given by All above equations hold for both single catalyst subsystems. However, the initial state of the second system depends on the solution of the first one. The state transition graph of the single catalyst subsystems ( Figure 2B (right)) consists of two classes. The transient class, which contains all states with unmethylated DNA, is enclosed by the left red box. Similarly, the absorbing class, which contains the states with methylated DNA, is depicted inside the right red box. Exchanging the first catalyst molecule by the second one corresponds to keeping the methylation state from the first subsystem and starting in the respective class in the DNMT1 unbound state, i.e., the DNMT1 protein is re-initialized. The probabilities within each class are therefore added up, which is indicated by the red plus-sign. The distribution of the initial state of the second subsystem reads where t f is the simulated time. The two states in which the catalyst molecule is bound have probability 0 for t = 0. From Y 1 , Y 2 and Y m , we construct the overall system state Y, which will be shown to be equivalent to X. We therefore define: As for the first example system, the probability distribution P(Y 1 + Y 2 , t) is defined as the convolution Using Equation (20), we construct all possible outcomes after two successive single catalyst subsystem simulations and compare the resulting probability P(Y, t) with P(X, t) of the full system in Equation (A3). An exemplary derivation of the equality for P X = (0, 1) T , t and the initial state X = (0, 0) T is detailed in Appendix A. Similar derivations can be undertaken for all six entries of the full CME solution. Figure 2C visualizes the equivalent analytic solutions P(X, t) and P(Y, t) obtained for the full system (left) and SiCaSMA (right), respectively.

Applying SiCaSMA to Larger Networks via the SSA
Next, we apply SiCaSMA to a larger network by implementing it into an SSA approach. Therefore, we consider the methylation system of Figure 2 and arbitrary n S , n D , n P . Pseudocode of the conventional SSA for the full system is depicted in Algorithm 2. Applying SiCaSMA (Algorithm 3), we initialize the number of methylated sites Y m = X init,2 . For the n P single catalyst systems, the state of the catalyst Y p is set to Y p = X init,1 , and sample paths are generated via the conventional SSA with a single catalyst molecule, n P = 1. Finally, the state vector X(t), which is a two-dimensional vector with X 1 (t) = ∑ n P i=1 Y p (t) and X 2 (t) = Y m (t), is returned. In summary, SiCaSMA SSA runs the conventional version of the SSA for each single catalyst subsystem and eventually combines the obtained solutions by treating the methylation state as a global variable throughout all single catalyst simulations and concatenating the catalyst states using a sum.
Algorithm 2 SSA DNA methylation system (X init , t f , n P , n S ) Initialize X = X init and set t = 0 2: while t < t f do Calculate A = (a 1 , a −1 , a m ) and a sum = ∑ j=1,2,3 A j

6:
Update if t < t f then 8: Update X = X + ν i end if 10: end while return X Algorithm 3 Single Catalyst SSA DNA methylation system (X init , t f , n P , n S ) Initialize Y m = X init,2 for p = 1, ..., n P do 3: A visual comparison of the outcome of the two algorithms is shown in Figure 3. Results in Figure 3A are obtained with n S = 10, n D = 2, n P = 10 and reaction rates k 1 = 2, k −1 = 1, k m = 1 2 . The estimated distributions are obtained using 10 5 sample paths for each of the simulation algorithms evaluated for t = 3.0. Similarly, Figure 3B presents the estimated distributions for an even larger system defined by n S = 100, n D = 5, n P = 100 and the same reaction rates as in Figure 3A. It can be seen that the probability distributions estimated with the SSA of the full system (Figure 3 (left)) and of SiCaSMA (Figure 3 (right)) are nearly identical. This becomes especially remarkable when recapitulating the dimension of the corresponding CME, which defines a probability for every single possible state of the system. Only considering the different possibilities of binding 100 DNMT1 molecules to five DNA strands renders a CME solution nearly impossible, not to mention the second, independent combinatorial explosion introduced via the different possible methylation states of the DNA. A convergence of SSA results such as the one observed in Figure 3 can therefore not be taken for granted.

Discussion
In this paper, we have introduced SiCaSMA, an alternative stochastic description for a class of catalytic systems in which catalysts act independently of each other on substrate molecules. Instead of considering the CME for the full system, SiCaSMA concatenates smaller subsystems which consist of all substrate molecules, but only one catalyst molecule each. The single catalyst subsystems are analyzed one after another. Hereby, the substrate state is inherited from actions of former subsystems, which is reflected in the initial conditions of the subsystem under consideration. Thus, the substrate state acts as a global variable. The states of the catalyst molecules in the subsystems are local variables, i.e., describe the state in the subsystems, and are re-initialized for each subsystem. By applying SiCaSMA, it is not necessary to characterize the state space of the full system. It is sufficient to perform calculations or simulations on the single catalyst subsystems, which usually have a much smaller state space. In the end, the state of the full system is reconstructed from the states of the individual single catalyst subsystems.
We have applied SiCaSMA to different systems. First, a simple reversible conversion reaction, in which a molecule can reversibly switch between two different states, was used to illustrate the approach. This system was described with simple first-order kinetics and represents an intuitive example for our approach. The partition into n individual subsystems is trivial here, but the example is well-suited to demonstrate the general idea of SiCaSMA. Second, we have considered DNMT1-mediated DNA methylation, which consists of reversible binding of DNMT1 to the DNA and processive methylation at different sites. For both systems and low numbers of variables, we have shown that SiCaSMA provides indeed an equivalent description of the underlying stochastic process. Since the dimensions of the single catalyst subsystems are smaller than those of the full systems, SiCaSMA can also be applied to calculate a solution of the full system when the full system is not feasible. However, reconstruction of the state of the full system also becomes more complicated in these cases, and a practically applicable general reconstruction scheme is still missing.
Beyond the specific methylation system discussed in the paper, we think that SiCaSMA can be applied to a broad range of biochemical reaction systems. Several literature models can be adopted to this form. Examples include mRNA transcription [19] and posttranscriptional modifications of proteins such as phosphorylation [20]. Moreover, we believe that it is also possible to extend the model class in several directions, e.g., regarding the number of different catalyst or substrate types and their numbers of different configurations. A more formal definition of the system class, which generalizes our results, is a challenging future task.   [24].

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

CME Chemical Master Equation SSA
Stochastic Simulation Algorithm SiCaSMA Single Catalyst Stochastic Modeling Approach DNMT1 DNA methyltransferase 1

Appendix A. Proof of Equivalence of X and Y for the DNA Methylation Model
We perform an exemplary derivation of equivalence for P X = (0, 1) T , t and the initial state X = (0, 0) T . Four additional algebraic identities are used during the derivation: as well as The relation between the states X and Y i is given by the convolution, i.e., P X = (0, 0) T , t = P (Y 1 , Y m ) T = (0, 0) T , t P (Y 2 , Y m ) T = (0, 0) T , t P X = (1, 0) T , t = P (Y 1 , Y m ) T = (0, 0) T , t P (Y 2 , Y m ) T = (1, 0) T , t Thus, for P X = (0, 1) T , t we get P X = (0, 1) T , t This expression is indeed equivalent to the fourth entry of Equation (15).