Research in systems biology has been increasingly supported by computational models of biochemical reaction networks. These models are studied either through a deterministic approach, using differential equations to represent the temporal changes of the concentrations of the chemical species, or via a stochastic approach based on the chemical master equation (CME) and solved through Monte Carlo simulation algorithms. Although the deterministic approach of solving a set of differential equations by numerical integration is fast and widely adopted, it is unable to estimate the variance of the species concentrations and can become inaccurate for systems with a small number of particles [1
]. In these situations, the stochastic approach of using a Monte Carlo simulation algorithm for evaluation of the CME is preferred.
The CME is a very high-dimensional differential equation that describes the evolution of the entire state space. Direct solutions of the CME are rare and apply only to very small systems. In practice, the CME is solved by applying a simulation algorithm that provides an exact solution in the Monte Carlo sense (i.e., by summation of many simulated trajectories). Gillespie derived an algorithm that satisfies this requirement [2
]. This is often referred to as the Gillespie algorithm, although Gillespie himself referred to it as the stochastic simulation algorithm
(SSA). In fact, Gillespie provided two variants of the algorithm, the direct method
and the first reaction method
, with the direct method being the most widely used. Gillespie’s SSA has been further improved for computational efficiency [3
] and there have been several implementations of these algorithms in software for computational systems biology. For example, there are software applications, such as COPASI [6
], VCell [7
], and StochPy [8
], which provide user-friendly platforms to create and simulate models using the SSA and include other features to further analyze the model and simulation results. Furthermore, there are other lightweight programs developed specifically to simulate models using the SSA, namely Dizzy [9
], Gillespie2 [10
], SGNS2 [11
], RoadRunner [12
] and pSSAlib [13
]. All these simulators require the full set of reactions—the reaction network—to be enumerated beforehand and thus are sometimes termed “network-based”.
There are several cases in which a biochemical network is very large or even unbounded. A particularly common occurrence is given by some signal transduction networks that contain proteins with multi-site phosphorylation, leading to combinatorial numbers of chemical species and reactions between them [14
]. Another case is that of the formation of polymers with an unlimited number of monomers. In order to model such systems, an approach has been developed in which sets of similar reactions are defined by rules that apply to sets of species specified by patterns [17
]. This formalism results in a concise model specification of the underlying chemical kinetics [20
]. The most common rule-based modeling languages are the BioNetGen language (BNGL) [17
] and Kappa [22
]. The BNGL simulator, BioNetGen [17
], operates by deriving the reaction network specified in the reaction rules and then applying the SSA for simulation. On the other hand, rule-based simulators such as KaSim, PySB [24
], RuleMonkey [25
] and NFsim [19
] carry out simulations directly on the basis of reaction rules without deriving the entire reaction network, and accordingly these have been termed “network-free”. At their core, all these simulators are based on Gillespie’s method, as rules are sampled at each time interval using a procedure equivalent to how reactions are sampled in the SSA.
In this review, we compare these two stochastic simulation approaches and several popular software implementations in the context of models with different complexity. The comparison addresses issues such as the number of particles, species, and reactions, as well as the length of the simulation.
2. Network-Based Approach
The stochastic formulation of chemical kinetics describes the time evolution of a well-stirred set of chemically interacting particles in thermal equilibrium within a fixed reaction volume [1
]. The time evolution of the number of particles of each species in the volume, on the basis of the probabilities of all reactions that can occur in the system, is driven by the CME. As already mentioned, the CME is rarely solved analytically, mostly because the number of its terms grows exponentially with the number of species in the system.
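In a common notation, with **x** the state vector of molecule counts, a_j(**x**) the propensity of reaction j, and ν_j its stoichiometric state-change vector, the CME reads:

```latex
\frac{\partial P(\mathbf{x},t)}{\partial t}
= \sum_{j=1}^{M} \Big[ a_j(\mathbf{x}-\boldsymbol{\nu}_j)\,
    P(\mathbf{x}-\boldsymbol{\nu}_j,t)
  - a_j(\mathbf{x})\, P(\mathbf{x},t) \Big]
```

One such equation exists for every reachable state **x**, which is why the number of terms grows exponentially with the number of species.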
An alternative to the analytical solution of the CME is to simulate the trajectories of molecular populations in exact accordance with the CME, as proposed by Gillespie [2
] (the SSA). Each trajectory corresponding to a single SSA run represents an exact sample from the distribution defined by the CME. The steps of SSA can be summarized as follows:
Initialize: Set the time and set up the initial state vector, propensities, and random number generators.
Execute: Using a suitable sampling procedure, generate random numbers and, on the basis of these, determine the next reaction to occur and the time interval.
Update: Update the molecule count, and if needed, recalculate the propensities. Output the system state.
Iterate: If simulation end time is not reached, go to step 2.
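The four steps above can be sketched as a minimal direct-method implementation (illustrative code only; the function and variable names are our own and not taken from any of the packages discussed):

```python
import math
import random

def ssa_direct(x, stoich, propensity, t_end, seed=0):
    """Minimal Gillespie direct-method sketch.

    x          -- dict of species counts, e.g. {"A": 100, "B": 0}
    stoich     -- list of state-change dicts, one per reaction
    propensity -- list of functions x -> propensity a_j(x)
    """
    rng = random.Random(seed)
    t = 0.0
    trajectory = [(t, dict(x))]
    while t < t_end:
        a = [f(x) for f in propensity]          # step 1: propensities
        a0 = sum(a)
        if a0 == 0.0:                           # no reaction can fire
            break
        r1, r2 = rng.random(), rng.random()
        tau = -math.log(r1) / a0                # time to next reaction
        # pick reaction j with probability a_j / a0
        target, acc, j = r2 * a0, 0.0, 0
        for j, aj in enumerate(a):
            acc += aj
            if acc >= target:
                break
        t += tau
        for species, change in stoich[j].items():  # apply state change
            x[species] += change
        trajectory.append((t, dict(x)))
    return trajectory

# Example: irreversible dimerization 2A -> B with mass-action kinetics
traj = ssa_direct(
    x={"A": 100, "B": 0},
    stoich=[{"A": -2, "B": +1}],
    propensity=[lambda x: 0.01 * x["A"] * (x["A"] - 1) / 2],
    t_end=10.0,
)
```

Averaging many such trajectories (with different seeds) yields the Monte Carlo estimate of the CME solution mentioned above.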
The two original, and statistically equivalent, sampling procedures for step 2 of the SSA are the direct method (DM) and the first reaction method (FRM) [2
]. The DM samples two random numbers from the uniform distribution in the unit interval; from the first, the time of the next reaction (τ) is generated according to the probability density of reaction times, and using the second, the DM then selects the index of the reaction to occur next. The FRM, using one random number per reaction, generates “tentative reaction times” (τ values) for all the reactions and then selects the reaction with the smallest τ. Because the FRM needs to generate many more random numbers per iteration than the DM (for systems with three or more reactions), the DM is generally the procedure implemented for the sampling in step 2 of the SSA [2].
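For comparison, a single FRM step can be sketched as follows (again illustrative, with hypothetical names); note that it draws one random number per reaction rather than two per iteration:

```python
import math
import random

def frm_step(x, propensity, rng):
    """One first-reaction-method step: draw a tentative time for every
    reaction and fire the one with the smallest tau (sketch)."""
    taus = []
    for j, f in enumerate(propensity):
        aj = f(x)
        if aj > 0.0:
            taus.append((-math.log(rng.random()) / aj, j))
    if not taus:
        return None, None  # system exhausted, no reaction possible
    tau, j = min(taus)     # earliest tentative reaction wins
    return tau, j
```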
Gibson and Bruck proposed the next reaction method (NRM) [4
] that can reduce the computational cost of the SSA significantly. In addition to using only one random number per iteration, the NRM reduces the time needed to update propensities and to find the smallest τ value by storing the τ values generated in previous iterations in an indexed priority queue, from which they are extracted whenever required. This results in a significant improvement in runtime performance when compared to the FRM. The algorithm is exact as well as efficient. For large reaction networks and loosely coupled reaction systems, the NRM is significantly faster than both the FRM and the DM. This advantage, however, may not be significant for small systems, as the computational cost of maintaining the additional data structures dominates the simulation time [5].
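A simplified sketch of the NRM idea follows (hypothetical names; unlike the real NRM, this version redraws the tentative times of dependent reactions instead of rescaling the stored ones, so it saves fewer random numbers, but it illustrates the priority-queue mechanics):

```python
import heapq
import math
import random

def nrm_sketch(x, stoich, propensity, depends, t_end, seed=0):
    """Simplified next-reaction-method sketch: absolute tentative firing
    times kept in a heap; after a firing, only the reactions listed in
    depends[j] (which must include j itself) get a fresh tentative time."""
    rng = random.Random(seed)
    t = 0.0
    def draw(j, now):
        aj = propensity[j](x)
        return now - math.log(rng.random()) / aj if aj > 0 else math.inf
    times = [draw(j, 0.0) for j in range(len(propensity))]
    heap = [(tj, j) for j, tj in enumerate(times)]
    heapq.heapify(heap)
    while heap:
        tj, j = heapq.heappop(heap)
        if tj != times[j]:        # stale entry (lazy deletion), skip
            continue
        if tj > t_end or tj == math.inf:
            break
        t = tj
        for s, d in stoich[j].items():   # apply state change
            x[s] += d
        for k in depends[j]:             # refresh dependent reactions
            times[k] = draw(k, t)
            heapq.heappush(heap, (times[k], k))
    return t, x
```

The `depends` structure is the dependency graph that gives the NRM its advantage on loosely coupled systems: only a few tentative times are touched per event.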
Other variants to accelerate the search for the next reaction in the SSA have been proposed, such as the optimized direct method (ODM) [5
], the sorting direct method (SDM) [26
], the partial-propensity direct method (PDM) [27
], and the SSA with composition rejection algorithm (SSA–CR) [28].
Besides the exact algorithms mentioned above, many others have been proposed that can accelerate the simulation even further, but they do this by adopting approximations and no longer provide exact solutions. A popular method is the τ-leaping algorithm [31], which does not simulate each reaction event individually but rather steps over a time span τ and estimates how many times each reaction has fired in the meantime. Many other variants of this and other approximations have been proposed, including hybrid methods that partition the system into a part that is simulated using differential equations and another that uses the SSA or one of its variants (see the review by Pahle [32]).
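A crude fixed-step sketch of the τ-leaping idea (illustrative only; production implementations select τ adaptively and guard properly against negative populations):

```python
import math
import random

def poisson(rng, lam):
    """Knuth's simple Poisson sampler (adequate for the small means here)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= L:
            return k - 1

def tau_leap(x, stoich, propensity, tau, n_steps, seed=0):
    """Fixed-step tau-leaping sketch: in each leap of length tau, every
    reaction fires a Poisson(a_j * tau) number of times.  This sketch
    merely clips counts at zero instead of rejecting bad leaps."""
    rng = random.Random(seed)
    for _ in range(n_steps):
        a = [f(x) for f in propensity]   # propensities at start of leap
        for j, aj in enumerate(a):
            if aj <= 0.0:
                continue
            k = poisson(rng, aj * tau)       # firings of reaction j
            for s, d in stoich[j].items():
                x[s] = max(0, x[s] + k * d)  # crude non-negativity clip
    return x
```

The speed-up comes from replacing many individual SSA events with one Poisson draw per reaction per leap, at the cost of exactness.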
Most of the stochastic simulators provide options to choose between the DM and the NRM, for example, COPASI [6
], StochPy [8
], and Dizzy [9
]. Other simulators use only one of these, with Gillespie2 [10
] and RoadRunner [12
] using only the DM, and SGNS2 [11
] using only the NRM. The pSSAlib software [13
] allows selection between the DM, the PDM, a sorting variant of the PDM (SPDM), and the SSA–CR. StochKit2 [33
] provides several of these, including the SSA–CR, but automatically selects which algorithm to use.
3. Network-Free Approach
To address the combinatorial complexity in biological signaling networks [14
], originating from multiple post-translational modifications and conformational changes, rule-based modeling approaches have been developed [15
]. At the core of these approaches are reaction rules
that represent groups of reactions. These rules refer to specific binding sites with or without specific ligands. Rules can also specify different states of a molecule (such as oxidized or reduced, phosphorylated or unphosphorylated, etc.). With rule-based modeling it is easy to specify, with a few rules, a complex set of combinatorial interactions in which several subunits can assemble into larger complexes and allow for modification of specific moieties. This type of model specification is therefore very useful for signal transduction networks in which these types of interactions are abundant.
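As an illustration (a hypothetical model fragment in BNGL-style syntax), the single binding rule below omits the two phosphorylation sites of the substrate S from its patterns, so it applies to S in all four of its phosphorylation states and thus stands for four concrete binding reactions:

```
begin molecule types
  K(s)                  # kinase with one binding site
  S(y1~U~P,y2~U~P,s)    # substrate: two phospho-sites, one kinase site
end molecule types
begin reaction rules
  # y1 and y2 are not mentioned in the patterns, so this one rule
  # covers every combination of their states
  K(s) + S(s) <-> K(s!1).S(s!1)  kon, koff
end reaction rules
```

With n modification sites the saving grows combinatorially: one rule in place of up to 2^n enumerated reactions.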
The BNGL [17] and Kappa [22] are examples of formalisms developed for biochemical rule-based modeling. While the BNGL can be processed by different software applications (BioNetGen [23
], DYNSTOC [25
], RuleMonkey [39
], and NFsim [19
]), the other languages are mostly restricted to being processed by a single software package. The BioNetGen software package expands a BNGL rule-based model to a reaction network, which is then simulated using a variety of deterministic and stochastic network-based methods. However, when a rule-based model can result in a large reaction network, the expansion as well as the simulation of such a network becomes computationally expensive. For such scenarios, the generation of the reaction network can be avoided by a network-free simulation approach.
DYNSTOC [25] uses an agent-based null-event stochastic simulation approach based on an earlier package, STOCHSIM [40
]. In this approach, each reactive molecular component is represented as a software object (agent), and these are tracked individually during the simulation. More specifically, at each fixed time increment, either one or two molecules are first chosen at random as reactants. The rules that qualify for an interaction between the chosen reactants are then shortlisted and, for the reaction with the highest probability, an update is performed using a graph-rewriting operation.
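The null-event idea can be sketched as follows (illustrative Python with hypothetical names; DYNSTOC's actual data structures and rule matching are more elaborate):

```python
import random

def null_event_step(agents, pair_react, p_react, rng):
    """STOCHSIM-style null-event sketch: pick two agents at random each
    fixed time step; they react with a precomputed probability, otherwise
    the step is a 'null event' and nothing changes."""
    i, j = rng.sample(range(len(agents)), 2)   # two distinct agents
    if rng.random() < p_react(agents[i], agents[j]):
        agents[i], agents[j] = pair_react(agents[i], agents[j])
        return True   # a reaction fired
    return False      # null event: state unchanged
```

Because many steps end as null events, the cost scales with the number of agents and time steps rather than with the number of possible reactions.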
Unlike DYNSTOC, which uses a fixed time-step, RuleMonkey [39
] has a variable time increment, and rules are represented as pattern graphs. The simulation procedure is similar to the SSA [2
], as the time increment and rule selection are based on the DM. Once the rule to execute next is chosen, potential reactants are selected on the basis of the pattern graphs and are then used to update the state of the system. As such, RuleMonkey uses iterative updates to track rule rates exactly, avoiding null events that do not change the state of the system being simulated [41
]. NFsim [19
] is another rule-based simulator, using a generalized algorithm [42
] that is also based on the SSA. In contrast to RuleMonkey, NFsim introduces null events in its implementation [41
]. While both RuleMonkey and NFsim have been shown to perform similarly over a wide parameter range [41
], NFsim has the additional capabilities of defining functional rates and coarse-grained rules. It uses an efficient representation of molecules, complexes, and rules as well as an optimized handling of reactant selection and transformation.
These network-free simulators, unlike the network-based SSA simulators, scale with the number of rules rather than the number of reactions and thus should be very efficient for systems in which a few rules can represent a large number of reactions [37
]. While this is true for systems with limited numbers of interacting particles, network-free simulators might not be as efficient for large particle numbers, given that they represent each particle individually. Particle-specific events, such as aggregation and polymerization, make the computational cost even higher. On the other hand, although the performance of network-based simulators depends on the number of reactions, it is not affected greatly by the number of molecules [37
]. As such, network-based simulators may be preferred for systems with large particle numbers and a moderate reaction network.
All of the approaches described have difficulties when there are large numbers of particles and a large reaction network. To address this situation, there have been efforts to develop hybrid methods [37
]. The hybrid particle-population-based approach [37
] is reported to be exact and efficient but requires a predefined partition of the system into network-free and network-based parts. Because these hybrid approaches rely on partitioning the model, it is important to identify the limits of both the network-based and network-free approaches, so that the different parts of the network could be identified automatically on the basis of this information [37].
Benchmarking several simulators based on different network-based or network-free algorithms has revealed interesting patterns. The benchmarks tested the scaling of the simulators' execution times in two ways: as a function of the number of particles in the system and as a function of the simulation end time. The first test was focused on exposing issues that depend on the number of particles in the system; the second was focused on exposing issues that arose closer to stable states (attractors). We note that most realistic simulations should start from steady states rather than the idealized state of only a few input molecules; biological systems are in stable states at the start of most experiments, which usually apply a perturbation forcing a transition between stable states (either different steady states or stable oscillations). Thus the behaviors over the longer times in the second test were rather important.
The results of both tests indicate that StochPy and DYNSTOC are the slowest implementations. In the case of StochPy, an SSA–DM implementation, the tool is designed to write the raw simulation output after each event rather than at a requested fixed-interval output; thus we suspect that it spends a significant amount of time writing (unnecessary) output. Additionally, this tool is written in Python, an interpreted language, which likely incurs a considerable time penalty as well. For DYNSTOC, the issue is rather different and is due to the algorithm used, in which each molecule is tracked as an agent. At very low molecule numbers for the simplest model, this tool was among the fastest, and it scaled linearly with the number of molecules. The problem is therefore that this approach cannot handle the larger numbers of molecules that were present in every other model tested. We conclude that such a pure agent-based approach is not competitive.
The results show that, surprisingly, the network-based approach was always the fastest, not only for small- and moderate-sized reaction networks, but also for the larger networks and under all the conditions tested. For the simpler models, with a relatively low number of species and with limited interactions, the lightweight simulators BioNetGen, Gillespie2, pSSAlib_SPDM, and SGNS2 were significantly faster than all the others tested. Gillespie2 became slower when simulating larger models, but BioNetGen, pSSAlib_SPDM, and SGNS2 remained very fast under all conditions tested.
An important aspect of rule-based modeling is the concept of “observables”, which are functions of the species’ abundances. BioNetGen outputs the values of the observables, but SGNS2 only outputs the species abundances; thus in order to obtain values of the observables, the output of SGNS2 would require a further data processing step; pSSAlib outputs data in separate files and requires a post-processing step for this purpose. Although COPASI is slower, mostly as a result of a large overhead in model loading and preparing data structures, it can also readily output the values of the observables. RoadRunner run times are approximately similar to COPASI’s, but it has the same problem of requiring post-processing as in SGNS2.
The network-based simulators tested spanned a range of different SSA approaches. Several were based on the DM (Dizzy, StochPy, Gillespie2, BioNetGen, and COPASI_D), while a couple used the NRM (COPASI_GB and SGNS2) and the CR (pSSAlib_SSACR and StochKit2); pSSAlib_SPDM, which uses the PDM, was found to be the fastest at the upper extremes of our tests. In the SPDM, while the time spent on factoring out reaction propensities did not pay off for small numbers of molecules or short simulations, it led to a significant efficiency gain when larger numbers of molecules were reached. Consequently, for the FcϵRI signaling model, pSSAlib_SPDM completed the simulation in half the time taken by BioNetGen in the most extreme case tested, a considerable speed-up. The efficient implementation of this method is, however, handicapped by the simulator's general usability. One setback is that pSSAlib does not accept arbitrary SBML files, but requires specific annotations that other software do not include. This is particularly problematic for rule-based models generated from the BNGL; such models usually have very large numbers of reactions, and adding the annotations manually is not practical.
As expected, the DM was observed to be less efficient than the NRM and CR. However, a comparison of the DM against the NRM within the same package (COPASI) showed that the difference is not large and is only apparent under conditions with large numbers of particles. One notable exception was BioNetGen, which, despite being a DM implementation, was the fastest at all times. This might be due to its implementation of the sorting variant of the DM. Further efficiency seems to have been obtained by various code optimizations (earlier versions of BioNetGen than the one tested here were much slower), but the same is true for SGNS2 (an NRM implementation). We noted, however, that BioNetGen uses the standard C runtime rand()
function, unlike SGNS2 and most other SSA implementations, which use the Mersenne Twister [52
], a much better-quality, albeit slower, pseudo-random number generator (see Appendix A
, Figure A7
for a comparison of the two). The dangers of using poor pseudo-random number generators are well known [53
], and this could be a concern here, particularly for long simulation times. Also surprising is that the SSA–CR implementations (both pSSAlib_SSACR and StochKit2) were not faster than the NRM, despite expectations to the contrary [28
]. The expected advantage of the SSA–CR did not materialize in models with large numbers of reactions, although it cannot be ruled out that this implementation is more efficient for other types of models not tested here.
What are the advantages of network-free simulation? While the network-free simulators were never the fastest with any of the models and conditions included here, there are clearly situations in which they are needed. The use of network-free simulation is inevitable when the derivation of the network is computationally challenging or impossible. For example, complete models of the ErbB-mediated activation of ERK and AKT [50
] and of early T-cell receptor signaling [51
] result in very large reaction networks, so large that BioNetGen is unable to generate them. Even if the network could be derived, loading it into simulators such as COPASI and RoadRunner would take a significantly long time. Network-free simulators are also essential for simulating systems with infinitely linking molecules, such as models of polymerization, models of trivalent ligand bivalent receptors (TLBRs), models of large complexes, and so forth. Among the network-free simulators tested, NFsim was generally the fastest. Under some of the less-demanding conditions tested (i.e., low molecule numbers and simpler models), RuleMonkey had a small advantage, but otherwise it is clear that NFsim is currently the best choice. NFsim also provides an option to define functional or conditional rate laws and complex rules. This capability is particularly useful for modeling systems whose rates are affected by the availability or unavailability of specific molecules. We also identified areas in which network-free simulators require further improvement. For example, with none of the network-free simulators were we able to simulate models containing reactions in which several of the bound moieties undergo a transformation (catalysis); these simulators aborted with errors when encountering the catalysis, even though BioNetGen easily generates the corresponding networks (as it should). This is a rather common situation, as it occurs in every enzyme mechanism (because the reaction between substrates happens while these are bound to the enzyme). One example was a model of the electron-transport chain [55
], and another was a model of the cap-binding complex in mRNA translation [56
]. RuleMonkey was unable to run the BCR signaling model, aborting with the error message “Non-binding bimolecular reaction”. These limitations in processing valid
BNGL rules affect both NFsim and RuleMonkey, but can hopefully be corrected in future versions of these packages.
The present analysis revealed that the network-based (SPDM) simulation was the most efficient method for all models tested. We had hoped to identify regimes in which network-free simulation would be more competitive, but found none among the models and conditions examined. This suggests that, while a rule-based specification of a model is much simpler than enumerating all the reactions, simulation via network-free implementations is not always efficient, unless the derivation of the network from the rules is computationally intractable or the model contains infinitely linking species. It is possible that larger models than those tested here (but with a finite number of reactions) may present conditions under which network-free simulation outperforms network-based simulation, but this is yet to be established. On the basis of the present results, we have to conclude that, for hybrid algorithms that integrate both of these approaches (e.g., [37
]), the only portions of a network that should be assigned to the network-free approach are those rules leading to the formation of infinite linking chains, while the rest of the network should be simulated using the network-based approach.