Hitting Times of Some Critical Events in RNA Origins of Life

Can a replicase be found in the vast sequence space by random drift? We partially answer this question through a proof-of-concept study of the times of occurrence (hitting times) of some critical events in the origins of life for low-dimensional RNA sequences using a mathematical model and stochastic simulation studies from Python software. We parameterize fitness and similarity landscapes for polymerases and study a replicating population of sequences (randomly) participating in template-directed polymerization. Under the ansatz of localization where sequence proximity correlates with spatial proximity of sequences, we find that, for a replicating population of sequences, the hitting and establishment of a high-fidelity replicator depends critically on the polymerase fitness and sequence (spatial) similarity landscapes and on sequence dimension. Probability of hitting is dominated by landscape curvature, whereas hitting time is dominated by sequence dimension. Surface chemistries, compartmentalization, and decay increase hitting times. Compartmentalization by vesicles reveals a trade-off between vesicle formation rate and replicative mass, suggesting that compartmentalization is necessary to ensure sufficient concentration of precursors. Metabolism is thought to be necessary to replication by supplying precursors of nucleobase synthesis. We suggest that the dynamics of the search for a high-fidelity replicase evolved mostly during the final period and, upon hitting, would have been followed by genomic adaptation of genes and to compartmentalization and metabolism, effecting degree-of-freedom gains of replication channel control over domain and state to ensure the fidelity and safe operations of the primordial genetic communication system of life.


Introduction
The origins of life, abiogenesis, is a matter of high importance, for it gives insight into the distribution of life in the universe. We focus on the RNA world hypothesis, where life began with self-replicating RNA molecules that can evolve under Darwinian evolution, following necessary conditions of compartmentalization and metabolism, for geometry and synthesis of nucleobases from metabolic precursors, respectively. Self-replicating sets of RNA were proposed first by Tibor Ganti [1,2] and have been studied by many others [3][4][5][6]. This is an information-centric perspective on abiogenesis, representing the putative beginning of genomic Darwinian evolution. Information centrism interprets a living organism as an operating genetic communication system in some connected domain that encodes and decodes genomic state relative to a replication channel.
While genomics, epigenomics, and transcriptomics of modern-day organisms are based on DNA, RNA, and epigenetic marks such as DNA methylation, RNA origins in their purest form concern the dual-function of RNA as an informational polymer and ribozyme. This article is similar in spirit to works in the 1970s through the 1990s, including wherein he studied the error-threshold of the critical fidelity to the main information. Two-dimensional spatial modeling has been applied in which reactions occur locally with finite diffusion, suggesting a spatially localized stochastic transition [38], simulated using Gillespie's stochastic simulation algorithm (SSA) [39]. Another model is of autocatalytic sets of collectively reproducing molecules, which has been developed in reflexively autocatalytic food-generated (RAF) theory [40]. Various physics-based analyses have been conducted, such as in light of Bayesian probability, thermodynamics, and critical phenomena [41]. Systems of quasi-species based on the principle of natural self-organization called hypercycles employ non-equilibrium auto-catalytic reactions [8]. Nonlinear kinetic models for polymerization have been used to study the emergence of self-sustaining sets of RNA molecules from monomeric nucleotides [42]. Theoretical analysis has been conducted into RNA origins. Attention has been drawn to an evolving population of dynamical systems and how dynamics affect the error threshold of early replicators and possibly towards compartmentalization conveying hypercycles [43]. String-replicator dynamics have been studied and properties suggested to be necessary to RNA origins, including the ability to operate a functional genetic communication system and ecological and evolutionary stability [44][45][46]. A variety of pre-RNA worlds have been suggested, with RNA being preceded or augmented by alternative informational polymers, such as other nucleic acids [47], beta amyloid [48], polycyclic aromatic hydrocarbons [49], lipids [24], peptides [50], and so on. It seems that pre-RNA worlds existed independent of the RNA world in the sense that they are not ancestral to the RNA world, and that these worlds may have had non-trivial interactions with the RNA world.
A key concept of stochastic systems is that of a hitting time: the time of the first occurrence of some event. We develop a simple mathematical model at the sequence level to represent the synthesis and function of RNA molecules in order to gain insight into the hitting times of various critical events of RNA origins of life. The idea is to study the surface of hitting times in terms of the structure of the system. The model lacks many features of realism, such as sequence size variability, finite sources of "food" (activated nucleotides in our context), limited diffusion rates, poor system mixing, and so on, in order to concentrate on the process as a search problem. The notion of fitness landscapes has been studied extensively in evolutionary biology [51]. Landscape topology has been considered in an Opti-Evo theory, which assumes sufficient environmental resources and argues that fitness landscapes do not contain "traps" and globally optimal sequences form a connected level-set [52]. We describe the model in Section 2, where we define a replicating reaction network, whose random realizations are constructed using SSA. We describe hitting times as key random variables of interest and characterize polymerization as a transition kernel. In Section 3, we conduct and discuss simulation studies based on SSA, where we analyze the structure of the polymerase measures and the input-output and survival behaviors of the hitting times given the parameters of the system. In Section 4, we end with conclusions.

Materials and Methods
We define M = {Adenine, Uracil, Guanine, Cytosine}, abbreviated as M = {A, U, G, C}, corresponding to the RNA bases. We study the space of sequences having length n, that is, E = M n with space of possibilities (a σ-algebra) E = 2 E , so that the pair (E, E ) is the measurable space of all RNA sequences of length n with |E| = 4 n . We let ν be a probability measure (distribution) on (E, E ), giving the probability space triple (E, E , ν). Appendix A gives an overview of (E, E , ν). A related space, though not utilized in this article, is the space of all RNA sequences up to length n, E * = ∪ n i=1 H i with E * = 2 E * and size |E * | = 4 3 (|E| − 1). We denote the collection of non-negative E-measurable functions by E ≥0 .
We build a simplified mathematical model for the time-evolution of a population of interacting RNA molecules in solution. Let X t be the population at time t ∈ R ≥0 with initial population X 0 . X t is a multiset, that is, it is a set containing elements possibly with repeats. We assume that the system is well mixed and has access to an infinite source of activated nucleotides.
The complement of x ∈ E is denoted x c ∈ E, attained using the base-pairing A with U and C with G. Let h(x, y) ∈ {0, 1, . . . , n} for x, y ∈ E (1) be the Hamming distance between x, y ∈ E as the number of positions in x and y where the nucleotides differ. We have h(x, x) = 0 and h(x, x c ) = n.

Core Model
We describe the reaction network of the system below. We simulate trajectories of the system using the stochastic simulation algorithm (SSA) [39], to simulate exact trajectories for the evolution of stochastic reaction networks here. SSA forms a Markovian process, where the arrival of reactions follows a Poisson (point) process, and assumes that the reaction volume is well mixed and homogeneous, with all parts of the system accessible for reactions. Reactions across various disjoint volume elements of the system are dependent. We do not consider the effects of finite diffusion, which effects a length scale above which disjoint volume elements are effectively independent [38].

System
We model the population of sequences which can form double-stranded helices, dissociate, and replicate with mutation with replicator fitness and sequence specificity. We interpret each system element as a set, either containing one element-a single-stranded sequence-indicated as {x}-or two elements, single and complementary stranded sequences, indicated as {x} ∪ {x c } = {x, x c }, where x, x c ∈ E (double bracket notation indicates a collection, or set, of sets, i.e., {{x}, {y}, {z}, · · · }). We define system elements as setsĒ with respective σ-algebras,Ē = 2Ē,F = 2F andḠ = 2Ḡ. The sizes are |Ē| = 4 n and |F| = 4 n /2, and |Ḡ| = 3 × 4 n /2. The setĒ contains single-stranded sequences, whereas the setF contains double-stranded sequences; the setḠ is the union ofĒ andF, containing both single and double stranded sequences. These sets are necessary to track the various sequences (single and double stranded). The reaction network of the system is given by and expressed in terms of sets Reaction (2) is double-strand formation from complementary single-strands with reaction rate k ds . Reaction (3) is the dissociation of double-strands into single-strands, caused by a heat source, with reaction rate k ss . Reaction (4) is template-directed polymerization of a single-strand (the template) by another single-strand (the polymerase), producing a single-strand complementary to the template with some fidelity, with reaction rate k rep (which functionally depends on the polymerase and template sequences).
The polymerization reaction rate is defined by where a > 0 is a positive constant, f : E → (0, 1] is the replicative fitness of x (as a polymerase) and s : E × E → (0, 1] is the similarity between x and y, a symmetric function. Finally, x replicates y, outputting a version y c * with mutations, where each nucleotide position has fidelity probability p : E → (0, 1]. Note that the similarity function can be either trivial/constant, i.e., s(x, y) = 1, or non-trivial. For example, if we assume sequence similarity to correlate to spatial proximity of sequences, as assumed below, then s(x, y) is non-trivial.

High-Fidelity Set
To define a high-fidelity set, pick an arbitrary subset of sequences R ⊂ E as highfidelity replicators of size r = |R|. We define R two ways.
We define R using a product of non-empty random nucleotide subsets {A i ⊆ M : i = 1, . . . , n} for each nucleotide position , etc. and r 1 + r 2 + r 3 + r 4 = n. For simplicity, we assume |A i | ∈ {1, 4} with fraction 4 being q ∈ (0, 1). Thus, R is a subset of E defined as a product space.
For another construction of R, we define a finite union of m random sequences R = {x 1 , . . . , x m }.

Distance
We define the Hamming distance H between sequence x and high-fidelity sequence set R as H(x, R) = min{h(x, y) : y ∈ R} ∈ {0, 1, . . . , n} for x ∈ E.

Fitness
We define "tent-pole" fitness f k of sequence x ∈ E and high-fidelity sequence set R for curvature parameter k ∈ R ≥0 as The maximums are the sequences of the high-fidelity sequence set R, which are the "points" or "poles" of the surface, with exponential decay into the remainder of the space in string distance. The strength of the decay is governed by parameter k, called the curvature parameter, which can be specified through the value of fitness at H(x, R) = n (sequence dimension), Appendix B describes other fitness functions.

Similarity
We define two cases for the similarity function appearing in the reaction rate of template-directed polymerization. The first case is the trivial (constant) case where the This assumes that there is no mechanism by which sequence specificity is selected for, such as in the case that polymerases should evolve to generically well replicate sequences, including their own. This means that they will spend much of their time replicating other sequences.
For the second case of a non-trivial similarity function, we note that RNA origins of life are thought to be a spatially localized stochastic transition, where high-fidelity replicators are found concentrated in foci, following from the increased replicative mass of the replicators. Hence, we implicitly encode spatial information through a non-trivial similarity function, based on a distance function, which increases the replicative system mass for similar (and here nearby) high-fidelity replicators, that is, if they're similar, then they're likely proximal. In the following definition, we use the notation ∧ for the minimum of two numbers, x ∧ y = min{x, y}. Distance S is defined between two sequences x, y ∈ E as Similarity s k of sequences x, y ∈ E for curvature parameter k ∈ R ≥0 is defined in terms of exponential decay as Presently, replicators can replicate other sequences well but not their own [34]. There may exist RdRps that are excellent polymerases and, in conjunction with RNA hammerhead ribozyme, engage in rolling circle amplification of the polymerase-hammerhead sequence (genome) so that the amplification process is self-cleaving. This results in a large increase in replicative mass due to the super-exponential growth in the population of the high-fidelity replicators within a small volume. In the context of SSA, the similarity function here is an ansatz for spatial locality.

Fidelity
Finally, polymerization fidelity probability for curvature k ∈ R ≥0 is defined as Note that f k (x, R) = p k (x, R) = 1 for high-fidelity sequences x ∈ R. Note that fitness, similarity, and fidelity are defined for single-stranded sequences (E, E ).

Counting Representation
The process X = (X t ) t∈R ≥0 is the time-evolution of the system. Recall that X t is a multiset. X t contains the individual single stranded molecules in the setĒ, i.e., sequences {x} ∈Ē and double stranded molecules in the set {x, x c } ∈F, with overall setḠ =Ē ∪F having size m = |Ḡ| = 3|Ē|/2. Note that, in the set representation, there is symmetry {x, x c } = {x c , x}, so that the size of the double-stranded set is equal to |F| = |Ē|/2. The system evolution X t induces a random counting measure N t on the overall space of single and double stranded sequences (Ḡ,Ḡ) as The total count, that is, the total number of molecules, is K t ≡ |X t | = N t (Ḡ). We assume that the counter N t is maintained for all times t ∈ R ≥0 .

Reaction Rates
The total reaction rate is given by the sum of the individual reaction rates where the reaction rate of double-strand formation from complementary single-strands is given byk the reaction rate of dissociation of double-strands into single-strands is given by and the reaction rate for template-directed polymerization of a single-strand by another single-strand (the polymerase) is given by The first reaction rate is a sum over the single-strands ofĒ with size 4 n . The second is a sum over double-strandsF with size 4 n /2. The third reaction is a sum over the product space of single stranded sequences,Ē 2 , with 4 2n = 16 n number of elements. Therefore,k(t) requires 4 n ( 3 2 + 4 n ) elements to be evaluated for every reaction. Clearly, direct representation on the full space is very expensive and impractical for even modest n. One obvious way to improve efficiency is not summing over the zero elements. We define sets and as the unique single and double stranded sequences of the system. Then, direct calculations of the reaction rates arek For this approach, the replication rate has quadratic dependence on |X 1 |. Using the reaction rates, the system may be exactly simulated using SSA. The reaction at time t with ratek(t) occurs over time interval t ∼ Exponential(1/k(t)). As the reaction ratek(t) increases with increasing number of molecules K t = N t (Ḡ), the reaction rate increases and reaction duration t decreases over time. The natural consequence of increasing process intensity is that the system speeds up.
The quadratic dependence may still be too expensive for large simulations. Appendix E describes Monte Carlo approximation of the reaction rates.

Hitting Times
We define some hitting times. The initial population consists of I single-stranded sequences X 0 , i.e., |X 0 | = I. We define the hitting time τ for the time of the first replication event We define the hitting time τ for the appearance of sequences in the high-fidelity sequence set R, Put X R = R ∪ {{x, x c } : x ∈ R} and define the volume fraction of high-fidelity sequences R at time t as We define the hitting time τ where high-fidelity sequences of R emerge and reach a minimum volume fraction, This hitting time reflects that period wherein a high-fidelity replicator has been identified yet there exists no complementary high-fidelity sequence for amplification, hence the system diversity continues to increase, decreasing the concentrations of all extant sequences as more sequences are discovered. The minimum hitting time captures the duration of time the high-fidelity replicator exists by itself. We define the hitting time τ for the time high-fidelity sequences in R constitutes some volume fraction v ∈ (0, 1] of the population, In practice for simulations, τ v is censored based on some total number of reactions, that is, if the volume fraction is not achieved by n reactions, τ v = ∞ because there is no arrival time. For SSA, we specify a maximum number of reactions N to simulate. We have parameters θ ∈ Θ for τ, such as landscape curvature k, sequence dimension n, etc. Therefore, τ(θ) is right-censored with value ∞ at simulation time a, as some simulations will stop at time a with no arrival time. These are censoring events. For fixed θ, the τ(θ) is a random variable, due to the stochastic nature of SSA. Hence, for each parameter vector θ, we attain a set of M realizations of hitting time τ as For convenience, we assume that the realizations T (θ) are ordered by non-censored followed by censored.

Functional Structure
For each parameter vector θ, we record two values: the number of hitting events in the hitting time set T and the average of the hitting time τ To describe the functional structure of the average hitting time f (θ), we require a classifier which determines whether or not there are zero hittings g(θ) = 0 and a regressor for the value of f (θ) for hittings g(θ) > 0. We assume that the parameters θ = (θ 1 , · · · , θ n ) are randomly sampled according to distribution ν = ∏ i ν i and the hitting times recorded. High dimensional model representation (HDMR) may be attained for the classifier (as a probabilistic discriminative model) and the regressor of f (θ). For the regressor, we have HDMR expansion The HDMR component functions { f u } convey a global sensitivity analysis, where, defining variance term we have a decomposition of variance The normalized terms S u = σ 2 u /σ 2 f are called sensitivity indices. Appendix F gives a brief description of global sensitivity analysis via HDMR.

Statistical Structure
A second analysis can be conducted on the hitting times T (θ) for the parameter vector θ using reliability theory. Put random hitting set where f is the hitting time probability density function ('failure density') and R is censoring time distribution ('reliability distribution'), and ϑ are the parameters of the density and distribution functions. Note that f and R each specify each other, so ϑ are the common parameters. Reliability definitions are given in Appendix G.
We show reliability quantities in Table 1 for the two-parameter (α, β) ∈ (0, ∞) 2 Weibull(α, β) distribution and Cox proportional hazard's model where γ is a vector of coefficients for the θ. We use the Python software lifelines for estimation of ϑ for the Weibull-Cox model from data [53]. The mean failure time Eτ(θ) is given by

Surface Chemistries
The system given by (2) of polymerization with mutation requires two separate hitting events, one sequence in the high-fidelity set x ∈ R and either another sequence in the high-fidelity set x ∈ R or its complement x c ∈ E, in order for high-fidelity replicators to maximally engage in templated-directed polymerization and achieve some fraction of the population. This setup of RNA polymerase action, requiring two such events for the polymerase and template, makes the hitting times long. Basically, the same information must be discovered twice before it can be used, which is unsatisfactory. We idealize polymerase activity conveyed by a non-RNA species, here clay, with the parameter k clay−p , as clay itself is not thought to be capable of polymerization but is capable of oligomerization of RNA. The reactions are given by where non-RNA polymerization has mutation with fidelity probability p ∈ (0, 1] and x c * is the complement of x with mutation. The reaction rates are given by Therefore, upon the first hitting of the high-fidelity replicators R with sequence x ∈ R through (4) or (23), x gives two high-similarity single-stranded sequences x and x c * through (24), which then may participate in template-directed RNA polymerization (4).

Reactions as Measure-Kernel-Functions
All the reactions x → y which involve substrate x may be represented using transition kernels, which form linear operators. At each iteration of SSA, a reaction type is chosen, followed by a transition to a particular domain (X, X ) with distribution ν t , followed by mapping into a codomain (Y, Y ) using transition probability kernel Q with distribution µ t = ν t Q. The notions of ν t and Q involve measure-kernel-functions. The probability of transition of y into B ∈ Y given x is given by Q Appendix C recalls some facts about Q. We define kernels Q for RNA and non-RNA polymerization to provide insight into the reactions. Consider X t for some t ∈ R ≥0 . Recall that N t is the random counting measure of X t on single and double-stranded sequences (E ∪ F, 2 E∪F ). For RNA and non-RNA polymerization, we take ν t as a (random) probability measure for x in domain (X, X ) and describe a transition probability kernel Q from x into y in codomain (Y, Y ). which results in the creation of the single-stranded sequence {y c * }. The first dimension is the polymerase and the second is the template. We give a definition of the probability measure on the product space of sequences. Recall that (E ⊗ E ) ≥0 denotes the collection of non-negative E ⊗ E -measurable functions.
Because the first two coordinates are preserved under the mapping, we focus on the new dimension as a transition from (E × E, E ⊗ E ) into (E, E ) using transition probability kernel Q. In this case, Q is defined by a 16 n × 4 n matrix whose rows vectors (dimension 4 n ) are probability vectors. The structure of Q follows from the polymerase replication with mutation, whereby each nucleotide position has fidelity probability p : E → (0, 1], which depends on the first dimension of E × E. We put p x = p(x) for sequence x ∈ E. Now, we state a simple fact on the binomial structure of the number of mutations made by a polymerase.

Theorem 1 (Mutation distribution). The number of mutations by polymerase
Now, we partition E into level sets (H i (y)) by Hamming distance to the template complement y c , We define the transition kernel Q for RNA polymerization, where Q completely encodes RNA polymerization using Theorem 1.
Corollary 1 (Transition probability kernel Q). We have that the transition probability kernel Q for RNA polymerization is defined by RNA polymerization is defined by Q using the binomial structure of polymerase mutation. A more sophisticated model could be defined as a sum of Bernoulli random variables with varying success probabilities in the Poisson binomial distribution. This could be used to take into account polymerase mutation that varies with nucleotide position. Another idea is taking into account schemata such as repeats which destabilize the polymerase [54].
It is multiplication of ν t as a 16 n dimension row vector with 16 n × 4 n dimension matrix Q, giving a 4 n dimension row vector ν t Q. We Then, µ t (H i ) for i ∈ {0, . . . , n} is the distribution on sequences by distance to R, i.e., contains the instantaneous information of RNA polymerization. A more general model for replication is where polymerase activity is tied to geometry, i.e., compartmentalization/spatial confinement, with state space (C, C) and to metabolic state (M, M). In this telling, the polymerase reaction rate could be tied to the degree of spatial confinement and the source of the activated nucleotides from metabolic precursors.

Non-RNA Polymerization
If there exists some kind of non-RNA polymerase activity, we have that the mapping which we regard as a mapping from (E, E ) into (E, E ). Let ν t be a probability measure on (E, E ) defined by Similar to RNA polymerization, for fidelity probability p ∈ (0, 1], we have Q as the 4 n × 4 n matrix defined by and and Note that, for SSA, Q is fixed over the simulation, whereas the probability measure ν t depends on time. That is, the reactions are chosen according to the reaction rates, and the reactions each use respective Q. The ν t is formed using a random counting measure, so ν t is random. This approach generalizes in the obvious way to all the reactions.

Decay
The RNA sequences have finite lifetimes in reality. This comes from a variety of sources, including radiation, pH, intrinsic molecular stability, etc. We assume doublestranded RNA is stable, whereas single-stranded RNA is not. Therefore, we create a reaction for decay of single-stranded RNA into constitutive nucleotides with reaction ratek

Compartmentalization
It is thought that compartmentalization plays a role in RNA origins of life, giving foci of reproducing sequences [23,55]. This is somewhat anticipated by the similarity function s : E × E → (0, 1], where sequences are more likely to copy similar sequences than less similar ones, due to an underlying spatial localization. Explicit spatial effects may be modeled by assuming each x ∈ E is marked with a position on a bounded subset of the . We think of this as a one-dimensional projection of the three-dimensional system. Additional species can be introduced, such as lipids, with reactions forming a vesicle M (vesiculation), which encloses some We assume the lipids interact with the single stranded sequences in A to form vesicles as with reaction rate k mic (t) = k mic N t (X 1 ).
Hence, vesiculation is coupled to the population of sequences by design so that it evolves on roughly the same time-scale as sequence activities. Note that vesicles can enclose one another, i.e., M(A) and M(B) where A ⊂ B or B ⊂ A, but cannot cross, i.e., for all vesicles at locations A, . . . , B we have that A ∩ B ∈ {A, B, ∅}. For example, suppose one vesicle A = [0, 1] encloses another two, B = [ 1 3 , 1 2 ] and C = [ 2 3 , 3 4 ]. Then, is disconnected in one dimension, the intervals are physically connected in three dimensions, where vesicles are spheres. We identify each vesicle to a union of disjoint intervals, disjoint across the vesicles. We posit that compartmentalization precedes the hitting of a high-fidelity replicator through ensuring necessary concentration of RNA and a stable environment. Generally, we identify compartmentalization state to (C, C) with probability measure ν. Let Q c be a transition probability kernel from (C, C) into (E, E ), encoding the transition from compartmentalization coordinates to RNA sequences. The product space (C × E, C ⊗ E ) has law µ = ν × Q c . Upon hitting a high-fidelity replicator and achieving Darwinian evolution to acquire information, e.g., genes, the sequences are assumed to become adapted to compartmentalization coordinates (C, C) through the transition probability kernel Q c from (C × E, C ⊗ E ) into (C, C), so that µ = ν × Q c × Q c is the law on the full space (C × E × C, C ⊗ E ⊗ C). In this telling, compartmentalization precedes RNA activity, and, upon hitting high-fidelity replicators that can maintain their information, is followed by genomic adaptation.

Metabolism
We identify metabolism reaction-state to the measurable space (M, M) with probability measure ν. Let Q m be a transition kernel from (M, M) into (E, E ), positing that metabolism precedes replication. For example, certain metabolic state may be precursors to the synthesis of RNA. Consider product space (M × E, M ⊗ E ) with measure µ = ν × Q m . Now we suppose that, upon achieving Darwinian evolution in replicators, the replicators will eventually become adapted to (M, M). Hence, we interpret (M, M) as a mark-space of (M × E, M ⊗ E ), representing genomic adaptation. Let Q m be a transition kernel from Therefore, metabolism-first followed by replication and genomic adaptation is encoded by the structures of Q m and Q m . We do not specify these transition kernels in this article but mention that they are richly textured.

Reaction Overview
The reactions of the system having decay and clay and their reaction orders are shown in Table 2. There is one zero-order reaction, three first-order reactions, and two secondorder reactions. Additionally, we show reactions and orders for compartmentalization and metabolism.

Results
Consider some initial population of I random sequences X 0 . The population over time is given by X t with associated random counting measure N t on (Ḡ, 2Ḡ). Recall parameters θ = (n, q, k, l, m, p, k ∅ , k ss , k ds , k rep , k clay−o , k clay−p ) for sequence dimension n, high-fidelity sequence set size q, fitness degree k, similarity degree l, fidelity degree m, clay fidelity probability p, RNA decay rate k ∅ , double-strand dissociation rate k ss , double-strand formation rate k ds , RNA replication rate k rep , and clay oligomerization and polymerization rates k clay−o and k clay−p . These parameters are summarized in Table 3.
The following is the description of how the parameter values were specified and to what they biologically correspond. The sequence dimension n is chosen from {3, 4, 5}. The fitness and similarity functions are chosen by setting the value of the range of the curvature parameters k and l from one (inside the high-fidelity manifold) to some small values, such as over an exponential grid. For example, when i = 0.1, the fitness of sequences that are maximally dissimilar have 10% of the fitness of the high-fidelity sequences. We range the grid from 0.1 to 0.001 for fitness and similarity. The RNA fidelity parameter m = 0.25 is chosen such that the high-fidelity sequences have value one and the lowest fidelity sequences have value 0.25, equal to random chance. The clay fidelity parameter is set to an optimistically high value of 0.9 for clay studies. The double-strand dissociation and formation rates k ss and k ds are set to unity as a baseline. In comparison, the RNA replication rate is set to a large value, 10, whereby replication is the dominant reaction. The decay parameter k ∅ is set to some uniform random value in (0, 1). The clay RNA oligomerization rate is set to unit, and 'clay' RNA polymerization rate is set to a uniform random value in (0, 20). With the parameters governing the reaction rates, different values of these parameters confer different regimes for the system.

Stability: ODEs
We characterize the zeros of the vector field f from ODE system (A1) and use the eigenvalues of the Jacobian (A2) to determine their stability. It follows from Theorem 2 that, for all other initial conditions, the system has no equilibria.

Corollary 2 (Unbounded).
For all initial conditions X 0 such that I = |X 0 | > 1, the system is unbounded.
This confirms the obvious: the system, a replicating network with no death, is almost always an increasing system.

Simulation Reaction State
We are interested in the behavior of temporal probability measures, ν t (25) on the sequence product space and µ t = ν t Q (27) on the sequence space, for RNA polymerization. These reveal the instantaneous information of the system. The structure of µ t reveals the state of polymerization and is a leading indicator of the population concentrations over time.

Core Model with "Tent" Functions, Probable Hitting
Take sequence dimension n = 3 and fitness and similarity curvature parameters k = l = − log(0.01)/n and fidelity parameter m = − log(0.25)/n. Set rates for doublestrand dissociation and formation k ss = k ds = 1 and polymerization rate k rep = 10 and use the "tent" function for fitness, similarity, and fidelity. Take random initial population X 0 with initial population size I = |X 0 | = 10 and random singleton R = {{x}} (q = 0). We simulate 5000 reactions, simulation censored at hitting time τ v for volume fraction v = 0.25. Take partition of the sequence space by Hamming distance to the high-fidelity manifold (H i ) (28) of sequence space E. In Figure 1, we plot measures of a typical realization of the system X t on the partition (H i ) of sequence concentration (Figure 1a), growth (Figure 1b), and polymerase sequence output µ t (Figure 1c). Some quantities are plotted on log-log scale, whereas others are plotted on a linear-log scale. These results show that the concentrations are relatively stable for most time, until the high-fidelity manifold is hit. Then, the concentration of high-fidelity replicators rapidly increases to exceed 25%. Similarly, Figure 1b shows the growth curves on a log-log scale, where the highfidelity manifold rapidly increases near the end of the simulation. Figure 1c shows the structure of the RNA sequence polymerization output temporal probability measure µ t . Low probability is assigned to polymerization of high-fidelity replicators for most of the reaction time, followed by a large increase near the end of the simulation, where highfidelity replicators dominate with 56% probability. Therefore, the RNA sequence polymerization output temporal probability measure µ t is a leading indicator of the concentration curve, i.e., at simulation end-time, concentration of high-fidelity replicators is 25% and polymerization output is 56%.

Core Model with "Tent" Functions, Improbable Hitting
We use the same configuration as Section 3.2.1 except for setting fitness and similarity curvature parameters to k = l = − log(0.1)/n. In Figure 2, we plot measures of a typical realization of the system X t on sequence partition by Hamming distance to the high-fidelity manifold (H i ) of concentration (Figure 2a), growth (Figure 2b), and µ t (Figure 2c). The behavior has completely changed: the high-fidelity group ends the simulation with around 6% concentration, only steadily increasing, and never hits. The polymerase output µ t shows 6%. This indicates that the concentration of high-fidelity replicators is unlikely to increase further, as the population is generally in equilibrium with the polymerase output.

Core Model with Linear Functions, Improbable Hitting
The same configurations for Section 3.2.1 are used, except the fitness, similarity, and fidelity functions are linear. Similar to the "tent" functions, we specify the terminus landscape curvature for fitness and similarity k = l. Then, the fitness function for RNA polymerization is given by and s k (x, y) = 1 + k − 1 n S(x, y) for x, y ∈ E We put fitness and similarity landscape curvature k = l = 0.01 for fitness and similarity and v = 0.25 for hitting volume fraction of high-fidelity replicators. We simulate X t for 5000 reactions. We find that the probability of hitting is near zero P(τ 0.25 (θ) < ∞) ∼ 0. In Figure A1, we plot measures of a typical realization of X t on (H i ) of concentration ( Figure A1a), growth ( Figure A1b), and µ t ( Figure A1c). The simulation ends with highfidelity concentration of ∼5% and polymerase output of ∼4%. Therefore, the concentration of high-fidelity replicators will continue to decrease. Linear surfaces are not sufficient to achieve hitting times τ v (θ) < ∞ for high-fidelity replicator volume fraction v = 0.25, in contrast to the nonlinear "tent" functions.

Expanded Model with "Tent" Functions, Probable Hitting
We consider a similar model to the previous subsections and expand it with clay oligomerization rate (of RNA) k clay−o , clay polymerization rate (of RNA) k clay−p , and clay polymerization fidelity p.
Therefore, the full set of variables is given by θ = (n, k ss , k ds , k, k clay−o , k clay−p , p). The value of fitness/similarity landscape curvature k, l and clay RNA polymerization raet k clay−p are set such that the replicative mass of each is initialized to 10. This means that RNA and clay polymerization have the same reaction mass at the beginning of the simulation. We set sequence dimension n = 3, fitness/similarity landscape curvature k = l = − log(0.01)/n, clay RNA polymerization fidelity p = 0.9, and double-strand dissociation and formation rates k ss = k ds = 1. This is a high hitting regime, i.e., the probability of hitting is close to one P(τ v (θ) < ∞) ∼ 1. In Figure A2, we plot measures of a typical realization of X t on (H i ) and additionally the probability of reactions over time. High-fidelity replicators ended the simulation with 25% concentration ( Figure A2a) and RNA polymerase output ∼62% (Figure A2c), indicating that the concentration of high-fidelity replicators will continue to increase. All species exhibit superexponential growth ( Figure A2b). Clay polymerization decreases in contribution over time, whereas RNA polymerization increases substantially over time, and RNA double-strand reactions are small and stable ( Figure A2d).

Hitting Times: Functional and Survival Analysis
We study various models in order of increasing complexity. We examine the hitting time surface τ v (θ) in the parameters θ ∈ Θ, including probability of hitting P(τ v (θ) < ∞). We begin with the core model with no decay or clay.
3.3.1. Core Model, τ v (θ) for v = 0.1 with θ = (n, k) and "Tent" Functions We are interested in the structure of the hitting time τ v (θ) of (19) as a function of the parameter vector θ. We use the Weibull-Cox proportional hazard's model of Table 1 for the hitting time τ v for volume fraction v = 0.1. Let θ = (n, k) with sequence dimension n ∈ {3, 4} and fitness and similarity parameters k = l = − log(i)/n for i ∈ {0.1, 0.05, 0.01, 0.005, 0.001}. Set m = − log(0.25)/n for fidelity probability parameters. For each value of sequence dimension n, take random initial population X 0 with initial population size I = |X 0 | = 10 and random singleton R = {{x}} for the high-fidelity manifold and fix these for the fitness/similarity landscape curvature parmeters k = l. We fix the double-strand dissociation and formation rate parameters k ss = k ds = 1 and set RNA polymerization rate k rep such that the overall RNA polymerization rate is given byk rep (0) = 10 and use the "tent" function for fitness, similarity, and fidelity. We take 10 realizations of hitting time τ v (θ) for each parameter vector θ ∈ Θ and allocate 5000 reactions. This gives 100 independent hitting times and up to 500,000 reactions. The times are comparable because the system is initialized to the same replication mass.
For the simulations, 66 hitting times are finite. The coefficients positively contribute to hitting, where γ n ≈ 0.97 and γ k ≈ 13.29, both with p-values less than 0.005. Therefore, hittings are strongly positively influenced by the parameters of the fitness and similarity functions and less so by the dimension. Plots of the coefficients and survival and cumulative hazard curves are given in Figure A3. High survival is found for k large and low survival for k small. Cumulative hazard is highest for k small.
We estimate HDMR of the classifier (whether or not hitting time is finite) using all 100 samples. The results are shown in Figure A5 and Table 4. The HDMR explains roughly 80% of variance. S k ≈ 0.69 and S n ≈ 0.06, so hitting probability is strongly influenced by fitness landscape curvature k and less so by sequence length n. The component functions f k and f n for fitness landscape curvature and sequence dimension are strictly decreasing, where larger fitness landscape curvature parameter k results in decreasing hitting probability. These results are consistent with the survival analysis.  Figure A4 and Table 5. The HDMR explains roughly 60% of variance. S n ≈ 0.57 and S k ≈ 0.04, so sequence dimension dominates the hitting time.
Both HDMR component functions f n and f k for sequence dimension and fitness landscape curvature are increasing. The HDMR results reveal that conditioning on hitting reverses the roles of sequence dimension n and fitness landscape curvature k. We expand the model to include clay and decay. We take parameter vector with double-strand dissociation and formation and clay oligomerization reaction rates k ss = k ds = k clay−o = 1. For each parameter vector θ ∈ Θ, (i) we choose singleton high-fidelity replicator manifold R = {x} for some RNA sequence x ∈ E and choose random initial population of RNA molecules X 0 such that the initial population size is 10, I = |X 0 | = 10, and where the initial population does not intersect the high-fidelity manifold X 0 ∩ R = ∅, that is, the initial population does not reside on the high-fidelity manifold; (ii) we initialize the replicative mass of the system such that the initial overall RNA polymerization reaction rate is given byk rep (0) = (1 − f clay )20 and the initial overall clay RNA polymerization reaction ratek clay−p (0) = f clay 20; (iii) we sample the hitting times τ v (θ) for volume fraction of high-fidelity replicators v = 0.10 a total of M = 10 times, each censored by 5000 reactions, giving hitting time set T (θ) of (20). We attain input-output data set as D = {(θ i , T (θ i )) : i = 1, . . . , 240}. This gives a total of 2400 simulations. For the simulations, 1546 hitting times are finite. The results of fitting the Weibull-Cox model are shown below in Table 6 and Figure A6. The curvature parameter k again significantly dominates with a large positive value. All the remaining parameters have values less than one. Sequence dimension n again is a relatively small positive contributor.
The clay fraction f clay is small and positive and replication fidelity parameter p is not significant. We estimate HDMR of the classifier (whether or not hitting time is finite) using all 2400 samples. Component functions and sensitivity indices are shown below in Figure A7 and Table 7. First-order HDMR captures 74% of explained variance, and second-order captures 4%. Curvature dominates hitting probability with large sensitivity index S k ≈ 67%. The HDMR component function in landscape curvature f k is a decreasing function, where small values increase and large values decrease hitting probability. Sequence dimension n has sensitivity index S n ≈ 2%, and the HDMR component function f n is decreasing, where high dimension decreases the probability of hitting. Clay parameter sensitivity index is small S f clay ≈ 2%, and the HDMR component function for fractional clay RNA polymerization rate, f f clay , is decreasing, where low-to-medium clay fractions increase and high-clay fractions decrease probability of hitting. The HDMR results are consistent with the Weibull-Cox model. We estimate HDMR of the regressor (hitting time) using the 1546 simulations with finite hitting time. Component functions and sensitivity indices are shown below in Figure A8 and Table 8. First-order HDMR captures 33% of explained variance, and secondorder captures 7%. In stark contrast to the contributions to the classifier, the parameters k and n are insignificant to hitting time. Instead, the largest sensitivity index is S f clay ≈ 20%. The HDMR component function for fractional clay RNA polymerization rate, f f clay , is an increasing function, where small f clay decreases and large f clay increases the hitting time. This suggests that high clay-fractions representing first-order reactions increase the hitting time, as clay polymerization has less replicative mass than RNA polymerization, i.e., things go faster with RNA polymerization. The second largest sensitivity index is decay S k ∅ ≈ 11%. Decay is an increasing function, with sharp increase in hitting times nearby one, i.e., things go slower with large decay resulting in increased hitting time.

Compartmentalization
Compartmentalization has a direct effect on the calculation of the reaction rates, specifically replication, by computing only a subset of the reactions in X 2 1 . Put For vesicle region A ∈ M, we have that and total replicative mass As M increases in size over time, the number of partitions grows, and Therefore, the replicative mass will be reduced with M, and the system evolves less quickly. This suggests that there is a trade-off between the degree of compartmentalization and the replicative mass of the system.

Discussion and Conclusions
Origins of life is a fascinating problem. The wonderful complexity of extant life follows from origins. The distribution of life in the universe is tied to origins.
In this article, we have attempted to peek into the problem by concentrating on the RNA world hypothesis, studying hitting times of high-fidelity replicators. We develop fitness, similarity, and fidelity functions as landscapes for a mathematical model of replicating RNA molecules at the sequence level and observe hitting times through simulation studies. We draw attention to the distinction between the probability of hitting P(τ(θ) < ∞) and the hitting time τ(θ) < ∞.
In terms of mathematical set-up, we interpret the reactions as measure-kernel-functions. Each reaction is identified to and fully encoded by a probability transition kernel. The reactions take place in some domain, whereby all molecules may interact. We note that, in reality, molecules have limited diffusion, and this effectively breaks the reaction domain into independent subdomains above some length scale, i.e., molecules are more likely to react with their neighbors. Therefore, we assume our reaction volume is sufficiently small such that all molecules may participate in the reactions. We use for modeling purposes the ansatz that sequence distance is correlated to spatial proximity, where similar sequences are proximal, using a non-trivial similarity function s : E × E → (0, 1]. Theorem 2 and its Corollary 2 show that the system (without decay) is unbounded and strictly increases. This formally shows the system to be a growth process. Next, we illustrate findings about the core system with probable hitting (Section 3.2.1). In particular, we see that the temporal image measure µ t = ν t Q, which describes the polymerization output, is a leading indicator of high-fidelity sequence concentration. Polymerization output and high-fidelity replicators super-exponentially increase near the end of the simulation. Next, the fitness and sequence curvature parameters are set at a higher value which confers reduced fidelity (Section 3.2.2). This reveals that hitting is never achieved and that the polymerization output is in equilibrium with population composition. Hence, the probability of hitting is strongly influenced by landscape curvatures. Next, linear curvature is utilized for fitness and similarity and results in no hitting (Section 3.2.3). This reveals that nonlinear curvature is necessary to achieve hitting of high-fidelity replicators. Next, we expand the model with non-RNA ('clay') based polymerization and find that such activity decreases over time, in contrast to RNA polymerization, which greatly increases and dominates other reactions over time the system (Section 3.2.4).
For functional and survival analysis of the hitting times, we study the core model, whereby hitting times are strongly positively influenced by the fitness and similar functions yet are not impacted significantly by sequence dimension (Section 3.3.1). In particular, survival analysis reveals low fitness curvature confers low survival (high hitting), whereas high fitness curvature confers high survival (low hitting). HDMR analysis shows that hitting probability is strongly influenced by fitness curvature and much less so by sequence dimension, supporting the survival analysis. HDMR analysis of hitting time shows reversed roles for sequence dimension and landscape curvature, where sequence dimension dominates hitting time, with curvature playing a far less significant role. This gives the finding that hitting probability is driven by curvature, whereas hitting time is driven by sequence dimension. Next, we perform functional and survival analysis of the core model augmented with 'clay and decay' dynamics (Section 3.3.2). Survival analysis shows similar results to the core model, where curvature dominates survival (no hitting), with sequence dimension playing a significantly reduced role. HDMR analysis of hitting probability shows that curvature dominates hitting probability, similar to the core model, whereas sequence dimension again plays a significantly reduced role. HDMR analysis of hitting time reveals that the presence of 'clay and decay' significantly increase hitting time, with curvature and sequence dimension playing insignificant roles. These results are consistent in that clay polymerization has less replicative mass than RNA polymerization, where RNA polymerization is a faster reaction.
Overall, we find that nonlinear landscapes are necessary for hitting: linear landscapes are insufficient. For nonlinear landscapes, we find that the probability of hitting is dominated by curvature and that hitting times are dominated by sequence dimension. These results suggest that the landscapes in nature are nonlinear with high curvature, and that the hitting time for high-fidelity replicators is an increasing function of sequence dimension. When clay and decay are added to the model, hitting probability is again dominated by curvature, and clay and decay are relatively insignificant. This reflects that clay and decay are low order reactions. They increase hitting times.
For replication and compartmentalization, we suggest that compartmentalization, while a necessary condition, slows overall system dynamics with increasing vesiculation rate. Essentially, as compartmentalization increases, there is a corresponding reduction in absolute replicative system mass, as certain reactions among elements are no longer possible, being physically sequestered. While the timescale of a simulation is tied to the replicative system mass of the system, there is variability in replicative mass across compartments. Some compartments contain large genomic and metabolic populations. It favors the search for the high-fidelity replicator by there being a distribution on compartmental 'fitness' such as resource concentrations so that the high-fitness compartments drive replicative system mass. Compartmentalization is identified to the measure ν on coordinates in (C, C), for which coordinates are "marked" by sequences through the transition probability kernel Q c , and followed by genomic adaptation via the transition probability kernel Q c .
Metabolism is thought to be identified to production of precursors to RNA synthesis, leading to replication identification, followed by genomic adaptation to metabolic state. Metabolism is thus defined through the transition kernels Q m and Q m .
The independence of the transition kernels can be scrutinized, and it is possible that general transition kernels on the full product spaces across location, genomic, and metabolic states are necessary to satisfactorily explain RNA origins, i.e., all three functions may have co-evolved. This notion is suggested in the hot springs hypothesis for origins, where compartmentalization is hypothesized to furnish necessary conditions to genomic and metabolic state [26][27][28]. In this telling, the base measurable space of interest is (F, F ) = (C × E × M, C ⊗ E ⊗ M) with measure ν. Then, identification of a high-fidelity replicator is described through the transition probability kernel Q cem from (F, F ) into (F, F ) in "marking" the base measurable space with state for genomic replication; finally, genomic adaptation is conveyed through the kernel Q cem from (F × F, F ⊗ F ) into (F, F ). Hence, RNA origins of life has law ν × Q cem × Q cem on the product space (F × F × F, F ⊗ F ⊗ F ), reflecting the steps of replicator identification and adaptation through the definitions transition kernels Q cem and Q cem .
A putative "genesis machine" here is a mapping from the base (initial) measurable space (F, F ) into the product space of identified high-fidelity replicators undergoing adaptation, i.e., (F × F × F, F ⊗ F ⊗ F ). More generally the base space could additionally contain amino acid sequence space (P, P ). Such a machine is fully specified through the definitions of the distribution ν on the base space and the transition kernels Q cemp and Q cemp (pre and post genes, respectively). Because the stages of transition occur purely through random drift, an experiment performed by such a machine would take an unacceptably long period of time to complete. Experimental demonstration can be contemplated by augmenting the base measurable space with a control space (X, X ) to accelerate dynamics, using for instance closed-loop shaped radiation to address molecular degrees of freedom in their appropriate frequency domains, (open-loop) catalysts, temperature, geometry, selection, concentration through centrifugal force, etc., resulting in the new (four-dimensional) base space ( F, F ) = (C × E × M × P × X, C ⊗ E ⊗ M ⊗ P ⊗ X ). Then, the transition kernels Q cemp and Q cemp become mappings from ( F, F ) into ( F, F ) and from ( F × F, F × F) into ( F, F ), respectively. The general design of a genesis machine is the definitions of the augmented base space ( F, F ), its distribution ν, and the augmented transition kernels Q cemp and Q cemp , giving law µ = ν × Q cemp × Q cemp on the full 15-dimensional product space ( F × F × F, F ⊗ F ⊗ F ), written in differential notation Origins could be experimentally demonstrated using a sequence of adaptive control fields in (X, X ), cycling through the transitions, and a detection system for online identification of the system whose elements belong to the product measurable space. The full design and estimated operating timescale for such a machine needs further research to assess practical feasibility. We call the creation of primordial life (primordia) by the continuous causal efforts of a genesis machine given initial prebiotic conditions artebiogenesis, where arteis Latin and means "from skill." The primordia are not necessarily those that occurred in nature. Primordia and their genesis represent non-trivial system trajectories across the transition to the earliest life in sterile environments and belong to a manifold of primordial lifeforms, each having characteristic geochemical setting.
The probability measure µ t = ν t Q has additional utility to integrate 'test' functions or queries about the system. If we let f ∈ E ≥0 be a fitness function, then the fitness value J(µ t ) = µ t ( f ) is the expected value of the fitness function with respect to the probability measure µ t . In OptiEvo theory, J(µ t ) is studied as a function of the population X t on (E, E ) [52]. OptiEvo assumes that the set of all probability measures {µ t } is convex and that X t has sufficient flexibility such that J(µ t ) may be explored around µ t . Then, OptiEvo predicts that J(µ t ) has global maxima on (E, E ) and that these form a connected level-set of sequences with the same fitness value. Both predictions are consistent with our model. The first prediction is consistent with zero distance in fitness and similarity functions for high-fidelity sequences. The second prediction is consistent with the high-fidelity set being a singleton or a product-space construction. A contention is whether X t has sufficient flexibility in exploring J(µ t ) around µ t . This has direct bearing on the structure of Q: if X t is inflexible, then Q is constrained to certain subspaces of (E, E ), i.e., not all transitions are possible.
In future work, the model could be extended to the space of sequences of lengths up to n, (E * , E * ) or even the space of sequences of all lengths, where distance and similarity functions would utilize a more general string distance metric, e.g., Levenshtein distance. We note that the size of (E * , E * ) is not much larger than (E, E ). Alternative similarity functions could be explored, such as the trivial case of constant similarity, e.g., s(x, y) = 1 for (x, y) ∈ E × E. A limitation of this article is the restriction to short sequences due to computational efficiency. The numerical size of the sequences is mathematically low-dimensional and does not correspond to actual functions of RNA molecules. Other parameter sets can be explored for example using experimentally derived values for reaction rates, so that the timescales are calibrated. Future research could see the simulation software rewritten for a high-performance computing environment, enabling much longer, e.g., length 10-1,000, sequences to be studied. Polymerase fitness can be made empirical using known RdRp sequences as members of the high-fidelity manifold. Another area of future work could be studying the aforementioned transition kernels Q c , Q c , Q m , and Q m . More general models for polymerization transition kernel based on the structure of the Poisson-binomial distribution could be employed. It would be interesting to study lipid-RNA and metabolism-RNA interactions and equip the system with the ability to append nucleotides to their sequences to form functional genes, such as storing useful information for the replication channel, perhaps a Hammerhead ribozyme to convey rolling-circle amplification. We note that transition kernels here generally lack amino acid state and are pure-RNA. An area to explore is the notion of the transition kernel into the space of highfidelity replicators to depend on amino acid sequences and then to elaborate the system to contain a primitive translation system and examine various hitting times. Additional reactions can be introduced as operations on pairs of sequences, such as concatenation, and others for sequence splitting, and so on, with corresponding transition kernels enabling RNA networking and recombination dynamics.
Author Contributions: All authors contributed to the study conception and design. Material preparation, prepared software, data generation and analysis were performed by C.D.B. The first draft of the manuscript was written by C.D.B. and all authors commented on previous versions of the manuscript. Grant funding was provided through H.R. All authors have read and agreed to the published version of the manuscript. Acknowledgments: The authors thank the late Freeman Dyson for the discussions in 2018 of these ideas at the Institute for Advanced Study in Princeton, New Jersey, as well as anonymous reviewers who gave critical comments that substantially improved the quality of this article.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.
Here, the spaces are discrete, i.e., where ν t {x} is the probability mass at the point x ∈ X at time t ≥ 0.

Appendix C.1. Reactions as Measure-Kernel-Functions
We index the reaction types on (Z, Z ) = (N >0 , 2 N >0 ). Let η t be the probability measure on (Z, Z ) formed from the normalized reaction rates. Let Q * be the transition kernel from (Z, Z ) into (X, X ). Then, η t Q * = ν t is the distribution on (X, X ) and µ t = ν t Q is the distribution on (Y, Y ). Then, a reaction is the mapping (Z, Z ) → (X, X ) → (Y, Y ).

Reaction Cardinality
In this section, we study, instead of the hitting time τ v (θ), the hitting reaction number v (θ) for v = 0.1. We study the core model with parameter vector θ = (n, k). We uniformly sample sequence length n ∈ {3, 4, 5} and fitness/similarity landscape curvature k = l ∈ {0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1} and form input-output data D = {(θ i , (θ i )) : i = 1, . . . , 1200}. We set double-strand dissociation and formation rates k ss = k ds = 1 and initialize the overall RNA polymerization rate k rep such that the initial replicative mass is 10. For each parameter vector θ ∈ Θ, we randomly choose the initial population of sequences X 0 with initial population size I = |X 0 | = 10 and singleton high-fidelity sequence manifold R such that the initial population of sequences does not intersect the high-fidelity replicator manifold, X 0 ∩ R = ∅.
D contains 781 hitting events. We denote these D * . We form a first-order HDMR on (θ) using D * . The HDMR truth-plot, component functions, and sensitivity indices are shown below in Figure A9. First-order HDMR captures approximately 31% of variance.
Both component functions f n and f k have similar sizes, with f n somewhat larger than f k . The HDMR component function for sequence dimension f n is essentially an increasing linear function of sequence length n, and component function for landscape curvature f k is generally an increasing function of the fitness/similarity landscape curvature. These results suggest that sequence dimension and curvature influence the hitting reaction. Larger sequences and flatter curvature increase the hitting reaction.

Appendix E. Approximate Reaction Rates
One approach to reducing the computational complexity ofk(t) is to approximate the sums using Monte Carlo. Define random variables X 1 ∼ Uniform(X 1 ) and X 2 ∼ Uniform(X 2 ). Let {X 1i } and {X 2i } be independencies of such random variables. Given N random samples of X 1 , the first reaction rate becomes whose expected value is approximated using M realizations, requiring a total of MN evaluations. In a similar manner, the second reaction rate is Ek ss (t|M, N) and, putting X * 1 ∼ Uniform(X 1 ), we have the third reaction Ek rep (t|M, N) We refer to SSA simulation with Monte Carlo approximate reaction rates as Monte Carlo Approximate SSA, or MCASSA.