Hitting Times of Some Critical Events in RNA Origins of Life

Bastian, Caleb Deen; Rabitz, Hershel

doi:10.3390/life11121419

by Caleb Deen Bastian ¹,*,† and Hershel Rabitz ¹,²,†

¹ Program in Applied and Computational Mathematics, Princeton University, Princeton, NJ 08544, USA

² Department of Chemistry, Princeton University, Princeton, NJ 08544, USA

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Life 2021, 11(12), 1419; https://doi.org/10.3390/life11121419

Submission received: 1 November 2021 / Revised: 9 December 2021 / Accepted: 11 December 2021 / Published: 17 December 2021

(This article belongs to the Collection Experimentally Testing Origin of Life Hypotheses in the Laboratory, at Field Analogs and Computationally)

Abstract:

Can a replicase be found in the vast sequence space by random drift? We partially answer this question through a proof-of-concept study of the times of occurrence (hitting times) of some critical events in the origins of life for low-dimensional RNA sequences using a mathematical model and stochastic simulation studies from Python software. We parameterize fitness and similarity landscapes for polymerases and study a replicating population of sequences (randomly) participating in template-directed polymerization. Under the ansatz of localization where sequence proximity correlates with spatial proximity of sequences, we find that, for a replicating population of sequences, the hitting and establishment of a high-fidelity replicator depends critically on the polymerase fitness and sequence (spatial) similarity landscapes and on sequence dimension. Probability of hitting is dominated by landscape curvature, whereas hitting time is dominated by sequence dimension. Surface chemistries, compartmentalization, and decay increase hitting times. Compartmentalization by vesicles reveals a trade-off between vesicle formation rate and replicative mass, suggesting that compartmentalization is necessary to ensure sufficient concentration of precursors. Metabolism is thought to be necessary to replication by supplying precursors of nucleobase synthesis. We suggest that the dynamics of the search for a high-fidelity replicase evolved mostly during the final period and, upon hitting, would have been followed by genomic adaptation of genes and to compartmentalization and metabolism, effecting degree-of-freedom gains of replication channel control over domain and state to ensure the fidelity and safe operations of the primordial genetic communication system of life.

Keywords:

RNA world; stochastic simulation algorithm; random counting measure; measure-kernel-function; ordinary differential equation; high dimensional model representation; global sensitivity analysis; fitness and similarity sequence landscapes; hitting times; survival analysis

1. Introduction

The origins of life, abiogenesis, is a matter of high importance, for it gives insight into the distribution of life in the universe. We focus on the RNA world hypothesis, where life began with self-replicating RNA molecules that can evolve under Darwinian evolution, following necessary conditions of compartmentalization and metabolism, for geometry and synthesis of nucleobases from metabolic precursors, respectively. Self-replicating sets of RNA were proposed first by Tibor Ganti [1,2] and have been studied by many others [3,4,5,6]. This is an information-centric perspective on abiogenesis, representing the putative beginning of genomic Darwinian evolution. Information centrism interprets a living organism as an operating genetic communication system in some connected domain that encodes and decodes genomic state relative to a replication channel.

While genomics, epigenomics, and transcriptomics of modern-day organisms are based on DNA, RNA, and epigenetic marks such as DNA methylation, RNA origins in their purest form concern the dual function of RNA as an informational polymer and ribozyme. This article is similar in spirit to works from the 1970s through the 1990s, including Manfred Eigen’s works on replicating sets of RNA [7,8]. Clues to the RNA world, among others, are found in the nucleotide moieties in acetyl coenzyme A and vitamin B12, the structure of the ribosome as a ribozyme [9] and, moreover, the centrality of RNA to the translation system, and the existence of viroids [10]. A putative canonical RNA origins sequence involves an RNA-dependent RNA polymerase (RdRp) ribozyme, whose first gene is perhaps the hammerhead (HH) ribozyme, enabling rolling-circle amplification of the sequence [11]. This setup is not strictly necessary, as the first role of RNA may not have been template-assisted polymerization but may instead have been based on mechanisms of RNA recombination and networking [12]. If we assume the RdRp sequence has a length of 200 nucleotides, then there are $4^{200} \approx 10^{120}$ sequences. Starting from some population of interacting RNA molecules, we are interested in the times of first occurrence of critical events. Evolution seems to have concluded this search for a high-fidelity replicator in a fairly short period of time, i.e., within 400 million years of the Earth having a stable hydrosphere [5]. RdRps are known to be very ancient enzymes, are necessary to all viruses with RNA genomes, and have been proposed to have originated from junctions of proto-tRNAs relative to the context of a primitive translation system [13]. Recent work has shown that replicative RNA and DNA polymerases have a common ancestor in an RdRp [14]. Directed evolution, selecting on polymerization, is a potential way of identifying such a ribozyme. Directed compartmentalized self-replicating systems (RNA or DNA polymerases) mimicking prebiotic evolution have been demonstrated whereby polymerases are selected on their ability to replicate their own encoding gene [15]. Directed evolution has identified an RdRp that can replicate its evolutionary ancestor, an RNA ligase ribozyme; however, at increased activity, it has reduced fidelity and cannot maintain the integrity of its information [16].

A subtlety to the RNA origins argument is that template-directed polymerization by its nature requires two copies of the RdRp sequence, one for the polymerase and another for the template. This makes the RNA origins search extend until two copies are discovered. Cross reactions with other species also may influence the search time for the high-fidelity replicators. The clay mineral montmorillonite, which is common on Earth, can catalyze RNA oligomerization [17]; however, the utility of montmorillonite in these activities is not thought to be sufficient for origins, having been extensively studied [18,19]. Interestingly, it has been proposed that clay not only promotes origins, but constitutes it, which then later gave rise to RNA-based replication [20]; such mineral life as a genetic communication system has a high mutation rate and is degenerate.

Theories of abiogenesis study metabolism [21], cellular compartmentalization [22,23,24], hydrothermal vent chemical gradient energy [25], or hot springs [26,27,28], and so on, to define geochemical settings suitable for origins. These settings are compatible with RNA origins. Compartmentalization leading to selection on random sequences has been explored by studying environment forcing in hot springs and their effects on sequence identification [26,27,28], where the hypothesis is that fluctuations in environment forcing through cycling of wet, dry, and moist phases of lipid-encapsulated sequences subject sequences to combinatorial selection and identify structural and catalytic functions from the initial system state of random sequences. These functions include metabolic activity, pore formation, and structural stabilization. We assume the prebiotic molecular inventories of RNA and its precursors are provided by meteorites [29] and/or by Miller–Urey processes [30], such as from formamide [31] or many phase synthesis [32,33]. For energy and environmental factors, we consider a variant of Darwin’s “warm little pond”, where the putative environment for RNA origins of life is an icy pond with geothermal activity, a hot spring, or perhaps a hydrothermal vent: ice and cold temperature facilitate complexing of single strands into double strands and polymerization [34], and heat (energy) facilitates dissociation of double strands into single strands. More information on abiotic sources of organic compounds, mechanisms of synthesis and function of macromolecules, energy sources, and environmental factors can be found in the literature [35].

RNA origins have attracted many modeling efforts and analyses [36,37]. The concept of a self-replicating set of RNA molecules was initially studied by Manfred Eigen [7], wherein he studied the error-threshold of the critical fidelity to the main information. Two-dimensional spatial modeling has been applied in which reactions occur locally with finite diffusion, suggesting a spatially localized stochastic transition [38], simulated using Gillespie’s stochastic simulation algorithm (SSA) [39]. Another model is of autocatalytic sets of collectively reproducing molecules, which has been developed in reflexively autocatalytic food-generated (RAF) theory [40]. Various physics-based analyses have been conducted, such as in light of Bayesian probability, thermodynamics, and critical phenomena [41]. Systems of quasi-species based on the principle of natural self-organization called hypercycles employ non-equilibrium auto-catalytic reactions [8]. Nonlinear kinetic models for polymerization have been used to study the emergence of self-sustaining sets of RNA molecules from monomeric nucleotides [42]. Theoretical analysis has been conducted into RNA origins. Attention has been drawn to an evolving population of dynamical systems and how dynamics affect the error threshold of early replicators and possibly towards compartmentalization conveying hypercycles [43]. String-replicator dynamics have been studied and properties suggested to be necessary to RNA origins, including the ability to operate a functional genetic communication system and ecological and evolutionary stability [44,45,46]. A variety of pre-RNA worlds have been suggested, with RNA being preceded or augmented by alternative informational polymers, such as other nucleic acids [47], beta amyloid [48], polycyclic aromatic hydrocarbons [49], lipids [24], peptides [50], and so on. 
It seems that pre-RNA worlds existed independent of the RNA world in the sense that they are not ancestral to the RNA world, and that these worlds may have had non-trivial interactions with the RNA world.

A key concept of stochastic systems is that of a hitting time: the time of the first occurrence of some event. We develop a simple mathematical model at the sequence level to represent the synthesis and function of RNA molecules in order to gain insight into the hitting times of various critical events of RNA origins of life. The idea is to study the surface of hitting times in terms of the structure of the system. The model lacks many features of realism, such as sequence size variability, finite sources of “food” (activated nucleotides in our context), limited diffusion rates, poor system mixing, and so on, in order to concentrate on the process as a search problem. The notion of fitness landscapes has been studied extensively in evolutionary biology [51]. Landscape topology has been considered in an Opti-Evo theory, which assumes sufficient environmental resources and argues that fitness landscapes do not contain “traps” and globally optimal sequences form a connected level-set [52].

We describe the model in Section 2, where we define a replicating reaction network, whose random realizations are constructed using SSA. We describe hitting times as key random variables of interest and characterize polymerization as a transition kernel. In Section 3, we conduct and discuss simulation studies based on SSA, where we analyze the structure of the polymerase measures and the input–output and survival behaviors of the hitting times given the parameters of the system. In Section 4, we end with conclusions.

2. Materials and Methods

We define $M = \{\text{Adenine}, \text{Uracil}, \text{Guanine}, \text{Cytosine}\}$, abbreviated as $M = \{A, U, G, C\}$, corresponding to the RNA bases. We study the space of sequences having length $n$, that is, $E = M^{n}$ with space of possibilities (a σ-algebra) $\mathcal{E} = 2^{E}$, so that the pair $(E, \mathcal{E})$ is the measurable space of all RNA sequences of length $n$ with $|E| = 4^{n}$. We let $ν$ be a probability measure (distribution) on $(E, \mathcal{E})$, giving the probability space triple $(E, \mathcal{E}, ν)$. Appendix A gives an overview of $(E, \mathcal{E}, ν)$. A related space, though not utilized in this article, is the space of all RNA sequences up to length $n$, $E^{*} = \cup_{i=1}^{n} M^{i}$ with $\mathcal{E}^{*} = 2^{E^{*}}$ and size $|E^{*}| = \frac{4}{3}(|E| - 1)$. We denote the collection of non-negative $\mathcal{E}$-measurable functions by $\mathcal{E}_{\geq 0}$.

We build a simplified mathematical model for the time-evolution of a population of interacting RNA molecules in solution. Let $X_{t}$ be the population at time $t \in \mathbb{R}_{\geq 0}$ with initial population $X_{0}$. $X_{t}$ is a multiset, that is, a collection whose elements may repeat. We assume that the system is well mixed and has access to an infinite source of activated nucleotides.

The complement of $x \in E$ is denoted $x^{c} \in E$, attained using the base-pairing A with U and C with G. Let

$$h(x, y) \in \{0, 1, \dots, n\} \quad \text{for } x, y \in E \tag{1}$$

be the Hamming distance between $x, y \in E$, defined as the number of positions in $x$ and $y$ where the nucleotides differ. We have $h(x, x) = 0$ and $h(x, x^{c}) = n$.
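The complement and Hamming distance above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' software; the function names `complement` and `hamming` are hypothetical. Note that the complement is taken position by position (which is what $h(x, x^{c}) = n$ implies), not as a reverse complement.

```python
# Base-pairing used in the text: A with U, C with G.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def complement(x: str) -> str:
    """Position-wise complement x^c of an RNA sequence x."""
    return "".join(COMPLEMENT[base] for base in x)


def hamming(x: str, y: str) -> int:
    """Hamming distance h(x, y): number of positions where x and y differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length n")
    return sum(a != b for a, b in zip(x, y))


x = "AUGC"
print(complement(x))              # complement of every position
print(hamming(x, x))              # h(x, x)
print(hamming(x, complement(x)))  # h(x, x^c), equal to the length n
```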

2.1. Core Model

We describe the reaction network of the system below. We simulate exact trajectories of this stochastic reaction network using the stochastic simulation algorithm (SSA) [39]. SSA forms a Markovian process, where the arrival of reactions follows a Poisson (point) process, and assumes that the reaction volume is well mixed and homogeneous, with all parts of the system accessible for reactions. Reactions across various disjoint volume elements of the system are dependent. We do not consider the effects of finite diffusion, which sets a length scale above which disjoint volume elements are effectively independent [38].

2.1.1. System

We model the population of sequences, which can form double-stranded helices, dissociate, and replicate with mutation according to replicator fitness and sequence specificity. We interpret each system element as a set containing either one element, a single-stranded sequence, indicated as $\{x\}$, or two elements, a single-stranded sequence and its complement, indicated as $\{x\} \cup \{x^{c}\} = \{x, x^{c}\}$, where $x, x^{c} \in E$ (double bracket notation indicates a collection, or set, of sets, i.e., $\{\{x\}, \{y\}, \{z\}, \dots\}$). We define system elements as sets

$$\begin{aligned} \bar{E} &\equiv \{\{x\} : x \in E\} \\ \bar{F} &\equiv \{\{x, x^{c}\} : x \in E\} \\ \bar{G} &\equiv \bar{E} \cup \bar{F} \end{aligned}$$

with respective σ-algebras $\bar{\mathcal{E}} = 2^{\bar{E}}$, $\bar{\mathcal{F}} = 2^{\bar{F}}$, and $\bar{\mathcal{G}} = 2^{\bar{G}}$. The sizes are $|\bar{E}| = 4^{n}$, $|\bar{F}| = 4^{n}/2$, and $|\bar{G}| = 3 \times 4^{n}/2$. The set $\bar{E}$ contains single-stranded sequences, whereas the set $\bar{F}$ contains double-stranded sequences; the set $\bar{G}$ is the union of $\bar{E}$ and $\bar{F}$, containing both single- and double-stranded sequences. These sets are necessary to track the various sequences (single and double stranded). The reaction network of the system is given by

$$x + x^{c} \overset{k_{ds}}{\to} x \cup x^{c} \tag{2}$$

$$x \cup x^{c} \overset{k_{ss}}{\to} x + x^{c} \tag{3}$$

$$x + y \overset{k_{rep}(x, y)}{\to} x + y + y_{*}^{c} \tag{4}$$

and expressed in terms of sets

$$\begin{aligned} \{x\}, \{x^{c}\} &\overset{k_{ds}}{\to} \{x, x^{c}\} \\ \{x, x^{c}\} &\overset{k_{ss}}{\to} \{x\}, \{x^{c}\} \\ \{x\}, \{y\} &\overset{k_{rep}(x, y)}{\to} \{x\}, \{y\}, \{y_{*}^{c}\}. \end{aligned}$$

Reaction (2) is double-strand formation from complementary single-strands with reaction rate $k_{ds}$. Reaction (3) is the dissociation of double-strands into single-strands, caused by a heat source, with reaction rate $k_{ss}$. Reaction (4) is template-directed polymerization of a single-strand (the template) by another single-strand (the polymerase), producing a single-strand complementary to the template with some fidelity, with reaction rate $k_{rep}$ (which functionally depends on the polymerase and template sequences).

The polymerization reaction rate is defined by

$$k_{rep}(x, y) = a f(x) s(x, y) \in (0, a] \quad \text{for } x, y \in E \tag{5}$$

where $a > 0$ is a positive constant, $f : E \to (0, 1]$ is the replicative fitness of $x$ (as a polymerase), and $s : E \times E \to (0, 1]$ is the similarity between $x$ and $y$, a symmetric function. Finally, $x$ replicates $y$, outputting a version $y_{*}^{c}$ with mutations, where each nucleotide position has fidelity probability $p : E \to (0, 1]$. Note that the similarity function can be either trivial/constant, i.e., $s(x, y) = 1$, or non-trivial. For example, if we assume sequence similarity to correlate with spatial proximity of sequences, as assumed below, then $s(x, y)$ is non-trivial.
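As an illustration of the mutated copy $y_{*}^{c}$, the following Python sketch copies a template position by position with fidelity probability $p$. The text does not specify how an erroneous base is chosen; drawing it uniformly from the three other bases is an assumption made here, and the function name `copy_with_mutation` is hypothetical.

```python
import random

BASES = "AUGC"
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def copy_with_mutation(y: str, p: float, rng: random.Random) -> str:
    """Return a noisy complement y_*^c of the template y.

    Each position is copied correctly with probability p. On an error, the
    output base is drawn uniformly from the three incorrect bases (an
    assumption of this sketch, not stated in the text).
    """
    out = []
    for base in y:
        correct = COMPLEMENT[base]
        if rng.random() < p:
            out.append(correct)
        else:
            out.append(rng.choice([b for b in BASES if b != correct]))
    return "".join(out)


rng = random.Random(0)
print(copy_with_mutation("AUGCAUGC", p=1.0, rng=rng))  # exact complement
print(copy_with_mutation("AUGCAUGC", p=0.8, rng=rng))  # may contain errors
```

With $p = 1$ the copy is the exact complement; lower values of $p$ introduce point mutations at a rate of $1 - p$ per position.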

2.1.2. High-Fidelity Set

To define a high-fidelity set, pick an arbitrary subset of sequences $R \subset E$ as high-fidelity replicators of size $r = |R|$. We define $R$ in two ways.

First, we define $R$ using a product of non-empty random nucleotide subsets $\{A_{i} \subseteq M : i = 1, \dots, n\}$, one for each nucleotide position,

$$R = A_{1} \times \dots \times A_{n}$$

so that

$$r = \prod_{i=1}^{n} |A_{i}| = 1^{r_{1}} 2^{r_{2}} 3^{r_{3}} 4^{r_{4}}$$

where $r_{2} = |\{A_{i} : |A_{i}| = 2\}|$, etc., and $r_{1} + r_{2} + r_{3} + r_{4} = n$. For simplicity, we assume $|A_{i}| \in \{1, 4\}$, with the fraction of positions having $|A_{i}| = 4$ being $q \in (0, 1)$. Thus, $R$ is a subset of $E$ defined as a product space.

For another construction of $R$, we define a finite union of $m$ random sequences, $R = \{x_{1}, \dots, x_{m}\}$.

2.1.3. Distance

We define the Hamming distance $H$ between sequence $x$ and high-fidelity sequence set $R$ as

$$H(x, R) = \min\{h(x, y) : y \in R\} \in \{0, 1, \dots, n\} \quad \text{for } x \in E. \tag{6}$$
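A short Python sketch may help connect the two constructions of $R$ with the distance $H(x, R)$. For the product construction, the minimum in (6) reduces to counting the positions where $x_{i} \notin A_{i}$, so $R$ never needs to be enumerated. All function names are hypothetical, and sampling each position independently with probability $q$ is an assumption about how the fraction $q$ is realized.

```python
import itertools
import random

BASES = "AUGC"


def random_product_set(n, q, rng):
    """Sample A_1..A_n: each A_i is all four bases with probability q
    (assumed independent per position), else a single random base."""
    return [set(BASES) if rng.random() < q else {rng.choice(BASES)}
            for _ in range(n)]


def dist_to_product_set(x, subsets):
    """H(x, R) for R = A_1 x ... x A_n: count positions with x_i not in A_i."""
    return sum(base not in a_i for base, a_i in zip(x, subsets))


def dist_to_finite_set(x, seqs):
    """H(x, R) for an explicit finite set R, by direct minimization."""
    return min(sum(a != b for a, b in zip(x, y)) for y in seqs)


def enumerate_product_set(subsets):
    """Explicit list of the sequences in R (only feasible for small r)."""
    return ["".join(t) for t in itertools.product(*subsets)]


subsets = [{"A"}, set(BASES), {"C"}]          # r = 1 * 4 * 1 = 4 sequences
explicit = enumerate_product_set(subsets)
print(dist_to_product_set("GUA", subsets))    # positions 1 and 3 mismatch
print(dist_to_finite_set("GUA", explicit))    # same value by brute force
```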

2.1.4. Fitness

We define the “tent-pole” fitness $f_{k}$ of sequence $x \in E$ relative to the high-fidelity sequence set $R$ for curvature parameter $k \in \mathbb{R}_{\geq 0}$ as

$$f_{k}(x, R) = \exp[-k H(x, R)] \in (0, 1] \quad \text{for } x \in E. \tag{7}$$

The maxima are attained at the sequences of the high-fidelity sequence set $R$, which are the “points” or “poles” of the surface, with exponential decay into the remainder of the space in string distance. The strength of the decay is governed by the curvature parameter $k$, which can be specified through the value of fitness at $H(x, R) = n$ (sequence dimension),

$$k = -\frac{\log(f_{k}(x, R))}{n}. \tag{8}$$

Appendix B describes other fitness functions.
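Equations (7) and (8) can be checked numerically with a small Python sketch. The function names and the example values ($n = 20$, endpoint fitness $10^{-3}$) are illustrative assumptions, not values taken from the article.

```python
import math


def curvature_from_endpoint(fitness_at_n: float, n: int) -> float:
    """Equation (8): curvature k that gives the stated fitness at H(x, R) = n."""
    return -math.log(fitness_at_n) / n


def tent_pole_fitness(distance: int, k: float) -> float:
    """Equation (7): f_k = exp(-k * H(x, R)) for a given distance H(x, R)."""
    return math.exp(-k * distance)


n = 20                                   # sequence dimension (illustrative)
k = curvature_from_endpoint(1e-3, n)     # fitness 1e-3 at maximal distance
for d in (0, 1, n // 2, n):
    print(d, tent_pole_fitness(d, k))
```

Specifying $k$ through the endpoint fitness makes landscapes of different sequence dimension comparable, since the fitness at the farthest point of the space is held fixed.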

2.1.5. Similarity

We define two cases for the similarity function appearing in the reaction rate of template-directed polymerization. The first case is the trivial (constant) case, where

$$s_{b}(x, y) = b \in (0, 1] \quad \text{for } (x, y) \in E \times E.$$

This assumes that there is no mechanism by which sequence specificity is selected for, such as in the case that polymerases evolve to replicate sequences generically well, including their own. This means that they will spend much of their time replicating other sequences.

For the second case of a non-trivial similarity function, we note that RNA origins of life are thought to be a spatially localized stochastic transition, where high-fidelity replicators are found concentrated in foci, following from the increased replicative mass of the replicators. Hence, we implicitly encode spatial information through a non-trivial similarity function, based on a distance function, which increases the replicative system mass for similar (and here nearby) high-fidelity replicators; that is, if sequences are similar, then they are likely proximal. In the following definition, we use the notation $\land$ for the minimum of two numbers, $x \land y = \min\{x, y\}$. Distance $S$ is defined between two sequences $x, y \in E$ as

$$S(x, y) = h(x, y) \land h(x, y^{c}) \land h(x^{c}, y) \land h(x^{c}, y^{c}) \in \{0, 1, \dots, n\} \quad \text{for } x, y \in E. \tag{9}$$

Similarity $s_{k}$ of sequences $x, y \in E$ for curvature parameter $k \in \mathbb{R}_{\geq 0}$ is defined in terms of exponential decay as

$$s_{k}(x, y) = \exp[-k S(x, y)] \in (0, 1] \quad \text{for } x, y \in E. \tag{10}$$

Presently, replicators can replicate other sequences well but not their own [34]. There may exist RdRps that are excellent polymerases and, in conjunction with RNA hammerhead ribozyme, engage in rolling circle amplification of the polymerase-hammerhead sequence (genome) so that the amplification process is self-cleaving. This results in a large increase in replicative mass due to the super-exponential growth in the population of the high-fidelity replicators within a small volume. In the context of SSA, the similarity function here is an ansatz for spatial locality.
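The distance $S$ of (9) and the similarity $s_{k}$ of (10) can be sketched in Python as follows; the function names are hypothetical. Because the complement acts position by position, $h(x^{c}, y^{c}) = h(x, y)$ and $h(x^{c}, y) = h(x, y^{c})$, so the four-way minimum effectively compares $x$ with $y$ and with $y^{c}$. The sketch nevertheless evaluates all four terms to mirror the definition.

```python
import math

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def complement(x: str) -> str:
    """Position-wise complement x^c."""
    return "".join(COMPLEMENT[b] for b in x)


def hamming(x: str, y: str) -> int:
    """Hamming distance h(x, y) between equal-length sequences."""
    return sum(a != b for a, b in zip(x, y))


def similarity_distance(x: str, y: str) -> int:
    """Equation (9): minimum Hamming distance over strands and complements."""
    xc, yc = complement(x), complement(y)
    return min(hamming(x, y), hamming(x, yc), hamming(xc, y), hamming(xc, yc))


def similarity(x: str, y: str, k: float) -> float:
    """Equation (10): s_k(x, y) = exp(-k * S(x, y))."""
    return math.exp(-k * similarity_distance(x, y))


x = "AUGC"
print(similarity_distance(x, complement(x)))   # a strand and its complement
print(similarity("AAAA", "AAGG", k=0.5))
```

Under this definition a sequence and its complement are at distance zero, so a polymerase treats its own complementary strand as maximally similar.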

2.1.6. Fidelity

Finally, polymerization fidelity probability for curvature $k \in \mathbb{R}_{\geq 0}$ is defined as

$$p_{k}(x, R) = f_{k}(x, R) \in (0, 1] \quad \text{for } x \in E. \tag{11}$$

Note that $f_{k}(x, R) = p_{k}(x, R) = 1$ for high-fidelity sequences $x \in R$.

Note that fitness, similarity, and fidelity are defined for single-stranded sequences $(E, \mathcal{E})$.

2.1.7. Counting Representation

The process $X = (X_{t})_{t \in \mathbb{R}_{\geq 0}}$ is the time-evolution of the system. Recall that $X_{t}$ is a multiset. $X_{t}$ contains the individual single-stranded molecules in the set $\bar{E}$, i.e., sequences $\{x\} \in \bar{E}$, and double-stranded molecules $\{x, x^{c}\} \in \bar{F}$, with overall set $\bar{G} = \bar{E} \cup \bar{F}$ having size $m = |\bar{G}| = 3|\bar{E}|/2$. Note that, in the set representation, there is symmetry $\{x, x^{c}\} = \{x^{c}, x\}$, so that the size of the double-stranded set is equal to $|\bar{F}| = |\bar{E}|/2$. The system evolution $X_{t}$ induces a random counting measure $N_{t}$ on the overall space of single- and double-stranded sequences $(\bar{G}, \bar{\mathcal{G}})$ as

$$N_{t}(A) = \sum_{x \in X_{t}} \mathbb{I}_{A}(x) \quad \text{for } A \in \bar{\mathcal{G}}. \tag{12}$$

The total count, that is, the total number of molecules, is $K_{t} \equiv |X_{t}| = N_{t}(\bar{G})$. We assume that the counter $N_{t}$ is maintained for all times $t \in \mathbb{R}_{\geq 0}$.

2.1.8. Reaction Rates

The total reaction rate is given by the sum of the individual reaction rates

$$\breve{k}(t) = \breve{k}_{ds}(t) + \breve{k}_{ss}(t) + \breve{k}_{rep}(t) \quad \text{for } t \in \mathbb{R}_{\geq 0}$$

where the reaction rate of double-strand formation from complementary single-strands is given by

$$\breve{k}_{ds}(t) = \frac{1}{2} \sum_{\{x\} \in \bar{E}} k_{ds} N_{t}(\{x\}) N_{t}(\{x^{c}\}),$$

the reaction rate of dissociation of double-strands into single-strands is given by

$$\breve{k}_{ss}(t) = \sum_{\{x, x^{c}\} \in \bar{F}} k_{ss} N_{t}(\{x, x^{c}\}),$$

and the reaction rate for template-directed polymerization of a single-strand by another single-strand (the polymerase) is given by

$$\breve{k}_{rep}(t) = \sum_{(\{x\}, \{y\}) \in \bar{E}^{2}} k_{rep}(x, y) N_{t}(\{x\}) (N_{t}(\{y\}) - \mathbb{I}(x = y)).$$

The first reaction rate is a sum over the single-strands of $\bar{E}$ with size $4^{n}$. The second is a sum over double-strands $\bar{F}$ with size $4^{n}/2$. The third is a sum over the product space of single-stranded sequences, $\bar{E}^{2}$, with $4^{2n} = 16^{n}$ elements. Therefore, $\breve{k}(t)$ requires $4^{n}(\frac{3}{2} + 4^{n})$ elements to be evaluated for every reaction. Clearly, direct representation on the full space is very expensive and impractical for even modest $n$. One obvious way to improve efficiency is to avoid summing over the zero elements. We define sets

$$X_{1} = \{x \in \text{set}(X_{t}) : |x| = 1\} \tag{13}$$

and

$$X_{2} = \{x \in \text{set}(X_{t}) : |x| = 2\} \tag{14}$$

as the unique single- and double-stranded sequences of the system. Then, direct calculations of the reaction rates are

$$\breve{k}_{ds}(t) = \frac{1}{2} \sum_{\{x\} \in X_{1}} k_{ds} N_{t}(\{x\}) N_{t}(\{x^{c}\})$$

and

$$\breve{k}_{ss}(t) = \sum_{\{x, x^{c}\} \in X_{2}} k_{ss} N_{t}(\{x, x^{c}\})$$

and

$$\breve{k}_{rep}(t) = \sum_{(\{x\}, \{y\}) \in X_{1}^{2}} k_{rep}(x, y) N_{t}(\{x\}) (N_{t}(\{y\}) - \mathbb{I}(x = y)).$$

For this approach, the replication rate has quadratic dependence on $|X_{1}|$. Using the reaction rates, the system may be exactly simulated using SSA. The reaction at time $t$ with rate $\breve{k}(t)$ occurs after a waiting time $\triangle t$ that is exponentially distributed with mean $1/\breve{k}(t)$, i.e., $\triangle t \sim \text{Exponential}(1/\breve{k}(t))$ in the mean parameterization. Because the total reaction rate $\breve{k}(t)$ increases with the number of molecules $K_{t} = N_{t}(\bar{G})$, the waiting time $\triangle t$ tends to decrease as the population grows. The natural consequence of increasing process intensity is that the system speeds up.

The quadratic dependence may still be too expensive for large simulations. Appendix E describes Monte Carlo approximation of the reaction rates.
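The direct calculation over the unique sequences $X_{1}$ and $X_{2}$, followed by one SSA step, might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the counting measure $N_{t}$ is stored as two `collections.Counter` objects (hypothetical layout), duplexes are keyed by `frozenset({x, x_c})`, and `k_rep` is passed in as a callable.

```python
import math
import random
from collections import Counter

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def complement(x):
    """Position-wise complement x^c."""
    return "".join(COMPLEMENT[b] for b in x)


def total_rates(singles, doubles, k_ds, k_ss, k_rep):
    """Reaction rates summed over unique sequences only.

    singles: Counter mapping sequence x -> N_t({x})
    doubles: Counter mapping frozenset({x, x^c}) -> N_t({x, x^c})
    k_rep:   callable (x, y) -> polymerization rate constant
    """
    # Factor 1/2: each complementary pair is visited from both strands.
    rate_ds = 0.5 * sum(k_ds * n_x * singles[complement(x)]
                        for x, n_x in singles.items())
    rate_ss = k_ss * sum(doubles.values())
    # Quadratic in the number of unique single strands |X_1|.
    rate_rep = sum(k_rep(x, y) * n_x * (n_y - (x == y))
                   for x, n_x in singles.items()
                   for y, n_y in singles.items())
    return rate_ds, rate_ss, rate_rep


def next_reaction(rates, rng):
    """Sample the waiting time and the reaction channel for one SSA step."""
    total = sum(rates)
    if total <= 0:
        return math.inf, None              # no reaction can fire
    dt = rng.expovariate(total)            # exponential with mean 1 / total
    channel = rng.choices(("ds", "ss", "rep"), weights=rates)[0]
    return dt, channel


singles = Counter({"AU": 2, "UA": 1})
doubles = Counter({frozenset({"GC", "CG"}): 3})
rates = total_rates(singles, doubles, k_ds=1.0, k_ss=2.0,
                    k_rep=lambda x, y: 0.5)
print(rates)
print(next_reaction(rates, random.Random(0)))
```

Looking up a missing key in a `Counter` returns zero without inserting it, so iterating over `singles` while reading the complement count is safe. Choosing the specific reactants within a channel, and updating the counters, would follow the same weighted-sampling pattern.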

2.2. Hitting Times

We define some hitting times. The initial population $X_{0}$ consists of $I$ single-stranded sequences, i.e., $|X_{0}| = I$. We define the hitting time $τ_{rep}$ of the first replication event,

$$τ_{rep} = \inf\{t \in \mathbb{R}_{\geq 0} : K_{t} > I\}. \tag{15}$$

We define the hitting time $τ_{R}$ for the appearance of sequences in the high-fidelity sequence set $R$,

$$τ_{R} = \inf\{t \in \mathbb{R}_{\geq 0} : \mathbb{I}_{\{1, 2, \dots\}}(|R \cap X_{t}|) = 1\}. \tag{16}$$

Put $X_{R} = R \cup \{\{x, x^{c}\} : x \in R\}$ and define the volume fraction of high-fidelity sequences $R$ at time $t$ as

$$V(t) = \frac{N_{t}(X_{R})}{K_{t}}. \tag{17}$$

We define the hitting time $τ_{\min}$ where high-fidelity sequences of $R$ emerge and reach a minimum volume fraction,

$$τ_{\min} = \inf\{t \in \mathbb{R}_{\geq 0} : t \geq τ_{R}, V(t) \leq V(s) \text{ for } τ_{R} \leq s \leq t\}. \tag{18}$$

This hitting time reflects the period wherein a high-fidelity replicator has been identified yet there exists no complementary high-fidelity sequence for amplification; hence, the system diversity continues to increase, decreasing the concentrations of all extant sequences as more sequences are discovered. The minimum hitting time captures the duration of time the high-fidelity replicator exists by itself. We define the hitting time $τ_{v}$ for the time at which high-fidelity sequences in $R$ constitute some volume fraction $v \in (0, 1]$ of the population,

$$τ_{v} = \inf\{t \in \mathbb{R}_{\geq 0} : V(t) \geq v\}. \tag{19}$$

In practice for simulations, $τ_{v}$ is censored based on some total number of reactions, that is, if the volume fraction is not achieved within the maximum number of simulated reactions, $τ_{v} = \infty$ because there is no arrival time.
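Given a simulated trajectory, the censored hitting time $τ_{v}$ can be read off as a first-passage time. The sketch below assumes the trajectory is recorded as two parallel lists, the reaction times and the volume fraction $V(t)$ after each reaction; this layout and the function name are hypothetical.

```python
import math


def first_passage_time(times, fractions, v):
    """Return the first recorded time with V(t) >= v, or infinity if the
    threshold is never reached before the simulation stops (censoring)."""
    for t, frac in zip(times, fractions):
        if frac >= v:
            return t
    return math.inf


times = [0.0, 0.5, 1.2, 1.3]
fractions = [0.0, 0.1, 0.6, 0.7]
print(first_passage_time(times, fractions, v=0.5))   # threshold reached
print(first_passage_time(times, fractions, v=0.9))   # censored
```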

For SSA, we specify a maximum number of reactions $N$ to simulate. We have parameters $θ \in Θ$ for $τ$, such as landscape curvature $k$, sequence dimension $n$, etc. Therefore, $τ(θ)$ is right-censored with value $\infty$ at simulation time $a$, as some simulations will stop at time $a$ with no arrival time. These are censoring events. For fixed $θ$, $τ(θ)$ is a random variable, due to the stochastic nature of SSA. Hence, for each parameter vector $θ$, we attain a set of $M$ realizations of hitting time $τ$ as

$$T(θ) = \{τ_{i}(θ) : i = 1, \dots, M\}. \tag{20}$$

For convenience, we assume that the realizations $T(θ)$ are ordered by non-censored followed by censored.

2.2.1. Functional Structure

For each parameter vector $θ$, we record two values: the number of hitting events in the hitting time set $T$,

$$g(θ) = |\{x \in T(θ) : x < \infty\}| \in \{0, 1, \dots, M\} \tag{21}$$

and the average of the hitting time $τ$,

$$f(θ) = \begin{cases} \frac{1}{g(θ)} \sum_{i=1}^{g(θ)} τ_{i}(θ) & \text{if } g(θ) > 0 \\ 0 & \text{if } g(θ) = 0 \end{cases} \in \mathbb{R}_{\geq 0} \tag{22}$$
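The two summaries (21) and (22) can be computed from a list of realizations in which censored runs are stored as infinity. This is a minimal sketch; the function name `hitting_summary` is hypothetical.

```python
import math


def hitting_summary(realizations):
    """Return (g, f): the number of finite hitting times and their average.

    Censored realizations are represented by math.inf and are excluded
    from the average; f is defined as 0 when there are no hits.
    """
    hits = [t for t in realizations if math.isfinite(t)]
    g = len(hits)
    f = sum(hits) / g if g > 0 else 0.0
    return g, f


print(hitting_summary([1.0, 3.0, math.inf]))    # two hits, one censored
print(hitting_summary([math.inf, math.inf]))    # no hits
```

Filtering on finiteness rather than on position means the realizations do not need to be pre-sorted into non-censored followed by censored.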

To describe the functional structure of the average hitting time $f(θ)$, we require a classifier which determines whether or not there are zero hittings, $g(θ) = 0$, and a regressor for the value of $f(θ)$ for hittings $g(θ) > 0$. We assume that the parameters $θ = (θ_{1}, \dots, θ_{n})$ are randomly sampled according to distribution $ν = \prod_{i} ν_{i}$ and the hitting times recorded. High dimensional model representation (HDMR) may be attained for the classifier (as a probabilistic discriminative model) and the regressor of $f(θ)$. For the regressor, we have the HDMR expansion

$$f(θ_{1}, \dots, θ_{n}) = f_{0} + \sum_{i} f_{i}(θ_{i}) + \sum_{i < j} f_{ij}(θ_{i}, θ_{j}) + \dots + f_{1 \dots n}(θ_{1}, \dots, θ_{n}).$$

The HDMR component functions $\{f_{u}\}$ convey a global sensitivity analysis, where, defining variance term

$$σ_{u}^{2} = \mathrm{Var}\, f_{u} = \int_{Θ} f_{u}^{2}(θ_{u}) \, ν(dθ),$$

we have a decomposition of variance

$$σ_{f}^{2} = \mathrm{Var}\, f = \sum_{u \subseteq \{1, \dots, n\} : |u| > 0} σ_{u}^{2}.$$

The normalized terms $S_{u} = σ_{u}^{2} / σ_{f}^{2}$ are called sensitivity indices. Appendix F gives a brief description of global sensitivity analysis via HDMR.

2.2.2. Statistical Structure

A second analysis can be conducted on the hitting times $T(θ)$ for the parameter vector $θ$ using reliability theory. Put random hitting set $T(Θ) \equiv \{T(θ) : θ \in Θ\}$, where $Θ = \{θ_{i}\}$ is an independency of parameter values. For each parameter vector $θ \in Θ$, we partition the hitting times $T(θ)$ into $C$ censored values with censor times $C(θ) = \{a_{i}(θ)\}$ and $M - C$ non-censored (hitting) values $N(θ) = \{x \in T(θ) : x < \infty\}$. The likelihood is given by

$$L(T(Θ) \mid ϑ) = \prod_{θ \in Θ} \prod_{x \in N(θ)} f(x \mid ϑ) \prod_{x \in C(θ)} R(x \mid ϑ),$$

where $f$ is the hitting time probability density function (‘failure density’), $R$ is the censoring time distribution (‘reliability distribution’), and $ϑ$ are the parameters of the density and distribution functions. Note that $f$ and $R$ determine each other, so $ϑ$ are the common parameters. Reliability definitions are given in Appendix G.

We show reliability quantities in Table 1 for the two-parameter $(α, β) \in (0, \infty)^{2}$ Weibull$(α, β)$ distribution and Cox proportional hazards model, where $γ$ is a vector of coefficients for the $θ$. We use the Python software lifelines for estimation of $ϑ$ for the Weibull–Cox model from data [53]. The mean failure time $E τ(θ)$ is given by

$$E τ(θ) = \int_{0}^{\infty} t f(t, θ \mid α, β, γ) \, dt = β e^{-γ \cdot θ / α} Γ(1 + \frac{1}{α}).$$

We have second moment

$$\int_{0}^{\infty} t^{2} f(t, θ \mid α, β, γ) \, dt = β^{2} e^{-2 γ \cdot θ / α} Γ(1 + \frac{2}{α}),$$

giving variance

$$\mathrm{Var}\, τ(θ) = β^{2} e^{-2 γ \cdot θ / α} (Γ(1 + \frac{2}{α}) - Γ^{2}(1 + \frac{1}{α})).$$

Thus, if $γ < 0$, then $E τ(θ)$ and $\mathrm{Var}\, τ(θ)$ exponentially increase in $θ$.
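The closed-form mean and variance above are straightforward to evaluate. The sketch below uses only the Python standard library and does not call lifelines; the function name and the example parameter values are illustrative assumptions, and the exponent is read as $-(γ \cdot θ)/α$.

```python
import math


def weibull_cox_moments(alpha, beta, gamma, theta):
    """Mean and variance of the hitting time under the Weibull-Cox model.

    alpha: Weibull shape, beta: Weibull scale,
    gamma: coefficient vector, theta: parameter (covariate) vector.
    """
    linear = sum(g * t for g, t in zip(gamma, theta))   # gamma . theta
    scale = beta * math.exp(-linear / alpha)            # effective scale
    g1 = math.gamma(1.0 + 1.0 / alpha)
    g2 = math.gamma(1.0 + 2.0 / alpha)
    mean = scale * g1
    variance = scale ** 2 * (g2 - g1 ** 2)
    return mean, variance


# With alpha = 1 and gamma = 0 the model reduces to an exponential
# distribution with mean beta and variance beta ** 2.
print(weibull_cox_moments(1.0, 2.0, [0.0], [1.0]))

# A negative coefficient makes the mean grow exponentially in theta.
for th in (0.0, 1.0, 2.0):
    print(th, weibull_cox_moments(1.5, 2.0, [-1.0], [th])[0])
```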

2.3. Surface Chemistries

The system given by reactions (2)–(4), with polymerization with mutation, requires two separate hitting events, one sequence in the high-fidelity set, $x \in R$, and either another sequence in the high-fidelity set, $x \in R$, or its complement $x^{c} \in E$, in order for high-fidelity replicators to maximally engage in template-directed polymerization and achieve some fraction of the population. This setup of RNA polymerase action, requiring two such events for the polymerase and template, makes the hitting times long. Basically, the same information must be discovered twice before it can be used, which is unsatisfactory. We idealize polymerase activity conveyed by a non-RNA species, here clay, with the parameter $k_{clay-p}$, as clay itself is not thought to be capable of polymerization but is capable of oligomerization of RNA. The reactions are given by

$$\emptyset \overset{k_{clay-o}}{\to} x \tag{23}$$

$$x \overset{k_{clay-p}}{\to} x + x_{*}^{c} \tag{24}$$

where non-RNA polymerization has mutation with fidelity probability $p \in (0, 1]$ and $x_{*}^{c}$ is the complement of $x$ with mutation. The reaction rates are given by

$$\breve{k}_{clay-o}(t) = k_{clay-o}$$

and

$$\breve{k}_{clay-p}(t) = k_{clay-p} N_{t}(X_{1}).$$

Therefore, upon the first hitting of the high-fidelity replicators $R$ with sequence $x \in R$ through (4) or (23), $x$ gives two high-similarity single-stranded sequences $x$ and $x_{*}^{c}$ through (24), which then may participate in template-directed RNA polymerization (4).

2.4. Reactions as Measure-Kernel-Functions

All the reactions

x \mapsto y

which involve substrate x may be represented using transition kernels, which form linear operators. At each iteration of SSA, a reaction type is chosen, followed by a transition to a particular domain

(X, X)

with distribution

ν_{t}

, followed by mapping into a codomain

(Y, Y)

using transition probability kernel Q with distribution

μ_{t} = ν_{t} Q

. The notions of

ν_{t}

and Q involve measure-kernel-functions. The probability of transition of y into

B \in Y

given x is given by Q

Q (x, B) = \int_{B} Q (x, d y) = P (y \in B : x)

Appendix C recalls some facts about Q.

We define kernels Q for RNA and non-RNA polymerization to provide insight into the reactions. Consider

X_{t}

for some

t \in R_{\geq 0}

. Recall that

N_{t}

is the random counting measure of

X_{t}

on single and double-stranded sequences

(E \cup F, 2^{E \cup F})

. For RNA and non-RNA polymerization, we take

ν_{t}

as a (random) probability measure for x in domain

(X, X)

and describe a transition probability kernel Q from x into y in codomain

(Y, Y)

.

2.4.1. RNA Polymerization

For RNA polymerization, we have that

{x}, {y} \mapsto {x}, {y}, {y_{*}^{c}}

which results in the creation of the single-stranded sequence

{y_{*}^{c}}

. The first dimension is the polymerase and the second is the template. We give a definition of the probability measure on the product space of sequences. Recall that

{(E \otimes E)}_{\geq 0}

denotes the collection of non-negative

E \otimes E

-measurable functions.

Definition 1

(Measure on domain

ν_{t}

). Let

ν_{t}

be a random probability measure on

(E \times E, E \otimes E)

formed by random counting measure

N_{t}

(12)

ν_{t} {x, y} = \frac{k_{r e p} (x, y) N_{t} ({x}) (N_{t} ({y}) - I_{} (x = y))}{{\overset{˘}{k}}_{r e p} (t)} for (x, y) \in E \times E

with

ν_{t} (f) = \sum_{(x, y) \in E \times E} ν_{t} {x, y} f \circ (x, y) for f \in {(E \otimes E)}_{\geq 0}

(25)

We write

ν_{t} (A) = ν_{t} I_{A}

for

A \in E \otimes E

.

Because the first two coordinates are preserved under the mapping, we focus on the new dimension as a transition from

(E \times E, E \otimes E)

into

(E, E)

using transition probability kernel Q. In this case, Q is defined by a

16^{n} \times 4^{n}

matrix whose row vectors (dimension

4^{n}

) are probability vectors. The structure of Q follows from the polymerase replication with mutation, whereby each nucleotide position has fidelity probability

p : E \mapsto (0, 1]

, which depends on the first dimension of

E \times E

. We put

p_{x} = p (x)

for sequence

x \in E

. Now, we state a simple fact on the binomial structure of the number of mutations made by a polymerase.

Theorem 1

(Mutation distribution). The number of mutations by polymerase

x \in E

on template

y \in E

is distributed

h (y^{c}, y_{*}^{c}) \sim Binomial (n, 1 - p_{x}) for p_{x} \in (0, 1)

with mean

n (1 - p_{x})

and variance

n p_{x} (1 - p_{x})

and

h (y^{c}, y_{*}^{c}) \sim Dirac (0) for p_{x} = 1 .

Now, we partition E into level sets

(H_{i} (y))

by Hamming distance to the template complement

y^{c}

,

H_{i} (y) = {x \in E : h (y^{c}, x) = i} for i \in {0, \dots, n} .

(26)

We define the transition kernel Q for RNA polymerization, where Q completely encodes RNA polymerization using Theorem 1.

Corollary 1

(Transition probability kernel Q). We have that the transition probability kernel Q for RNA polymerization is defined by

Q ((x, y), H_{i} (y)) = (\binom{n}{i}) {(1 - p_{x})}^{i} p_{x}^{n - i} for (x, y) \in E \times E, i \in {0, \dots, n}, p_{x} \in (0, 1)

and

Q ((x, y), {z}) = \frac{1}{| H_{i} (y) |} (\binom{n}{i}) {(1 - p_{x})}^{i} p_{x}^{n - i} for (x, y) \in E \times E, i \in {0, \dots, n}, z \in H_{i} (y), p_{x} \in (0, 1)

and

Q ((x, y), {y^{c}}) = 1 for (x, y) \in E \times E, p_{x} = 1 .

RNA polymerization is defined by Q using the binomial structure of polymerase mutation. A more sophisticated model could be defined as a sum of Bernoulli random variables with varying success probabilities in the Poisson binomial distribution. This could be used to take into account polymerase mutation that varies with nucleotide position. Another idea is taking into account schemata such as repeats which destabilize the polymerase [54].
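One row of the kernel in Corollary 1 can be written out explicitly for short sequences. The sketch below is illustrative only: it assumes a position-wise complement (no strand reversal) over the alphabet {A, C, G, U}, and uses the count |H_i (y)| = \binom{n}{i} 3^{i}, since each of the i mismatched positions can take any of three wrong bases. The function names are hypothetical.

```python
import itertools
import math

ALPHABET = "ACGU"
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}

def complement(seq):
    """Position-wise complement (an assumption; strand reversal is ignored)."""
    return "".join(COMPLEMENT[b] for b in seq)

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(1 for u, v in zip(a, b) if u != v)

def kernel_row(template, p):
    """Q((x, y), {z}) for every z in E, where p = p_x is the fidelity of polymerase x."""
    n = len(template)
    target = complement(template)
    row = {}
    for letters in itertools.product(ALPHABET, repeat=n):
        z = "".join(letters)
        i = hamming(target, z)
        level_size = math.comb(n, i) * 3 ** i                       # |H_i(y)|
        level_mass = math.comb(n, i) * (1 - p) ** i * p ** (n - i)  # Q((x, y), H_i(y))
        row[z] = level_mass / level_size
    return row

if __name__ == "__main__":
    row = kernel_row("ACG", 0.9)
    print(sum(row.values()), row[complement("ACG")])
```

Because Python evaluates `0.0 ** 0` as `1.0`, the same function also covers the degenerate case p_x = 1, where all mass sits on the exact complement.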

Proposition 1

(Measure on codomain

μ_{t}

).

μ_{t} = ν_{t} Q

is a probability measure on

(E, E)

defined by

μ_{t} (f) = \int_{E \times E} ν_{t} (d x, d y) \int_{E} Q ((x, y), d z) f (z) for f \in E_{\geq 0}

(27)

It is multiplication of

ν_{t}

as a

16^{n}

dimension row vector with

16^{n} \times 4^{n}

dimension matrix Q, giving a

4^{n}

dimension row vector

ν_{t} Q

. We write

μ_{t} (A) = μ_{t} I_{A}

for

A \in E

.

Define the partition

(H_{i})

of E as

H_{i} \equiv {x \in E : min {H (x, R), H (x^{c}, R)} = i} for i \in {0, \dots, n} .

(28)

Then,

μ_{t} (H_{i})

for

i \in {0, \dots, n}

is the distribution on sequences by distance to R, i.e.,

μ_{t} (H_{i}) = \sum_{x \in E} μ_{t} {x} I_{H_{i}} (x) for i \in {0, \dots, n}

contains the instantaneous information of RNA polymerization.
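The chain from counts to ν_{t}, then to μ_{t} = ν_{t} Q, then to the level masses μ_{t} (H_{i}) can be sketched for a toy population. Everything numerical here is a hypothetical assumption for illustration: the population, the set R, an exponential profile for fitness and fidelity in Hamming distance, a rate k_{rep} that depends only on the polymerase, and a position-wise complement.

```python
import itertools
from collections import Counter, defaultdict

ALPHABET = "ACGU"
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}

def complement(seq):
    # Position-wise complement; strand reversal is ignored (assumption).
    return "".join(COMPLEMENT[b] for b in seq)

def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))

def dist_to_set(x, R):
    return min(hamming(x, r) for r in R)

def domain_measure(counts, k_rep):
    """nu_t{x, y} of Definition 1: pair weights normalized by the overall rate."""
    weights = {}
    for x, y in itertools.product(counts, repeat=2):
        w = k_rep(x, y) * counts[x] * (counts[y] - (x == y))
        if w > 0:
            weights[(x, y)] = w
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}

def codomain_measure(nu_t, fidelity, n):
    """mu_t = nu_t Q: law of the newly synthesized strand z."""
    mu_t = defaultdict(float)
    for (x, y), mass in nu_t.items():
        p, target = fidelity(x), complement(y)
        for letters in itertools.product(ALPHABET, repeat=n):
            z = "".join(letters)
            i = hamming(target, z)
            mu_t[z] += mass * ((1 - p) / 3) ** i * p ** (n - i)
    return dict(mu_t)

def level_masses(mu_t, R, n):
    """mu_t(H_i) for the partition (28) by distance to R."""
    masses = [0.0] * (n + 1)
    for z, mass in mu_t.items():
        i = min(dist_to_set(z, R), dist_to_set(complement(z), R))
        masses[i] += mass
    return masses

# Hypothetical toy setting; all values are illustrative assumptions.
n, R = 2, {"AC"}
counts = Counter({"AC": 1, "UG": 1, "AA": 2})

def fitness(x):
    return 0.01 ** (dist_to_set(x, R) / n)   # assumed exponential profile

def fidelity(x):
    return 0.25 ** (dist_to_set(x, R) / n)   # one on R, 0.25 at maximal distance

nu_t = domain_measure(counts, lambda x, y: fitness(x))
mu_t = codomain_measure(nu_t, fidelity, n)
print(level_masses(mu_t, R, n))
```

The dictionary representation avoids storing the full 16^{n} \times 4^{n} matrix, which is only practical for very small n.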

A more general model for replication is where polymerase activity is tied to geometry, i.e., compartmentalization/spatial confinement, with state space

(C, C)

and to metabolic state

(M, M)

. In this telling, the polymerase reaction rate could be tied to the degree of spatial confinement and the source of the activated nucleotides from metabolic precursors. Then, the polymerase state-space is

(C \times M \times E \times E, C \otimes M \otimes E \otimes E)

with law

ν_{t}

and the polymerase transition kernel

Q_{c m e}

is defined as the mapping from

(C \times M \times E \times E, C \otimes M \otimes E \otimes E)

into

(E, E)

. Thus, the law on the input–output state-space

(C \times M \times E \times E \times E, C \otimes M \otimes E \otimes E \otimes E)

is given by

μ_{t} = ν_{t} \times Q_{c m e}

, or in differential notation,

μ_{t} (d c, d m, d x, d y, d z) = ν_{t} (d c, d m, d x, d y) Q_{c m e} ((c, m, x, y), d z)

2.4.2. Non-RNA Polymerization

If there exists some kind of non-RNA polymerase activity, we have the mapping

{x} \mapsto {x}, {x_{*}^{c}}

which we regard as a mapping from

(E, E)

into

(E, E)

. Let

ν_{t}

be a probability measure on

(E, E)

defined by

ν_{t} {x} = \frac{N_{t} ({x})}{N_{t} (E)} for x \in E

Similar to RNA polymerization, for fidelity probability

p \in (0, 1]

, we have Q as the

4^{n} \times 4^{n}

matrix defined by

Q (x, H_{i} (x)) = (\binom{n}{i}) {(1 - p)}^{i} p^{n - i} for x \in E, i \in {0, \dots, n}, p \in (0, 1)

and

Q (x, {z}) = \frac{1}{| H_{i} (x) |} (\binom{n}{i}) {(1 - p)}^{i} p^{n - i} for x \in E, i = {0, \dots, n}, z \in H_{i} (x), p \in (0, 1)

and

Q (x, {x^{c}}) = 1 for x \in E, p = 1 .

μ_{t} = ν_{t} Q

is a probability measure on

(E, E)

defined by

μ_{t} (f) = \int_{E} ν_{t} (d x) \int_{E} Q (x, d y) f (y) for f \in E_{\geq 0}

Note that, for SSA, Q is fixed over the simulation, whereas the probability measure

ν_{t}

depends on time. That is, the reactions are chosen according to the reaction rates, and the reactions each use respective Q. The

ν_{t}

is formed using a random counting measure, so

ν_{t}

is random. This approach generalizes in the obvious way to all the reactions.
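A single non-RNA polymerization event can be sampled in two steps: draw the template x from ν_{t}, then draw the mutated complement from Q (x, ·). The sketch below is a hypothetical illustration, assuming a position-wise complement and a uniform choice among the three wrong bases at a mutated site; the function names and the toy population are not from the original software.

```python
import random
from collections import Counter

ALPHABET = "ACGU"
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}

def sample_template(counts, rng):
    """Draw x with probability nu_t{x} = N_t({x}) / N_t(E)."""
    seqs = list(counts)
    return rng.choices(seqs, weights=[counts[s] for s in seqs], k=1)[0]

def copy_with_mutation(template, p, rng):
    """Return a mutated complement: each site is correct with probability p,
    otherwise one of the three other bases is chosen uniformly (assumption)."""
    out = []
    for base in template:
        correct = COMPLEMENT[base]
        if rng.random() < p:
            out.append(correct)
        else:
            out.append(rng.choice([b for b in ALPHABET if b != correct]))
    return "".join(out)

if __name__ == "__main__":
    rng = random.Random(0)
    population = Counter({"ACG": 3, "UUU": 1})   # hypothetical counts
    x = sample_template(population, rng)
    print(x, copy_with_mutation(x, 0.9, rng))
```

Sampling site by site reproduces the binomial structure of the number of mutations without building the 4^{n} \times 4^{n} matrix.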

2.5. Decay

The RNA sequences have finite lifetimes in reality. This comes from a variety of sources, including radiation, pH, intrinsic molecular stability, etc. We assume double-stranded RNA is stable, whereas single-stranded RNA is not. Therefore, we create a reaction for decay of single-stranded RNA into constitutive nucleotides

x \overset{k_{⌀}}{\to} ⌀

(29)

with reaction rate

{\overset{˘}{k}}_{⌀} (t) = k_{⌀} N_{t} (X_{1})
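The interplay of the clay reactions (23) and (24) with decay (29) can be sketched with a stochastic simulation algorithm that tracks only the number of single strands. This is a deliberate simplification and an assumption of the sketch: sequence identity, RNA polymerization, and double-strand reactions are ignored, and the function name and arguments are hypothetical.

```python
import random

def ssa_single_strand_count(n0, k_clay_o, k_clay_p, k_decay, max_reactions, rng):
    """Gillespie-type simulation of the single-strand count under (23), (24), (29).

    Rates: oligomerization k_clay_o (zero order), clay polymerization
    k_clay_p * N and decay k_decay * N (both first order). Requires k_clay_o > 0.
    """
    t, n = 0.0, n0
    path = [(t, n)]
    for _ in range(max_reactions):
        gain = k_clay_o + k_clay_p * n      # reactions that add one strand
        loss = k_decay * n                  # reaction that removes one strand
        total = gain + loss
        t += rng.expovariate(total)         # exponential waiting time
        u = rng.random() * total
        if n > 0 and u >= gain:
            n -= 1
        else:
            n += 1
        path.append((t, n))
    return path

if __name__ == "__main__":
    path = ssa_single_strand_count(10, 1.0, 0.5, 0.5, 200, random.Random(0))
    print(path[-1])
```

When the decay rate exceeds the clay polymerization rate, the count fluctuates around a finite level sustained by oligomerization; otherwise it tends to grow.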

2.6. Compartmentalization

It is thought that compartmentalization plays a role in RNA origins of life, giving foci of reproducing sequences [23,55]. This is somewhat anticipated by the similarity function

s : E \times E \mapsto (0, 1]

, where sequences are more likely to copy similar sequences than less similar ones, due to an underlying spatial localization. Explicit spatial effects may be modeled by assuming each

x \in E

is marked with a position on a bounded subset of the real line

([- T, T], B_{[- T, T]}) \subset (R, B_{R})

. We think of this as a one-dimensional projection of the three-dimensional system. Additional species can be introduced, such as lipids, with reactions forming a vesicle M (vesiculation), which encloses some

A = [r, s] \subset [- T, T]

. We assume the lipids interact with the single-stranded sequences in A to form vesicles as

x \overset{k_{m i c}}{\to} M (A)

(30)

with reaction rate

{\overset{˘}{k}}_{m i c} (t) = k_{m i c} N_{t} (X_{1}) .

Hence, vesiculation is coupled to the population of sequences by design so that it evolves on roughly the same time-scale as sequence activities. Note that vesicles can enclose one another, i.e.,

M (A)

and

M (B)

where

A \subset B

or

B \subset A

, but cannot cross, i.e., for all vesicles at locations

A, \dots, B

we have that

A \cap B \in {A, B, ⌀}

. For example, suppose one vesicle

A = [0, 1]

encloses another two,

B = [\frac{1}{3}, \frac{1}{2}]

and

C = [\frac{2}{3}, \frac{3}{4}]

. Then,

A \ (B \cup C) = [0, \frac{1}{3}) \cup (\frac{1}{2}, \frac{2}{3}) \cup (\frac{3}{4}, 1]

. Although

A \ (B \cup C)

is disconnected in one dimension, the intervals are physically connected in three dimensions, where vesicles are spheres. We identify each vesicle to a union of disjoint intervals, disjoint across the vesicles.
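The nesting rule and the worked example can be checked with exact rational arithmetic. The sketch below is illustrative: vesicles are represented as pairs of endpoints of closed intervals, the helper names are hypothetical, and the open or closed status of gap endpoints is not tracked.

```python
from fractions import Fraction

def is_laminar(intervals):
    """True if every pair of closed intervals is nested or disjoint (no crossing)."""
    for i, (a, b) in enumerate(intervals):
        for c, d in intervals[i + 1:]:
            nested = (a <= c and d <= b) or (c <= a and b <= d)
            disjoint = b < c or d < a
            if not (nested or disjoint):
                return False
    return True

def exclusive_region(outer, inner):
    """Endpoint pairs of the parts of `outer` not covered by the disjoint
    enclosed intervals in `inner`."""
    gaps, left = [], outer[0]
    for c, d in sorted(inner):
        if left < c:
            gaps.append((left, c))
        left = d
    if left < outer[1]:
        gaps.append((left, outer[1]))
    return gaps

A = (Fraction(0), Fraction(1))
B = (Fraction(1, 3), Fraction(1, 2))
C = (Fraction(2, 3), Fraction(3, 4))

if __name__ == "__main__":
    print(is_laminar([A, B, C]))
    print(exclusive_region(A, [B, C]))
```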

We posit that compartmentalization precedes the hitting of a high-fidelity replicator through ensuring necessary concentration of RNA and a stable environment. Generally, we identify compartmentalization state to

(C, C)

with probability measure

ν

. Let

Q_{c}

be a transition probability kernel from

(C, C)

into

(E, E)

, encoding the transition from compartmentalization coordinates to RNA sequences. The product space

(C \times E, C \otimes E)

has law

μ = ν \times Q_{c}

. Upon hitting a high-fidelity replicator and achieving Darwinian evolution to acquire information, e.g., genes, the sequences are assumed to become adapted to compartmentalization coordinates

(C, C)

through the transition probability kernel

Q_{c}^{'}

from

(C \times E, C \otimes E)

into

(C, C)

, so that

μ = ν \times Q_{c} \times Q_{c}^{'}

is the law on the full space

(C \times E \times C, C \otimes E \otimes C)

. In this telling, compartmentalization precedes RNA activity, and, upon hitting high-fidelity replicators that can maintain their information, is followed by genomic adaptation.

2.7. Metabolism

We identify metabolism reaction-state to the measurable space

(M, M)

with probability measure

ν

. Let

Q_{m}

be a transition kernel from

(M, M)

into

(E, E)

, positing that metabolism precedes replication. For example, certain metabolic states may be precursors to the synthesis of RNA. Consider the product space

(M \times E, M \otimes E)

with measure

μ = ν \times Q_{m}

. Now we suppose that, upon achieving Darwinian evolution in replicators, the replicators will eventually become adapted to

(M, M)

. Hence, we interpret

(M, M)

as a mark-space of

(M \times E, M \otimes E)

, representing genomic adaptation. Let

Q_{m}^{'}

be a transition kernel from

(M \times E, M \otimes E)

into

(M, M)

. Then,

μ = ν \times Q_{m} \times Q_{m}^{'}

is a probability measure on

(M \times E \times M, M \otimes E \otimes M)

, where

μ (f) = \int_{M} ν (d x) \int_{E} Q_{m} (x, d y) \int_{M} Q_{m}^{'} ((x, y), d z) f (x, y, z) for f \in {(M \otimes E \otimes M)}_{\geq 0}

or

μ (d x, d y, d z) = ν (d x) Q_{m} (x, d y) Q_{m}^{'} ((x, y), d z) .

Therefore, metabolism-first followed by replication and genomic adaptation is encoded by the structures of

Q_{m}

and

Q_{m}^{'}

. We do not specify these transition kernels in this article but mention that they are richly textured.
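Although the kernels are left unspecified, the way they compose can be sketched on finite state spaces, where the integrals reduce to sums. The states, probabilities, and function names below are purely hypothetical and serve only to illustrate the product structure ν \times Q_{m} \times Q_{m}^{'}.

```python
def compose(nu, q_m, q_m_prime):
    """Joint law mu{(x, y, z)} = nu{x} Q_m(x, {y}) Q'_m((x, y), {z})."""
    mu = {}
    for x, px in nu.items():
        for y, pxy in q_m[x].items():
            for z, pz in q_m_prime[(x, y)].items():
                mu[(x, y, z)] = px * pxy * pz
    return mu

def integrate(mu, f):
    """mu(f) for a non-negative function f of (x, y, z)."""
    return sum(mass * f(x, y, z) for (x, y, z), mass in mu.items())

# Hypothetical metabolic states and sequences (illustrative values only).
nu = {"poor": 0.5, "rich": 0.5}
q_m = {
    "poor": {"AC": 0.25, "GU": 0.75},
    "rich": {"AC": 0.5, "GU": 0.5},
}
q_m_prime = {
    ("poor", "AC"): {"poor": 0.25, "rich": 0.75},
    ("rich", "AC"): {"poor": 0.25, "rich": 0.75},
    ("poor", "GU"): {"poor": 1.0},
    ("rich", "GU"): {"rich": 1.0},
}

mu = compose(nu, q_m, q_m_prime)
print(integrate(mu, lambda x, y, z: 1.0 if z == "rich" else 0.0))
```

In this toy setting the sequence "AC" shifts the adapted metabolic state toward "rich", whereas "GU" leaves the metabolic state unchanged, which is one way the dependence of Q_{m}^{'} on both coordinates can be read.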

2.8. Reaction Overview

The reactions of the system having decay and clay and their reaction orders are shown in Table 2. There is one zero-order reaction, three first-order reactions, and two second-order reactions. Additionally, we show reactions and orders for compartmentalization and metabolism.

3. Results

Consider some initial population of I random sequences

X_{0}

. The population over time is given by

X_{t}

with associated random counting measure

N_{t}

on

(\bar{G}, 2^{\bar{G}})

. Recall parameters

θ = (n, q, k, l, m, p, k_{⌀}, k_{s s}, k_{d s}, k_{r e p}, k_{c l a y - o}, k_{c l a y - p})

for sequence dimension n, high-fidelity sequence set size q, fitness degree k, similarity degree l, fidelity degree m, clay fidelity probability p, RNA decay rate

k_{⌀}

, double-strand dissociation rate

k_{s s}

, double-strand formation rate

k_{d s}

, RNA replication rate

k_{r e p}

, and clay oligomerization and polymerization rates

k_{c l a y - o}

and

k_{c l a y - p}

. These parameters are summarized in Table 3.

The following is the description of how the parameter values were specified and to what they biologically correspond. The sequence dimension n is chosen from

{3, 4, 5}

. The fitness and similarity functions are chosen by setting the curvature parameters k and l so that the function values range from one (inside the high-fidelity manifold) down to some small terminal value i, taken over an exponential grid. For example, when

i = 0.1

, sequences that are maximally dissimilar have 10% of the fitness of the high-fidelity sequences. We range the grid from

0.1

to

0.001

for fitness and similarity. The RNA fidelity parameter

m = 0.25

is chosen such that the high-fidelity sequences have value one and the lowest fidelity sequences have value 0.25, equal to random chance. The clay fidelity parameter is set to an optimistically high value of

0.9

for clay studies. The double-strand dissociation and formation rates

k_{s s}

and

k_{d s}

are set to unity as a baseline. In comparison, the RNA replication rate is set to a large value, 10, whereby replication is the dominant reaction. The decay parameter

k_{⌀}

is set to some uniform random value in (0, 1). The clay RNA oligomerization rate is set to unit, and ‘clay’ RNA polymerization rate is set to a uniform random value in (0, 20).

Because these parameters govern the reaction rates, different parameter values confer different regimes for the system.
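The parameter specification can be sketched as a small helper. The conversion k = - log (i) / n from a terminal value i follows the settings used in the simulations; the exponential form exp(-k h) for the “tent” profile in Hamming distance h is an assumption of this sketch (it reproduces the stated end-point values), and the dictionary keys are hypothetical names.

```python
import math
import random

def curvature(terminal_value, n):
    """k = -log(i) / n, so that exp(-k * n) equals the terminal value i."""
    return -math.log(terminal_value) / n

def tent(h, k):
    """Assumed exponential profile in Hamming distance h: one at h = 0."""
    return math.exp(-k * h)

def draw_parameters(n, terminal_value, rng):
    """One parameter vector following the description in the text."""
    k = curvature(terminal_value, n)
    return {
        "n": n,
        "k": k,                              # fitness curvature
        "l": k,                              # similarity curvature
        "m": curvature(0.25, n),             # fidelity reaches 0.25 at distance n
        "p": 0.9,                            # clay fidelity probability
        "k_ss": 1.0,
        "k_ds": 1.0,
        "k_rep": 10.0,
        "k_decay": rng.uniform(0.0, 1.0),
        "k_clay_o": 1.0,
        "k_clay_p": rng.uniform(0.0, 20.0),
    }

if __name__ == "__main__":
    rng = random.Random(0)
    for i in (0.1, 0.05, 0.01, 0.005, 0.001):
        params = draw_parameters(3, i, rng)
        print(i, round(params["k"], 4), round(tent(3, params["k"]), 4))
```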

3.1. Stability: ODEs

We characterize the zeros of the vector field f from ODE system (A1) and use the eigenvalues of the Jacobian (A2) to determine their stability.

Theorem 2.

The ODE system (A1) for

R = {x}

,

x \in E

, has a single unstable fixed-point at

[x] = 1

and

[y] = 0

for

y \in G \ x

.

Proof.

Solving

f = 0

gives a single solution

[x] = 1

and

[y] = 0

for

y \in G \ x

. For this solution, the eigenvalues of the Jacobian include positive values and no zero values. Therefore, the solution is unstable. □

It follows from Theorem 2 that, for all other initial conditions, the system has no equilibria.

Corollary 2

(Unbounded). For all initial conditions

X_{0}

such that

I = | X_{0} | > 1

, the system is unbounded.

This confirms the obvious: the system, a replicating network with no death, is almost always an increasing system.

3.2. Simulation Reaction State

We are interested in the behavior of temporal probability measures,

ν_{t}

(25) on the sequence product space and

μ_{t} = ν_{t} Q

(27) on the sequence space, for RNA polymerization. These reveal the instantaneous information of the system. The structure of

μ_{t}

reveals the state of polymerization and is a leading indicator of the population concentrations over time.

3.2.1. Core Model with “Tent” Functions, Probable Hitting $P (τ_{v} (θ) < \infty) \sim 1$

Take sequence dimension

n = 3

and fitness and similarity curvature parameters

k = l = - log (0.01) / n

and fidelity parameter

m = - log (0.25) / n

. Set rates for double-strand dissociation and formation

k_{s s} = k_{d s} = 1

and polymerization rate

k_{r e p} = 10

and use the “tent” function for fitness, similarity, and fidelity. Take random initial population

X_{0}

with initial population size

I = | X_{0} | = 10

and random singleton

R = {{x}}

(

q = 0

. We simulate 5000 reactions, with the simulation censored at the hitting time

τ_{v}

for volume fraction

v = 0.25

. Take partition of the sequence space by Hamming distance to the high-fidelity manifold

(H_{i})

(28) of sequence space E. In Figure 1, we plot measures of a typical realization of the system

X_{t}

on the partition

(H_{i})

of sequence concentration (Figure 1a), growth (Figure 1b), and polymerase sequence output

μ_{t}

(Figure 1c). Some quantities are plotted on log-log scale, whereas others are plotted on a linear-log scale. These results show that the concentrations are relatively stable for most time, until the high-fidelity manifold is hit. Then, the concentration of high-fidelity replicators rapidly increases to exceed 25%. Similarly, Figure 1b shows the growth curves on a log-log scale, where the high-fidelity manifold rapidly increases near the end of the simulation. Figure 1c shows the structure of the RNA sequence polymerization output temporal probability measure

μ_{t}

. Low probability is assigned to polymerization of high-fidelity replicators for most of the reaction time, followed by a large increase near the end of the simulation, where high-fidelity replicators dominate with 56% probability. Therefore, the RNA sequence polymerization output temporal probability measure

μ_{t}

is a leading indicator of the concentration curve, i.e., at simulation end-time, concentration of high-fidelity replicators is 25% and polymerization output is 56%.

3.2.2. Core Model with “Tent” Functions, Improbable Hitting $P (τ_{v} (θ) < \infty) \sim 0$

We use the same configuration as Section 3.2.1 except for setting fitness and similarity curvature parameters to

k = l = - log (0.1) / n

. In Figure 2, we plot measures of a typical realization of the system

X_{t}

on sequence partition by Hamming distance to the high-fidelity manifold

(H_{i})

of concentration (Figure 2a), growth (Figure 2b), and

μ_{t}

(Figure 2c). The behavior has completely changed: the high-fidelity group ends the simulation with around 6% concentration, only steadily increasing, and never hits. The polymerase output

μ_{t}

shows 6%. This indicates that the concentration of high-fidelity replicators is unlikely to increase further, as the population is generally in equilibrium with the polymerase output.

3.2.3. Core Model with Linear Functions, Improbable Hitting $P (τ_{v} (θ) < \infty) \sim 0$

The same configurations for Section 3.2.1 are used, except the fitness, similarity, and fidelity functions are linear. Similar to the “tent” functions, we specify the terminus landscape curvature for fitness and similarity

k = l

. Then, the fitness function for RNA polymerization is given by

f_{k} (x, R) = 1 + (\frac{k - 1}{n}) H (x, R) for x \in E

and the similarity function by

s_{k} (x, y) = 1 + (\frac{k - 1}{n}) S (x, y) for x, y \in E

We put fitness and similarity landscape curvature

k = l = 0.01

and

v = 0.25

for hitting volume fraction of high-fidelity replicators. We simulate

X_{t}

for 5000 reactions. We find that the probability of hitting is near zero

P (τ_{0.25} (θ) < \infty) \sim 0

. In Figure A1, we plot measures of a typical realization of

X_{t}

on

(H_{i})

of concentration (Figure A1a), growth (Figure A1b), and

μ_{t}

(Figure A1c). The simulation ends with high-fidelity concentration of ∼5% and polymerase output of ∼4%. Therefore, the concentration of high-fidelity replicators will continue to decrease. Linear surfaces are not sufficient to achieve hitting times

τ_{v} (θ) < \infty

for high-fidelity replicator volume fraction

v = 0.25

, in contrast to the nonlinear “tent” functions.
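The difference between the two landscape shapes can be made concrete by tabulating both over Hamming distance for the same terminal value. The linear form is the one defined in this subsection; the exponential form terminal^{h / n} used for the “tent” profile is an assumption of this sketch.

```python
def linear_landscape(h, n, k):
    """f_k = 1 + ((k - 1) / n) * h for Hamming distance h in {0, ..., n}."""
    return 1.0 + (k - 1.0) / n * h

def exponential_landscape(h, n, terminal):
    """Assumed 'tent' profile with the same end points: terminal ** (h / n)."""
    return terminal ** (h / n)

if __name__ == "__main__":
    n, terminal = 3, 0.01
    for h in range(n + 1):
        print(h, linear_landscape(h, n, terminal), exponential_landscape(h, n, terminal))
```

Both profiles equal one on the high-fidelity manifold and 0.01 at maximal distance, but at intermediate distances the linear profile stays much closer to one, so near-miss sequences are penalized far less than under the exponential profile.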

3.2.4. Expanded Model with “Tent” Functions, Probable Hitting $P (τ_{v} (θ) < \infty) \sim 1$

We consider a similar model to the previous subsections and expand it with clay oligomerization rate (of RNA)

k_{c l a y - o}

, clay polymerization rate (of RNA)

k_{c l a y - p}

, and clay polymerization fidelity p. Therefore, the full set of variables is given by

θ = (n, k_{s s}, k_{d s}, k, k_{c l a y - o}, k_{c l a y - p}, p)

. The value of fitness/similarity landscape curvature

k, l

and clay RNA polymerization rate

k_{c l a y - p}

are set such that the replicative mass of each is initialized to 10. This means that RNA and clay polymerization have the same reaction mass at the beginning of the simulation. We set sequence dimension

n = 3

, fitness/similarity landscape curvature

k = l = - log (0.01) / n

, clay RNA polymerization fidelity

p = 0.9

, and double-strand dissociation and formation rates

k_{s s} = k_{d s} = 1

. This is a high hitting regime, i.e., the probability of hitting is close to one

P (τ_{v} (θ) < \infty) \sim 1

. In Figure A2, we plot measures of a typical realization of

X_{t}

on

(H_{i})

and additionally the probability of reactions over time. High-fidelity replicators ended the simulation with 25% concentration (Figure A2a) and RNA polymerase output ∼62% (Figure A2c), indicating that the concentration of high-fidelity replicators will continue to increase. All species exhibit superexponential growth (Figure A2b). Clay polymerization decreases in contribution over time, whereas RNA polymerization increases substantially over time, and RNA double-strand reactions are small and stable (Figure A2d).

3.3. Hitting Times: Functional and Survival Analysis

We study various models in order of increasing complexity. We examine the hitting time surface

τ_{v} (θ)

in the parameters

θ \in Θ

, including probability of hitting

P (τ_{v} (θ) < \infty)

. We begin with the core model with no decay or clay.

3.3.1. Core Model, $τ_{v} (θ)$ for $v = 0.1$ with $θ = (n, k)$ and “Tent” Functions

We are interested in the structure of the hitting time

τ_{v} (θ)

of (19) as a function of the parameter vector

θ

. We use the Weibull–Cox proportional hazards model of Table 1 for the hitting time

τ_{v}

for volume fraction

v = 0.1

. Let

θ = (n, k)

with sequence dimension

n \in {3, 4}

and fitness and similarity parameters

k = l = - log (i) / n

for

i \in {0.1, 0.05, 0.01, 0.005, 0.001}

. Set

m = - log (0.25) / n

for fidelity probability parameters. For each value of sequence dimension n, take random initial population

X_{0}

with initial population size

I = | X_{0} | = 10

and random singleton

R = {{x}}

for the high-fidelity manifold and fix these for the fitness/similarity landscape curvature parameters

k = l

. We fix the double-strand dissociation and formation rate parameters

k_{s s} = k_{d s} = 1

and set RNA polymerization rate

k_{r e p}

such that the overall RNA polymerization rate is given by

{\overset{˘}{k}}_{r e p} (0) = 10

and use the “tent” function for fitness, similarity, and fidelity. We take 10 realizations of hitting time

τ_{v} (θ)

for each parameter vector

θ \in Θ

and allocate 5000 reactions. This gives 100 independent hitting times and up to 500,000 reactions. The times are comparable because the system is initialized to the same replication mass.
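Initializing every run to the same replication mass amounts to rescaling the pairwise rate so that the overall rate at time zero equals the target. A minimal sketch follows, assuming the overall rate is the sum over ordered pairs that appears in the normalization of Definition 1; the `weight` function standing in for the unscaled pairwise rate and the toy counts are hypothetical.

```python
import itertools
from collections import Counter

def replicative_mass(counts, weight):
    """Sum over ordered pairs of weight(x, y) N(x) (N(y) - 1{x = y})."""
    return sum(
        weight(x, y) * counts[x] * (counts[y] - (x == y))
        for x, y in itertools.product(counts, repeat=2)
    )

def scale_for_target(counts, weight, target=10.0):
    """Constant c such that c * weight yields overall rate `target` at time zero."""
    return target / replicative_mass(counts, weight)

if __name__ == "__main__":
    counts = Counter({"ACG": 4, "UUU": 6})   # hypothetical initial population, I = 10
    flat = lambda x, y: 1.0                  # placeholder for the pairwise rate
    c = scale_for_target(counts, flat)
    print(replicative_mass(counts, flat), c)
```

With a flat weight and I = 10 sequences there are I (I - 1) = 90 ordered polymerase and template pairs, so the scale is 10 / 90.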

For the simulations, 66 hitting times are finite. The coefficients positively contribute to hitting, where

γ_{n} \approx 0.97

and

γ_{k} \approx 13.29

, both with p-values less than 0.005. Therefore, hittings are strongly positively influenced by the parameters of the fitness and similarity functions and less so by the dimension. Plots of the coefficients and survival and cumulative hazard curves are given in Figure A3. High survival is found for k large and low survival for k small. Cumulative hazard is highest for k small.

We estimate HDMR of the classifier (whether or not hitting time is finite) using all 100 samples. The results are shown in Figure A5 and Table 4. The HDMR explains roughly 80% of variance.

S_{k} \approx 0.69

and

S_{n} \approx 0.06

, so hitting probability is strongly influenced by fitness landscape curvature k and less so by sequence length n. The component functions

f_{k}

and

f_{n}

for fitness landscape curvature and sequence dimension are strictly decreasing, where larger fitness landscape curvature parameter k results in decreasing hitting probability. These results are consistent with the survival analysis.

We estimate HDMR of the regressor (hitting time) using the 66 simulations with finite hitting time. The results are shown in Figure A4 and Table 5. The HDMR explains roughly 60% of variance.

S_{n} \approx 0.57

and

S_{k} \approx 0.04

, so sequence dimension dominates the hitting time. Both HDMR component functions

f_{n}

and

f_{k}

for sequence dimension and fitness landscape curvature are increasing. The HDMR results reveal that conditioning on hitting reverses the roles of sequence dimension n and fitness landscape curvature k.

3.3.2. Clay and Decay Model, $τ_{v} (θ)$ for $v = 0.1$ with $θ = (n, k, k_{⌀}, k_{c l a y - p}, p)$ and “Tent” Functions

We expand the model to include clay and decay. We take parameter vector

\begin{matrix} θ & = (n, k, k_{⌀}, f_{c l a y}, p_{c l a y}) \in Θ \\ Θ & = {3, 4} \times {- log (i) / n : i = 0.1, 0.05, 0.01, 0.005, 0.001} \times (0, 1) \times (0, 1) \times (0, 1) \end{matrix}

with double-strand dissociation and formation and clay oligomerization reaction rates

k_{s s} = k_{d s} = k_{c l a y - o} = 1

. For each parameter vector

θ \in Θ

, (i) we choose singleton high-fidelity replicator manifold

R = {x}

for some RNA sequence

x \in E

and choose random initial population of RNA molecules

X_{0}

such that the initial population size is 10,

I = | X_{0} | = 10

, and where the initial population does not intersect the high-fidelity manifold

X_{0} \cap R = ⌀

, that is, the initial population does not reside on the high-fidelity manifold; (ii) we initialize the replicative mass of the system such that the initial overall RNA polymerization reaction rate is given by

{\overset{˘}{k}}_{r e p} (0) = (1 - f_{c l a y}) 20

and the initial overall clay RNA polymerization reaction rate

{\overset{˘}{k}}_{c l a y - p} (0) = f_{c l a y} 20

; (iii) we sample the hitting times

τ_{v} (θ)

for volume fraction of high-fidelity replicators

v = 0.10

a total of

M = 10

times, each censored by 5000 reactions, giving hitting time set

T (θ)

of (20). We obtain the input–output data set

D = {(θ_{i}, T (θ_{i})) : i = 1, \dots, 240}

. This gives a total of 2400 simulations.

For the simulations, 1546 hitting times are finite. The results of fitting the Weibull–Cox model are shown below in Table 6 and Figure A6. The curvature parameter k again significantly dominates with a large positive value. All the remaining parameters have values less than one. Sequence dimension n again is a relatively small positive contributor. The clay fraction

f_{c l a y}

is small and positive and replication fidelity parameter p is not significant.

We estimate HDMR of the classifier (whether or not hitting time is finite) using all 2400 samples. Component functions and sensitivity indices are shown below in Figure A7 and Table 7. First-order HDMR captures 74% of explained variance, and second-order captures 4%. Curvature dominates hitting probability with large sensitivity index

S_{k} \approx 67 %

. The HDMR component function in landscape curvature

f_{k}

is a decreasing function, where small values increase and large values decrease hitting probability. Sequence dimension n has sensitivity index

S_{n} \approx 2 %

, and the HDMR component function

f_{n}

is decreasing, where high dimension decreases the probability of hitting. Clay parameter sensitivity index is small

S_{f_{c l a y}} \approx 2 %

, and the HDMR component function for fractional clay RNA polymerization rate,

f_{f_{c l a y}}

, is decreasing, where low-to-medium clay fractions increase and high-clay fractions decrease probability of hitting. The HDMR results are consistent with the Weibull–Cox model.

We estimate HDMR of the regressor (hitting time) using the 1546 simulations with finite hitting time. Component functions and sensitivity indices are shown below in Figure A8 and Table 8. First-order HDMR captures 33% of explained variance, and second-order captures 7%. In stark contrast to the contributions to the classifier, the parameters k and n are insignificant to hitting time. Instead, the largest sensitivity index is

S_{f_{c l a y}} \approx 20 %

. The HDMR component function for fractional clay RNA polymerization rate,

f_{f_{c l a y}}

, is an increasing function, where small

f_{c l a y}

decreases and large

f_{c l a y}

increases the hitting time. This suggests that high clay-fractions representing first-order reactions increase the hitting time, as clay polymerization has less replicative mass than RNA polymerization, i.e., things go faster with RNA polymerization. The second largest sensitivity index is decay

S_{k_{⌀}} \approx 11 %

. The decay component function is increasing, with a sharp increase in hitting times near one, i.e., things go slower with large decay, resulting in increased hitting time.

3.4. Compartmentalization

Compartmentalization has a direct effect on the calculation of the reaction rates, specifically replication, by computing only a subset of the reactions in

X_{1}^{2}

. Put

X_{t} (A) = {x \in X_{t} : l (x) \in A} .

For vesicle region

A \in M

, we have that

\begin{matrix} {\overset{˘}{k}}_{r e p} (t, A) & = \sum_{(x, y) \in X_{1}^{2}} k_{r e p} (x, y) M_{t} ({x} \times A) (M_{t} ({y} \times A) - I (x = y)) \\ & = \sum_{(x, y) \in X_{t}^{2} (A)} k_{r e p} (x, y) M_{t} ({x} \times A) (M_{t} ({y} \times A) - I (x = y)) \end{matrix} for t \in R_{\geq 0}, A \subset [- T, T]

and total replicative mass

{\overset{˘}{k}}_{r e p} (t) = \sum_{A \in M} {\overset{˘}{k}}_{r e p} (t, A) for t \in R_{\geq 0}

As

M

increases in size over time, the number of partitions grows, and

\sum_{A \in M} | X_{t}^{2} (A) | ≪ | X_{1}^{2} | .

Therefore, the replicative mass will be reduced with

M

, and the system evolves less quickly. This suggests that there is a trade-off between the degree of compartmentalization and the replicative mass of the system.
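The reduction in candidate polymerase and template pairs can be illustrated by simple counting. The sketch below assumes unit rates and counts ordered pairs of distinct molecules, first in a single well-mixed region and then within each compartment; the population and compartment sizes are hypothetical.

```python
def ordered_pairs(m):
    """Number of ordered pairs of distinct molecules among m molecules."""
    return m * (m - 1)

def pairs_within(compartment_sizes):
    """Total ordered pairs when reactions are restricted to each compartment."""
    return sum(ordered_pairs(m) for m in compartment_sizes)

if __name__ == "__main__":
    print(ordered_pairs(100))        # one well-mixed region of 100 molecules
    print(pairs_within([25] * 4))    # four compartments of 25
    print(pairs_within([10] * 10))   # ten compartments of 10
```

For a fixed population, splitting into more and smaller compartments lowers the number of admissible pairs roughly in proportion to the number of compartments, which is the sense in which vesicle formation trades against replicative mass.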

4. Discussion and Conclusions

Origins of life is a fascinating problem. The wonderful complexity of extant life follows from origins. The distribution of life in the universe is tied to origins.

In this article, we have attempted to peek into the problem by concentrating on the RNA world hypothesis, studying hitting times of high-fidelity replicators. We develop fitness, similarity, and fidelity functions as landscapes for a mathematical model of replicating RNA molecules at the sequence level and observe hitting times through simulation studies. We draw attention to the distinction between the probability of hitting

P (τ (θ) < \infty)

and the hitting time

τ (θ) < \infty

.

In terms of mathematical set-up, we interpret the reactions as measure-kernel-functions. Each reaction is identified to and fully encoded by a probability transition kernel. The reactions take place in some domain, whereby all molecules may interact. We note that, in reality, molecules have limited diffusion, and this effectively breaks the reaction domain into independent subdomains above some length scale, i.e., molecules are more likely to react with their neighbors. Therefore, we assume our reaction volume is sufficiently small such that all molecules may participate in the reactions. We use for modeling purposes the ansatz that sequence distance is correlated to spatial proximity, where similar sequences are proximal, using a non-trivial similarity function

s : E \times E \mapsto (0, 1]

.

Theorem 2 and its Corollary 2 show that the system (without decay) is unbounded and strictly increases. This formally shows the system to be a growth process. Next, we illustrate findings about the core system with probable hitting (Section 3.2.1). In particular, we see that the temporal image measure

μ_{t} = ν_{t} Q

, which describes the polymerization output, is a leading indicator of high-fidelity sequence concentration. Polymerization output and high-fidelity replicators increase super-exponentially near the end of the simulation. Next, the fitness and sequence curvature parameters are set at a higher value which confers reduced fidelity (Section 3.2.2). This reveals that hitting is never achieved and that the polymerization output is in equilibrium with population composition. Hence, the probability of hitting is strongly influenced by landscape curvatures. Next, linear curvature is utilized for fitness and similarity and results in no hitting (Section 3.2.3). This reveals that nonlinear curvature is necessary to achieve hitting of high-fidelity replicators. Next, we expand the model with non-RNA (‘clay’) based polymerization and find that such activity decreases over time, in contrast to RNA polymerization, which greatly increases and dominates the other reactions in the system over time (Section 3.2.4).

For functional and survival analysis of the hitting times, we study the core model, whereby hitting times are strongly positively influenced by the fitness and similarity functions yet are not impacted significantly by sequence dimension (Section 3.3.1). In particular, survival analysis reveals that low fitness curvature confers low survival (high hitting), whereas high fitness curvature confers high survival (low hitting). HDMR analysis shows that hitting probability is strongly influenced by fitness curvature and much less so by sequence dimension, supporting the survival analysis. HDMR analysis of hitting time shows reversed roles for sequence dimension and landscape curvature, where sequence dimension dominates hitting time, with curvature playing a far less significant role. This gives the finding that hitting probability is driven by curvature, whereas hitting time is driven by sequence dimension. Next, we perform functional and survival analysis of the core model augmented with ‘clay and decay’ dynamics (Section 3.3.2). Survival analysis shows similar results to the core model, where curvature dominates survival (no hitting), with sequence dimension playing a significantly reduced role. HDMR analysis of hitting probability shows that curvature dominates hitting probability, similar to the core model, whereas sequence dimension again plays a significantly reduced role. HDMR analysis of hitting time reveals that the presence of ‘clay and decay’ significantly increases hitting time, with curvature and sequence dimension playing insignificant roles. These results are consistent in that clay polymerization has less replicative mass than RNA polymerization, where RNA polymerization is a faster reaction.

Overall, we find that nonlinear landscapes are necessary for hitting: linear landscapes are insufficient. For nonlinear landscapes, we find that the probability of hitting is dominated by curvature and that hitting times are dominated by sequence dimension. These results suggest that the landscapes in nature are nonlinear with high curvature, and that the hitting time for high-fidelity replicators is an increasing function of sequence dimension. When clay and decay are added to the model, hitting probability is again dominated by curvature, and clay and decay are relatively insignificant. This reflects that clay and decay are low-order reactions, although they do increase hitting times.

For replication and compartmentalization, we suggest that compartmentalization, while a necessary condition, slows overall system dynamics with increasing vesiculation rate. Essentially, as compartmentalization increases, there is a corresponding reduction in absolute replicative system mass, because certain reactions among elements are no longer possible once the elements are physically sequestered. While the timescale of a simulation is tied to the replicative mass of the system, replicative mass varies across compartments, and some compartments contain large genomic and metabolic populations. A distribution on compartmental ‘fitness’, such as resource concentrations, favors the search for the high-fidelity replicator, because the high-fitness compartments drive the replicative mass of the system. Compartmentalization is identified with the measure

ν

on coordinates in

(C, C)

, for which coordinates are “marked” by sequences through the transition probability kernel

Q_{c}

, and followed by genomic adaptation via the transition probability kernel

Q_{c}^{'}

.

Metabolism is thought to be identified with the production of precursors to RNA synthesis, leading to replication identification, followed by genomic adaptation to metabolic state. Metabolism is thus defined through the transition kernels

Q_{m}

and

Q_{m}^{'}

.

The independence of the transition kernels can be scrutinized, and it is possible that general transition kernels on the full product spaces across location, genomic, and metabolic states are necessary to satisfactorily explain RNA origins, i.e., all three functions may have co-evolved. This notion is suggested in the hot springs hypothesis for origins, where compartmentalization is hypothesized to furnish necessary conditions to genomic and metabolic state [26,27,28]. In this telling, the base measurable space of interest is

(F, F) = (C \times E \times M, C \otimes E \otimes M)

with measure

ν

. Then, identification of a high-fidelity replicator is described through the transition probability kernel

Q_{c e m}

from

(F, F)

into

(F, F)

in “marking” the base measurable space with state for genomic replication; finally, genomic adaptation is conveyed through the kernel

Q_{c e m}^{'}

from

(F \times F, F \otimes F)

into

(F, F)

. Hence, RNA origins of life has law

ν \times Q_{c e m} \times Q_{c e m}^{'}

on the product space

(F \times F \times F, F \otimes F \otimes F)

, reflecting the steps of replicator identification and adaptation through the definitions of the transition kernels

Q_{c e m}

and

Q_{c e m}^{'}

.

A putative “genesis machine” here is a mapping from the base (initial) measurable space

(F, F)

into the product space of identified high-fidelity replicators undergoing adaptation, i.e.,

(F \times F \times F, F \otimes F \otimes F)

. More generally the base space could additionally contain amino acid sequence space

(P, P)

. Such a machine is fully specified through the definitions of the distribution

ν

on the base space and the transition kernels

Q_{c e m p}

and

Q_{c e m p}^{'}

(pre and post genes, respectively). Because the stages of transition occur purely through random drift, an experiment performed by such a machine would take an unacceptably long period of time to complete. Experimental demonstration can be contemplated by augmenting the base measurable space with a control space

(X, X)

to accelerate dynamics, using for instance closed-loop shaped radiation to address molecular degrees of freedom in their appropriate frequency domains, (open-loop) catalysts, temperature, geometry, selection, concentration through centrifugal force, etc., resulting in the new (five-dimensional) base space

(\tilde{F}, \tilde{F}) = (C \times E \times M \times P \times X, C \otimes E \otimes M \otimes P \otimes X)

. Then, the transition kernels

{\tilde{Q}}_{c e m p}

and

{\tilde{Q}}_{c e m p}^{'}

become mappings from

(\tilde{F}, \tilde{F})

into

(\tilde{F}, \tilde{F})

and from

(\tilde{F} \times \tilde{F}, \tilde{F} \otimes \tilde{F})

into

(\tilde{F}, \tilde{F})

, respectively. The general design of a genesis machine is the definitions of the augmented base space

(\tilde{F}, \tilde{F})

, its distribution

\tilde{ν}

, and the augmented transition kernels

{\tilde{Q}}_{c e m p}

and

{\tilde{Q}}_{c e m p}^{'}

, giving law

\tilde{μ} = \tilde{ν} \times {\tilde{Q}}_{c e m p} \times {\tilde{Q}}_{c e m p}^{'}

on the full 15-dimensional product space

(\tilde{F} \times \tilde{F} \times \tilde{F}, \tilde{F} \otimes \tilde{F} \otimes \tilde{F})

, written in differential notation

\tilde{μ} (d x, d y, d z) = \tilde{ν} (d x) {\tilde{Q}}_{c e m p} (x, d y) {\tilde{Q}}_{c e m p}^{'} ((x, y), d z)

Origins could be experimentally demonstrated using a sequence of adaptive control fields in

(X, X)

, cycling through the transitions, and a detection system for online identification of the system whose elements belong to the product measurable space. The full design and estimated operating timescale for such a machine need further research to assess practical feasibility. We call the creation of primordial life (primordia) by the continuous causal efforts of a genesis machine, given initial prebiotic conditions, artebiogenesis, where arte- is Latin and means “from skill.” The primordia are not necessarily those that occurred in nature. Primordia and their genesis represent non-trivial system trajectories across the transition to the earliest life in sterile environments and belong to a manifold of primordial lifeforms, each having a characteristic geochemical setting.
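As a toy illustration of a three-stage law of the form ν × Q × Q′ discussed above, the following sketch draws a triple (x, y, z) by sampling the base state, then the identification kernel, then the adaptation kernel. The three state labels and all probabilities are invented for the example and carry no chemical meaning; the function names are hypothetical.

```python
import random
from itertools import product

# Hypothetical three-state base space; labels and masses are illustrative only.
STATES = ("low", "mid", "high")
nu = {"low": 0.7, "mid": 0.2, "high": 0.1}

# Q(x, {y}): kernel from the base space into itself (replicator identification).
Q = {
    "low": {"low": 0.8, "mid": 0.2},
    "mid": {"mid": 0.6, "high": 0.4},
    "high": {"high": 1.0},
}

# Q'((x, y), {z}): kernel from the pair space into the base space (adaptation).
Q_prime = {
    (x, y): ({"high": 1.0} if y == "high" else {"high": 0.5, y: 0.5})
    for x in STATES
    for y in STATES
}


def draw(dist, rng):
    """Sample one key of a finite distribution given as {outcome: mass}."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def sample_law(rng):
    """Draw (x, y, z) with x ~ nu, y ~ Q(x, .), z ~ Q'((x, y), .)."""
    x = draw(nu, rng)
    y = draw(Q[x], rng)
    z = draw(Q_prime[(x, y)], rng)
    return x, y, z


def joint_mass(x, y, z):
    """Mass of the law at (x, y, z): nu{x} Q(x, {y}) Q'((x, y), {z})."""
    return nu[x] * Q[x].get(y, 0.0) * Q_prime[(x, y)].get(z, 0.0)


total_mass = sum(joint_mass(x, y, z) for x, y, z in product(STATES, repeat=3))
triple = sample_law(random.Random(0))
```

The product of a probability measure with two transition probability kernels is again a probability measure, which is why `total_mass` sums to one up to rounding.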

The probability measure

μ_{t} = ν_{t} Q

has additional utility to integrate ‘test’ functions or queries about the system. If we let

f \in E_{\geq 0}

be a fitness function, then the fitness value

J (μ_{t}) = μ_{t} (f)

is the expected value of the fitness function with respect to the probability measure

μ_{t}

. In OptiEvo theory,

J (μ_{t})

is studied as a function of the population

X_{t}

on

(E, E)

[52]. OptiEvo assumes that the set of all probability measures

{μ_{t}}

is convex and that

X_{t}

has sufficient flexibility such that

J (μ_{t})

may be explored around

μ_{t}

. Then, OptiEvo predicts that

J (μ_{t})

has global maxima on

(E, E)

and that these form a connected level-set of sequences with the same fitness value. Both predictions are consistent with our model. The first prediction is consistent with zero distance in fitness and similarity functions for high-fidelity sequences. The second prediction is consistent with the high-fidelity set being a singleton or a product-space construction. A contention is whether

X_{t}

has sufficient flexibility in exploring

J (μ_{t})

around

μ_{t}

. This has direct bearing on the structure of Q: if

X_{t}

is inflexible, then Q is constrained to certain subspaces of

(E, E)

, i.e., not all transitions are possible.

In future work, the model could be extended to the space of sequences of lengths up to n,

(E^{*}, E^{*})

or even the space of sequences of all lengths, where distance and similarity functions would utilize a more general string distance metric, e.g., Levenshtein distance. We note that the size of

(E^{*}, E^{*})

is not much larger than

(E, E)

. Alternative similarity functions could be explored, such as the trivial case of constant similarity, e.g.,

s (x, y) = 1

for

(x, y) \in E \times E

. A limitation of this article is the restriction to short sequences, imposed by computational cost. The sequence lengths studied are low-dimensional and do not correspond to those of actual functional RNA molecules. Other parameter sets can be explored, for example, using experimentally derived values for reaction rates so that the timescales are calibrated. Future research could see the simulation software rewritten for a high-performance computing environment, enabling much longer sequences, e.g., of length 10–1000, to be studied. Polymerase fitness can be made empirical using known RdRp sequences as members of the high-fidelity manifold. Another area of future work could be studying the aforementioned transition kernels

Q_{c}

,

Q_{c}^{'}

,

Q_{m}

, and

Q_{m}^{'}

. More general models for the polymerization transition kernel, based on the structure of the Poisson-binomial distribution, could be employed. It would be interesting to study lipid-RNA and metabolism-RNA interactions and to equip the system with the ability to append nucleotides to its sequences to form functional genes, such as storing useful information for the replication channel, perhaps a Hammerhead ribozyme to convey rolling-circle amplification. We note that transition kernels here generally lack amino acid state and are pure-RNA. An area to explore is letting the transition kernel into the space of high-fidelity replicators depend on amino acid sequences, and then elaborating the system to contain a primitive translation system and examining various hitting times. Additional reactions can be introduced as operations on pairs of sequences, such as concatenation, and others for sequence splitting, and so on, with corresponding transition kernels enabling RNA networking and recombination dynamics.
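The variable-length extension mentioned above can be sketched in a few lines. The code below is a standard dynamic-programming Levenshtein distance with unit costs, together with a count of the sequence space of lengths 1 to n over a four-letter alphabet; excluding the empty sequence is an assumption made here, and none of these function names come from the published software.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def size_fixed_length(n, alphabet_size=4):
    """Number of sequences of length exactly n."""
    return alphabet_size ** n


def size_up_to_length(n, alphabet_size=4):
    """Number of sequences of length 1..n (empty sequence excluded by assumption)."""
    return sum(alphabet_size ** i for i in range(1, n + 1))


print(levenshtein("ACGU", "ACU"))
print(size_up_to_length(3) / size_fixed_length(3))
```

For a four-letter alphabet the geometric sum gives a space of lengths 1 to n with (4^(n+1) − 4)/3 elements, which is less than 4/3 times the size of the fixed-length space, in line with the remark that the larger space is not much larger.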

Author Contributions

All authors contributed to the study conception and design. Material preparation, software preparation, data generation, and analysis were performed by C.D.B. The first draft of the manuscript was written by C.D.B., and all authors commented on previous versions of the manuscript. Grant funding was provided through H.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Science Foundation Grant No. CHE-1763198.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Please see https://github.com/calebbastian/originoflife (accessed on 31 October 2021) for the Python software and example script of usage.

Acknowledgments

The authors thank the late Freeman Dyson for the discussions in 2018 of these ideas at the Institute for Advanced Study in Princeton, New Jersey, as well as anonymous reviewers who gave critical comments that substantially improved the quality of this article.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A. Discrete Probability Space

We define some concepts related to the space

(E, E)

. The discrete probability measure

ν

on

(E, E)

is defined by

ν (A) = ν I_{A} = \sum_{x \in E} ν {x} I_{A} (x) for A \in E

where

ν {x}

is the probability mass at the point

x \in E

and

I_{A} (x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{otherwise} \end{cases}

For the collection of non-negative

E

-measurable functions

E_{\geq 0}

, we have

ν (f) = \sum_{x \in E} ν {x} f (x) for f \in E_{\geq 0} .
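A minimal sketch of these two definitions on a finite sequence space follows. It assumes the four-letter RNA alphabet and a uniform measure chosen only for the example; the helper names are hypothetical.

```python
from itertools import product

ALPHABET = "ACGU"


def sequence_space(n):
    """All sequences of length n over the RNA alphabet."""
    return ["".join(s) for s in product(ALPHABET, repeat=n)]


def measure_of_set(nu, A):
    """nu(A) = sum over x of nu{x} * I_A(x)."""
    return sum(mass for x, mass in nu.items() if x in A)


def integrate(nu, f):
    """nu(f) = sum over x of nu{x} * f(x) for a non-negative function f."""
    return sum(mass * f(x) for x, mass in nu.items())


E = sequence_space(2)
nu = {x: 1.0 / len(E) for x in E}   # uniform measure, an illustrative choice
A = {x for x in E if x[0] == "A"}   # sequences starting with A

prob_A = measure_of_set(nu, A)
mean_G = integrate(nu, lambda x: x.count("G"))
```

Here `prob_A` is the probability that a uniformly drawn dinucleotide starts with A, and `mean_G` is the expected number of G bases.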

Appendix B. Other Fitness Functions

Another fitness function can be defined using polynomials, such as lines, quadratics, etc., in terms of

k \in N_{> 0}

f_{k} (x, R) = {(1 - \frac{H (x, R)}{n + 1})}^{k} \in (0, 1] for x \in E

or

f_{k} (x, R) = 1 - {(\frac{H (x, R)}{n + 1})}^{k} \in (0, 1] for x \in E

Yet another surface uses a sigmoid function. Put

E (x) = \frac{1}{1 + exp [- x]} for x \in R

We have fitness for

k \in (0, \infty)

f_{k} (x, R) = \frac{1 - E (\frac{H (x, R) - \frac{n}{2}}{k})}{1 - E (- \frac{n}{2 k})} \in (0, 1] for x \in E
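The three fitness surfaces above can be written down directly. In this sketch, H(x, R) is taken to be the minimum Hamming distance from x to any member of the high-fidelity set R, which is an assumption made here for illustration, and the function names are hypothetical.

```python
import math


def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y))


def dist_to_set(x, R):
    """Assumed form of H(x, R): minimum Hamming distance to a member of R."""
    return min(hamming(x, r) for r in R)


def poly_fitness_a(x, R, n, k):
    """f_k(x, R) = (1 - H/(n + 1))^k."""
    return (1 - dist_to_set(x, R) / (n + 1)) ** k


def poly_fitness_b(x, R, n, k):
    """f_k(x, R) = 1 - (H/(n + 1))^k."""
    return 1 - (dist_to_set(x, R) / (n + 1)) ** k


def logistic(u):
    """Sigmoid 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))


def sigmoid_fitness(x, R, n, k):
    """Sigmoid fitness, normalized so that it equals 1 at H = 0."""
    h = dist_to_set(x, R)
    return (1 - logistic((h - n / 2) / k)) / (1 - logistic(-n / (2 * k)))


n, R = 3, {"AAA"}
profile = [sigmoid_fitness(x, R, n, 1.0) for x in ("AAA", "AAU", "AUU", "UUU")]
```

Each surface equals one on the high-fidelity set and decreases with Hamming distance; `profile` lists the sigmoid fitness at distances 0 to 3 for k = 1.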

Appendix C. Measure-Kernel-Function

We recall a few facts about transition kernel Q. Q defines a function

Q f (x) = \int_{Y} Q (x, d y) f (y) for x \in X

that is in

X_{\geq 0}

for every function

f \in Y_{\geq 0}

.

For every probability measure

ν_{t}

on

(X, X)

and at time

t \geq 0

, the quantity

μ_{t} = ν_{t} Q

defines a probability measure on

(Y, Y)

as

μ_{t} (A) = \int_{X} ν_{t} (d x) Q (x, A) for A \in Y .

For every probability measure

ν_{t}

on

(X, X)

and function

f \in Y_{\geq 0}

, we have that

μ_{t} (f) = (ν_{t} Q) f = \int_{X} ν_{t} (d x) \int_{Y} Q (x, d y) f (y) .

Here, the spaces are discrete, i.e.,

ν_{t} (A) \equiv ν_{t} (I_{A}) = \sum_{x \in X} ν_{t} {x} I_{A} (x) for A \in X

where

ν_{t} {x}

is the probability mass at the point

x \in X

at time

t \geq 0

.
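On discrete spaces the measure-kernel-function identities above reduce to finite sums. The sketch below uses a made-up two-point input space and two-point output space to check that (νQ)f and ν(Qf) agree; the dictionaries and function names are illustrative assumptions.

```python
def push_forward(nu, Q):
    """mu = nu Q, with mu{y} = sum over x of nu{x} * Q(x, {y})."""
    mu = {}
    for x, mass in nu.items():
        for y, p in Q[x].items():
            mu[y] = mu.get(y, 0.0) + mass * p
    return mu


def kernel_apply(Q, f):
    """(Q f)(x) = sum over y of Q(x, {y}) * f(y)."""
    return {x: sum(p * f(y) for y, p in row.items()) for x, row in Q.items()}


def integrate(measure, f):
    """measure(f) = sum over points of mass * f(point)."""
    return sum(mass * f(x) for x, mass in measure.items())


# Illustrative input measure and kernel; each row of Q sums to one.
nu = {"a": 0.5, "b": 0.5}
Q = {"a": {"u": 0.9, "v": 0.1}, "b": {"u": 0.2, "v": 0.8}}


def f(y):
    """Indicator of the output point 'u'."""
    return 1.0 if y == "u" else 0.0


mu = push_forward(nu, Q)
lhs = integrate(mu, f)                 # (nu Q) f
Qf = kernel_apply(Q, f)
rhs = integrate(nu, lambda x: Qf[x])   # nu (Q f)
```

Both orders of evaluation give the same number, which is the discrete form of the double-integral identity.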

Appendix C.1. Reactions as Measure-Kernel-Functions

We index the reaction types on

(Z, Z) = (N_{> 0}, 2^{N_{> 0}})

. Let

η_{t}

be the probability measure on

(Z, Z)

formed from the normalized reaction rates. Let

Q_{*}

be the transition kernel from

(Z, Z)

into

(X, X)

. Then,

η_{t} Q_{*} = ν_{t}

is the distribution on

(X, X)

and

μ_{t} = ν_{t} Q

is the distribution on

(Y, Y)

. Then, a reaction is the mapping

(Z, Z) \mapsto (X, X) \mapsto (Y, Y)

.

Appendix C.2. Deterministic Model

Consider the core model defined in (2)–(4) with reaction rates

{\overset{˘}{k}}_{d s} (t)

,

{\overset{˘}{k}}_{s s} (t)

, and

{\overset{˘}{k}}_{r e p} (t)

. We are neglecting clay and decay for the moment. All the reactions impact

(E, E)

. For double-strand formation, the input space is

(X, X) = (E \times E, E \otimes E)

and the output space is

(Y, Y) = (F, F)

. For double-strand dissociation, the input and output spaces are swapped. For polymerization,

(X, X) = (E \times E, E \otimes E)

and

(Y, Y) = (E, E)

. Hence,

(E, E)

is positively impacted by polymerization, positively impacted by double-strand dissociation, and negatively impacted by double-strand formation. Put

[x] = N_{t} ({{x}})

and

[x, x^{c}] = N_{t} ({{x, x^{c}}})

. Recall the

ν_{t}

, Q, and

μ_{t} = ν_{t} Q

for the reactions, e.g.,

ν_{t}^{r e p}, ν_{t}^{d s}, Q^{r e p}

, etc.

For

x \in E

, we have the system of

m = \frac{3}{2} 4^{n}

deterministic nonlinear ordinary differential equations (ODEs) in m variables as mean-field equations

\begin{matrix} \frac{d [x]}{d t} & = f_{x} = {\overset{˘}{k}}_{r e p} (t) μ_{t}^{r e p} {{x}} + {\overset{˘}{k}}_{s s} (t) μ_{t}^{s s} {{x}, {x^{c}}} - {\overset{˘}{k}}_{d s} (t) μ_{t}^{d s} {{x, x^{c}}} \\ = {\overset{˘}{k}}_{r e p} (t) \sum_{(y, z) \in E \times E} ν_{t}^{r e p} {{y}, {z}} Q^{r e p} ((y, z), {{x}}) + k_{s s} [x, x^{c}] - k_{d s} [x] [x^{c}] \\ = \sum_{(y, z) \in E \times E} k_{r e p} (y, z) [y] ([z] - I_{} (y = z)) Q^{r e p} ((y, z), {{x}}) + k_{s s} [x, x^{c}] - k_{d s} [x] [x^{c}] \\ \frac{d [x, x^{c}]}{d t} & = f_{x x^{c}} = k_{d s} [x] [x^{c}] - k_{s s} [x, x^{c}] \end{matrix}

(A1)

The fixed points of f are the equilibria of the system, i.e.,

f (x) = 0

for

x \in R_{\geq 0}^{m}

. The Jacobian of the system is

J = [\begin{matrix} \frac{d f_{1}}{d x_{1}} & \dots & \frac{d f_{1}}{d x_{m}} \\ ⋮ & ⋱ & ⋮ \\ \frac{d f_{m}}{d x_{1}} & \dots & \frac{d f_{m}}{d x_{m}} \end{matrix}]

(A2)

The eigenvalues of the Jacobian reveal the stability of the fixed points. If all the eigenvalues of the Jacobian evaluated at the fixed point have negative real parts, then the fixed point is stable. If at least one of the eigenvalues has a positive real part, then the fixed point is unstable. If no eigenvalue has a positive real part but at least one has zero real part, then the linearization is inconclusive and the fixed point can be either stable or unstable.
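The eigenvalue test can be illustrated on a reduced two-variable version of the system. The sketch assumes that polymerization is switched off and that a strand and its complement have equal concentrations, so the state is (a, c) with a = [x] = [x^c] and c = [x, x^c]; these simplifications, the unit rate constants, and the function names are assumptions for the example and do not reproduce the full m-variable model.

```python
import cmath

K_DS, K_SS = 1.0, 1.0  # illustrative rate constants


def f(state):
    """Reduced mean-field right-hand side for (a, c)."""
    a, c = state
    flux = K_DS * a * a - K_SS * c  # net double-strand formation
    return (-flux, flux)


def jacobian(func, state, h=1e-6):
    """Central finite-difference Jacobian of func at state."""
    n = len(state)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        up, dn = list(state), list(state)
        up[j] += h
        dn[j] -= h
        fu, fd = func(up), func(dn)
        for i in range(n):
            J[i][j] = (fu[i] - fd[i]) / (2 * h)
    return J


def eig2(J):
    """Eigenvalues of a 2x2 matrix from its trace and determinant."""
    tr = J[0][0] + J[1][1]
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    disc = cmath.sqrt(tr * tr - 4 * det)
    return ((tr + disc) / 2, (tr - disc) / 2)


fixed_point = (1.0, 1.0)  # K_DS * a^2 equals K_SS * c here
eigs = eig2(jacobian(f, fixed_point))
real_parts = sorted(e.real for e in eigs)
```

In this reduced system one eigenvalue is negative and the other is zero. The zero eigenvalue reflects the conserved total a + c, so the linearization alone does not settle stability along that direction.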

Appendix D. Hitting Cardinality

We index the N reactions of

X_{t}

with arrival times

{T_{i}}

in

({\bar{R}}_{\geq 0}, B_{{\bar{R}}_{\geq 0}})

, where

{\bar{R}}_{\geq 0} = R_{\geq 0} \cup {\infty}

. Define the hitting reaction

ϖ (θ) \in N

in terms of hitting time

τ (θ) \in {\bar{R}}_{\geq 0}

as

ϖ (θ) = \sum_{i = 1}^{N} I_{[0, τ (θ)]} (T_{i})

ϖ (θ)

is right-censored at N reactions. If

τ (θ) = \infty

, then

ϖ (θ) = N

. If

τ (θ) < \infty

, then

ϖ (θ) < N

.
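The definition amounts to counting the arrival times that fall at or before the hitting time. A minimal sketch, with made-up arrival times and a hypothetical function name:

```python
import math


def hitting_reaction(arrival_times, tau):
    """Number of reactions whose arrival time lies in [0, tau]."""
    return sum(1 for t in arrival_times if 0 <= t <= tau)


# Illustrative arrival times for N = 5 simulated reactions.
arrivals = [0.3, 0.7, 1.2, 2.5, 4.0]

hit_at_third = hitting_reaction(arrivals, 1.2)    # hitting at the third reaction
never_hit = hitting_reaction(arrivals, math.inf)  # censored run: count equals N
```

When the hitting time is infinite every arrival is counted, so the count is right-censored at N.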

Reaction Cardinality

In this section, we study, instead of the hitting time

τ_{v} (θ)

, the hitting reaction number

ϖ_{v} (θ)

for

v = 0.1

. We study the core model with parameter vector

θ = (n, k)

. We uniformly sample sequence length

n \in {3, 4, 5}

and fitness/similarity landscape curvature

k = l \in {0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1}

and form input–output data

D = {(θ_{i}, ϖ (θ_{i})) : i = 1, \dots, 1200}

. We set double-strand dissociation and formation rates

k_{s s} = k_{d s} = 1

and initialize the overall RNA polymerization rate

k_{r e p}

such that the initial replicative mass is 10. For each parameter vector

θ \in Θ

, we randomly choose the initial population of sequences

X_{0}

with initial population size

I = | X_{0} | = 10

and singleton high-fidelity sequence manifold

R

such that the initial population of sequences does not intersect the high-fidelity replicator manifold,

X_{0} \cap R = ⌀

.

D

contains 781 hitting events. We denote these

D^{*}

. We form a first-order HDMR on

ϖ (θ)

using

D^{*}

. The HDMR truth-plot, component functions, and sensitivity indices are shown below in Figure A9. First-order HDMR captures approximately 31% of variance. Both component functions

f_{n}

and

f_{k}

have similar sizes, with

f_{n}

somewhat larger than

f_{k}

. The HDMR component function for sequence dimension

f_{n}

is essentially an increasing linear function of sequence length n, and component function for landscape curvature

f_{k}

is generally an increasing function of the fitness/similarity landscape curvature. These results suggest that sequence dimension and curvature influence the hitting reaction. Larger sequences and flatter curvature increase the hitting reaction.

Table A1. HDMR sensitivity indices of

ϖ_{v} (θ) < \infty

for core model and

v = 0.1

.

$θ$	$S_{θ}$
Sequence dimension n	0.1619
Curvature k	0.1464
∑	0.3083

Appendix E. Approximate Reaction Rates

One approach to reducing the computational complexity of

\overset{˘}{k} (t)

is to approximate the sums using Monte Carlo. Define random variables

X_{1} \sim Uniform (X_{1})

and

X_{2} \sim Uniform (X_{2})

. Let

{X_{1 i}}

and

{X_{2 i}}

be independent copies of such random variables. Given N random samples of

X_{1}

, the first reaction rate becomes

{\overset{˘}{k}}_{d s} (t | N) = \frac{| X_{1} |}{2 N} \sum_{i = 1}^{N} k_{d s} N_{t} (X_{1 i}) N_{t} (X_{1 i}^{c})

whose expected value is approximated using M realizations,

E {\overset{˘}{k}}_{d s} (t | M, N) ≃ \frac{| X_{1} |}{2 M N} \sum_{j = 1}^{M} \sum_{i = 1}^{N} k_{d s} N_{t} (X_{1 i j}) N_{t} (X_{1 i j}^{c})

requiring a total of

M N

evaluations. In a similar manner, the second reaction rate is

E {\overset{˘}{k}}_{s s} (t | M, N) ≃ \frac{| X_{2} |}{M N} \sum_{j = 1}^{M} \sum_{i = 1}^{N} k_{s s} N_{t} (X_{2 i j} \cup X_{2 i j}^{c})

and, putting

X_{1}^{*} \sim Uniform (X_{1})

, we have the third reaction rate

E {\overset{˘}{k}}_{r e p} (t | M, N) ≃ \frac{| X_{1} |^{2}}{M N} \sum_{j = 1}^{M} \sum_{i = 1}^{N} k_{r e p} (X_{1 i j}, X_{1 i j}^{*}) N_{t} (X_{1 i j}) (N_{t} (X_{1 i j}^{*}) - I_{} (X_{1 i j} = X_{1 i j}^{*}))

(A3)

We refer to SSA simulation with Monte Carlo approximate reaction rates as Monte Carlo Approximate SSA, or MCASSA.
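The Monte Carlo idea can be checked on a toy population for the double-strand formation rate. The sketch makes several assumptions that are not taken from the published software: the exact rate is half the sum of k_ds N_t(x) N_t(x^c) over the sequences present (inferred from the form of the estimator), the complement is base-wise without strand reversal, and the counts are invented.

```python
import random

COMP = str.maketrans("ACGU", "UGCA")


def complement(x):
    """Base-wise complement (no reversal; an illustrative simplification)."""
    return x.translate(COMP)


def exact_ds_rate(counts, k_ds):
    """Half the sum over present sequences of k_ds * N(x) * N(x^c)."""
    return 0.5 * sum(
        k_ds * n * counts.get(complement(x), 0) for x, n in counts.items()
    )


def mc_ds_rate(counts, k_ds, n_samples, rng):
    """Monte Carlo estimate: |X_1| / (2N) times the sum over uniform samples."""
    support = sorted(counts)
    total = 0.0
    for _ in range(n_samples):
        x = rng.choice(support)
        total += k_ds * counts[x] * counts.get(complement(x), 0)
    return len(support) * total / (2 * n_samples)


# Invented population counts N_t(x) for five sequences.
counts = {"AAC": 4, "UUG": 3, "GGA": 2, "CCU": 5, "ACG": 1}

exact = exact_ds_rate(counts, k_ds=1.0)
estimate = mc_ds_rate(counts, k_ds=1.0, n_samples=20000, rng=random.Random(1))
```

The estimator is unbiased for the exact sum, and its error shrinks like one over the square root of the number of samples; the saving only matters when the population support is far larger than the sample size.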

Appendix F. High Dimensional Model Representation

Suppose we have a real-valued square-integrable function

f (x) \in L^{2} (E, E, ν)

with

E = R^{n}

,

E = B_{R^{n}}

,

ν = \prod_{i} ν_{i}

, and

x = (x_{1}, \dots, x_{n}) \in E

. The

{ν_{i}}

may be diffuse (continuous) and/or discrete. Put

B = {1, \dots, n}

. We would like to decompose f into orthogonal function subspaces

{f_{u} : u \subseteq B}

(projections) in such a way that each projection on an input subspace

f_{u}

maximizes variance and across subspaces retrieves total variance, i.e.,

f = \sum_{u \subseteq B} f_{u}

and

V a r f = \sum_{u \subseteq B} V a r f_{u}

. The solution to this problem in the retrieval of

{f_{u} : u \subseteq B}

is known as high dimensional model representation (HDMR) or functional ANOVA expansion and for f is written as

f (x_{1}, \dots, x_{n}) = f_{0} + \sum_{i} f_{i} (x_{i}) + \sum_{i < j} f_{i j} (x_{i}, x_{j}) + \dots + f_{1 \dots n} (x_{1}, \dots, x_{n})

where the

{f_{u} : u \subseteq B}

are called component functions. For independent inputs, the component functions are mutually orthogonal and, aside from the constant component function

f_{0} = E f

(order zero), have zero mean

E f_{u} = 0

for all non-empty (

2^{n} - 1

) subspaces, where

V a r f_{u} = \int_{E} f_{u}^{2} (x_{u}) ν (d x) < \infty for u \subseteq B

and

V a r f = \int_{E} {(f (x) - f_{0})}^{2} ν (d x) = \sum_{u \subseteq B} V a r f_{u} < \infty .

A key principle of HDMR is that the expansion for most f may be truncated at low order

T ≪ n

in a T-order HDMR,

f (x) ≃ f^{T} (x) = \sum_{u \subseteq B : | u | \leq T} f_{u} (x_{u}) for T ≪ n .

HDMR is often used in global sensitivity analysis to assess input–output correlations at various orders, where the variances are normalized to define sensitivity indices

S_{u} = \frac{V a r f_{u}}{V a r f} for u \subseteq B .

When the inputs are correlated

ν \neq \prod_{i} ν_{i}

, then the component functions may still be uniquely recovered under hierarchical orthogonality, and the variance decomposes as

V a r f = \sum_{u, v \subseteq B} C o v (f_{u}, f_{v}),

where

C o v (f_{u}, f_{v}) = \int_{E} f_{u} (x_{u}) f_{v} (x_{v}) ν (d x) for u, v \subseteq B

and the sensitivity indices generalize to structural and correlative sensitivity indices [56], defined respectively as

S_{u}^{a} = \frac{V a r f_{u}}{V a r f} for u \subseteq B

and

S_{u}^{b} = \frac{\sum_{v \subseteq B : u \neq v} C o v (f_{u}, f_{v})}{V a r f} for u \subseteq B

with total sensitivity index

S_{u} = S_{u}^{a} + S_{u}^{b} for u \subseteq B

where

\sum_{u \subseteq B} S_{u} = 1

. We use the total sensitivity index as a measure of variable importance and the component functions as profiles of output dependence on the input subspaces.
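For independent inputs on a finite grid, the first-order component functions and sensitivity indices can be computed by plain averaging. The sketch below assumes two inputs with uniform product measure and uses two invented test functions; it is not the HDMR estimator used for the figures, and the names are hypothetical.

```python
from itertools import product


def first_order_hdmr(f, grid1, grid2):
    """First-order HDMR of f on a grid with uniform, independent inputs."""
    pts = list(product(grid1, grid2))
    f0 = sum(f(a, b) for a, b in pts) / len(pts)
    # Component functions: conditional means minus the overall mean.
    f1 = {a: sum(f(a, b) for b in grid2) / len(grid2) - f0 for a in grid1}
    f2 = {b: sum(f(a, b) for a in grid1) / len(grid1) - f0 for b in grid2}
    var = sum((f(a, b) - f0) ** 2 for a, b in pts) / len(pts)
    s1 = sum(v * v for v in f1.values()) / len(grid1) / var
    s2 = sum(v * v for v in f2.values()) / len(grid2) / var
    return f0, f1, f2, s1, s2


n_grid = [3, 4, 5]
k_grid = [0.0, 0.5, 1.0]

# Additive toy output: first order captures all of the variance.
_, _, _, s_n_add, s_k_add = first_order_hdmr(lambda n, k: 2 * n + 4 * k, n_grid, k_grid)

# Toy output with an interaction: first order leaves variance unexplained.
_, _, _, s_n_int, s_k_int = first_order_hdmr(lambda n, k: n * k, n_grid, k_grid)
```

For the additive function the two indices sum to one, while for the product function the sum falls below one, the shortfall being the variance carried by the second-order term.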

Appendix G. Reliability Definitions

Given a failure density f and reliability (survival) function R, we give some relations: the cumulative failure distribution is defined as

F (t | ϑ) = \int_{0}^{t} f (s | ϑ) d s

where

R (t | ϑ) + F (t | ϑ) = 1,

the hazard rate

h (t | ϑ)

is defined as

h (t | ϑ) = \frac{f (t | ϑ)}{R (t | ϑ)},

the cumulative hazard is defined as

H (t | ϑ) = \int_{0}^{t} h (s | ϑ) d s,

and we have reliability expressed in terms of the cumulative hazard

R (t | ϑ) = e^{- H (t | ϑ)} .

Another useful quantity is the mean residual life

μ (t | ϑ) = \frac{\int_{t}^{\infty} R (s | ϑ) d s}{R (t | ϑ)} .
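These relations can be verified numerically for a concrete family. The sketch uses a Weibull distribution as an assumed example (it is not claimed to be the distribution of the hitting times), integrates with a simple trapezoid rule, and truncates the infinite upper limit of the mean residual life at a finite horizon.

```python
import math


def weibull_survival(t, shape, scale):
    """R(t) = exp(-(t / scale)^shape)."""
    return math.exp(-((t / scale) ** shape))


def weibull_density(t, shape, scale):
    """f(t) = -dR/dt."""
    return (shape / scale) * (t / scale) ** (shape - 1) * weibull_survival(t, shape, scale)


def hazard(t, shape, scale):
    """h(t) = f(t) / R(t)."""
    return weibull_density(t, shape, scale) / weibull_survival(t, shape, scale)


def trapezoid(func, a, b, steps=20000):
    """Composite trapezoid rule for the integral of func over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (func(a) + func(b)) + sum(func(a + i * h) for i in range(1, steps))
    return total * h


def cumulative_hazard(t, shape, scale):
    """H(t) = integral of h over [0, t]."""
    return trapezoid(lambda s: hazard(s, shape, scale), 0.0, t)


def mean_residual_life(t, shape, scale, horizon=200.0):
    """Tail integral of R over [t, t + horizon], divided by R(t); truncated."""
    tail = trapezoid(lambda s: weibull_survival(s, shape, scale), t, t + horizon)
    return tail / weibull_survival(t, shape, scale)


R_direct = weibull_survival(1.5, 2.0, 3.0)
R_from_H = math.exp(-cumulative_hazard(1.5, 2.0, 3.0))
mrl_exponential = mean_residual_life(1.0, 1.0, 2.0)
```

With shape 1 the Weibull reduces to the exponential distribution, whose mean residual life is constant and equal to the scale parameter, so `mrl_exponential` should be close to 2.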

Appendix H. Additional Figures

Appendix H.1. Linear Landscape

Figure A1. Linear landscape: Measures of system population

X_{t}

until hitting time

τ_{v}

for high-fidelity replicator volume fraction

v = 0.25

with sequence dimension

n = 3

, fitness/similarity curvature

k = l = - log (0.01) / n

, initial population size

I = | X_{0} | = 10

, singleton high-fidelity replicator

R = {{x}}

, with linear fitness and similarity functions. (a) Concentration of RNA sequences by Hamming distance to high-fidelity replicator; (b) population size of RNA sequences by Hamming distance to high-fidelity replicator; (c) polymerase RNA sequence output by Hamming distance to high-fidelity replicator.

Appendix H.2. Core Model with Clay

Figure A2. Core model with clay: Measures of system population

X_{t}

until hitting time

τ_{v} (θ)

for high-fidelity replicator volume fraction

v = 0.25

with sequence dimension

n = 3

, fitness/similarity curvature

k = - log (0.01) / n

, initial population size

I = | X_{0} | = 10

, singleton high-fidelity replicator

R = {{x}}

, double-strand dissociation and formation rates

k_{s s} = k_{d s} = 1

, clay replication fidelity probability

p = 0.9

, and RNA polymerization rate

k_{r e p}

and clay polymerization rate

k_{c l a y - p}

chosen such that the replicative mass of each is 10. (a) Concentration of RNA sequences by Hamming distance to high-fidelity replicator; (b) population size of RNA sequences by Hamming distance to high-fidelity replicator; (c) polymerase RNA sequence output by Hamming distance to high-fidelity replicator; (d) probability of reactions over time.

Appendix H.3. Hitting/Survival Analysis

Figure A3. Survival analysis of hitting time

τ_{v}

for high-fidelity replicator volume fraction

v = 0.1

for core model. (a) Coefficients of the Cox proportional hazard survival model; (b) survival curves in sequence dimension

n = L

and landscape curvature

k = l

; (c) cumulative hazard in sequence dimension

n = L

and landscape curvature

k = l

.

Figure A4. First-order HDMR analysis of hitting time

τ_{v} (θ) < \infty

for core model. (a) Hexagonal-bin truth plot; (b) HDMR component function for sequence length

n = L

,

f_{n} (n)

, in sequence length, of hitting time; (c) HDMR component function for landscape curvature,

f_{k} (k)

, in landscape curvature, of hitting time. The color function is from blue (negative) to white (zero) to red (positive). The black dots represent standard deviation of the error.

Figure A5. First-order HDMR analysis of hitting probability

P (τ_{v} (θ) < \infty)

for core model. (a) Hexagonal-bin truth plot; (b) HDMR component function for sequence length,

f_{n} (n)

, in sequence length, of hitting probability; (c) HDMR component function for landscape curvature,

f_{k} (k)

, in landscape curvature, of hitting probability. The color function is from blue (negative) to white (zero) to red (positive). The black dots represent standard deviation of the error.

Figure A6. Survival analysis of hitting time $τ_{v}$ for volume fraction $v = 0.1$ for expanded model (clay and decay). Coefficients of the Cox proportional hazards survival model.

Figure A7. First-order HDMR analysis of hitting probability $P (τ_{v} (θ) < \infty)$ for expanded model (clay and decay). (a) Hexagonal-bin truth plot; (b) HDMR component function for sequence length $n = L$, $f_{n} (n)$, in sequence length, of hitting probability; (c) HDMR component function for curvature, $f_{k} (k)$, in curvature, of hitting probability; (d) HDMR component function for clay fitness, $f_{f_{clay}} (f_{clay})$, in clay fitness, of hitting probability. The color function is from blue (negative) to white (zero) to red (positive). The black dots represent the standard deviation of the error.

Figure A8. First-order HDMR analysis of hitting time $τ_{v} (θ) < \infty$ for expanded model (clay and decay). (a) Hexagonal-bin truth plot; (b) HDMR component function for sequence length, $f_{n} (n)$, in sequence length, of hitting time; (c) HDMR component function for curvature, $f_{k} (k)$, in curvature, of hitting time; (d) HDMR component function for clay fraction, $f_{f_{clay}} (f_{clay})$, in clay fraction, of hitting time; (e) HDMR component function for decay rate, $f_{k_{⌀}} (k_{⌀})$, in decay rate, of hitting time. The color function is from blue (negative) to white (zero) to red (positive). The black dots represent the standard deviation of the error.

Figure A9. First-order HDMR analysis of hitting cardinality $ϖ_{v} (θ) < \infty$ for core model and volume fraction $v = 0.1$. (a) Hexagonal-bin truth plot; (b) HDMR component function for sequence dimension, $f_{n} (n)$, in sequence dimension, of hitting cardinality; (c) HDMR component function for curvature, $f_{k} (k)$, in curvature, of hitting cardinality. The color function is from blue (negative) to white (zero) to red (positive). The black dots represent the standard deviation of the error.

References

Ganti, T. The Principles of Life; Oxford University Press: Oxford, UK, 2003.
Ganti, T. Chemoton Theory; Kluwer Academic/Plenum Publishers: Dordrecht, The Netherlands, 2003.
Orgel, L.E. Evolution of the genetic apparatus. J. Mol. Biol. 1968, 38, 381–393.
Gilbert, W. Origin of life: The RNA world. Nature 1986, 319, 618.
Joyce, G.F. The antiquity of RNA-based evolution. Nature 2002, 418, 214–221.
White, H.B. Coenzymes as fossils of an earlier metabolic state. J. Mol. Evol. 1976, 7, 101–104.
Eigen, M. Selforganization of matter and the evolution of biological macromolecules. Naturwissenschaften 1971, 58, 465–523.
Eigen, M.; Schuster, P. A principle of natural self-organization. Naturwissenschaften 1977, 64, 541–565.
Cech, T.R. The Ribosome Is a Ribozyme. Science 2000, 289, 878.
Diener, T.O. Potato spindle tuber “virus”. IV. A replicating, low molecular weight RNA. Virology 1971, 45, 411–428.
Tupper, A.S.; Higgs, P.G. Rolling-circle and strand-displacement mechanisms for non-enzymatic RNA replication at the time of the origin of life. J. Theor. Biol. 2021, 527, 110822.
Vaidya, N.; Manapat, M.L.; Chen, I.A.; Xulvi-Brunet, R.; Hayden, E.J.; Lehman, N. Spontaneous network formation among cooperative RNA replicators. Nature 2012, 491, 72–77.
de Farias, S.T.; dos Santos Junior, A.P.; Rêgo, T.G.; José, M.V. Origin and Evolution of RNA-Dependent RNA Polymerase. Front. Genet. 2017, 8, 125.
Koonin, E.V.; Krupovic, M.; Ishino, S.; Ishino, Y. The replication machinery of LUCA: Common origin of DNA replication and transcription. BMC Biol. 2020, 18, 61.
Ghadessy, F.J.; Ong, J.L.; Holliger, P. Directed evolution of polymerase function by compartmentalized self-replication. Proc. Natl. Acad. Sci. USA 2001, 98, 4552.
Tjhung, K.F.; Shokhirev, M.N.; Horning, D.P.; Joyce, G.F. An RNA polymerase ribozyme that synthesizes its own ancestor. Proc. Natl. Acad. Sci. USA 2020, 117, 2906.
Ertem, G.; Ferris, J.P. Template-Directed Synthesis Using the Heterogeneous Templates Produced by Montmorillonite Catalysis. A Possible Bridge Between the Prebiotic and RNA Worlds. J. Am. Chem. Soc. 1997, 119, 7197–7201.
Acevedo, O.L.; Orgel, L.E. Non-enzymatic transcription of an oligodeoxynucleotide 14 residues long. J. Mol. Biol. 1987, 197, 187–193.
Szostak, J.W. The eightfold path to non-enzymatic RNA replication. J. Syst. Chem. 2012, 3, 2.
Cairns-Smith, A.G. Genetic Takeover and the Mineral Origins of Life; Cambridge University Press: Cambridge, UK, 1987.
Dyson, F. Origins of Life; Cambridge University Press: Cambridge, UK, 1999.
Sakuma, Y.; Imai, M. From vesicles to protocells: The roles of amphiphilic molecules. Life 2015, 5, 651–675.
Szostak, J.W.; Bartel, D.P.; Luisi, P.L. Synthesizing life. Nature 2001, 409, 387–390.
Segré, D.; Ben-Eli, D.; Deamer, D.W.; Lancet, D. The Lipid World. Orig. Life Evol. Biosph. 2001, 31, 119–145.
Martin, W.; Baross, J.; Kelley, D.; Russell, M.J. Hydrothermal vents and the origin of life. Nat. Rev. Microbiol. 2008, 6, 805–814.
Damer, B.; Deamer, D. The Hot Spring Hypothesis for an Origin of Life. Astrobiology 2019, 20, 429–452.
Damer, B.; Deamer, D. Coupled phases and combinatorial selection in fluctuating hydrothermal pools: A scenario to guide experimental approaches to the origin of cellular life. Life 2015, 5, 872–887.
Deamer, D.; Damer, B.; Kompanichenko, V. Hydrothermal Chemistry and the Origin of Cellular Life. Astrobiology 2019, 19, 1523–1537.
Kvenvolden, K.; Lawless, J.; Pering, K.; Peterson, E.; Flores, J.; Ponnamperuma, C.; Kaplan, I.R.; Moore, C. Evidence for Extraterrestrial Amino-acids and Hydrocarbons in the Murchison Meteorite. Nature 1970, 228, 923–926.
Miller, S.L.; Urey, H.C. Organic Compound Synthesis on the Primitive Earth. Science 1959, 130, 245.
Pino, S.; Sponer, J.E.; Costanzo, G.; Saladino, R.; Mauro, E.D. From formamide to RNA, the path is tenuous but continuous. Life 2015, 5, 372–384.
Powner, M.W.; Sutherland, J.D. Prebiotic chemistry: A new modus operandi. Phil. Trans. R. Soc. B Biol. Sci. 2011, 366, 2870–2877.
Powner, M.W.; Gerland, B.; Sutherland, J.D. Synthesis of activated pyrimidine ribonucleotides in prebiotically plausible conditions. Nature 2009, 459, 239–242.
Attwater, J.; Wochner, A.; Holliger, P. In-ice evolution of RNA polymerase ribozyme activity. Nat. Chem. 2013, 5, 1011–1018.
Hays, L. NASA Astrobiology Strategy; Technical report; National Aeronautics and Space Administration: Washington, DC, USA, 2015.
Coveney, P.V.; Swadling, J.B.; Wattis, J.A.D.; Greenwell, H.C. Theory, modelling and simulation in origins of life studies. Chem. Soc. Rev. 2012, 41, 5430–5446.
Lanier, K.A.; Williams, L.D. The Origin of Life: Models and Data. J. Mol. Evol. 2017, 84, 85–92.
Wu, M.; Higgs, P.G. The origin of life is a spatially localized stochastic transition. Biol. Direct 2012, 7, 42.
Gillespie, D.T. Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 1977, 81, 2340–2361.
Hordijk, W.; Steel, M. Detecting autocatalytic, self-sustaining sets in chemical reaction systems. J. Theor. Biol. 2004, 227, 451–461.
Walker, S.I. Origins of life: A problem for physics, a key issues review. Rep. Prog. Phys. 2017, 80, 092601.
Wattis, J.A.D.; Coveney, P.V. The Origin of the RNA World: A Kinetic Model. J. Phys. Chem. B 1999, 103, 4231–4250.
Kun, Á.; Szilágyi, A.; Könnyű, B.; Boza, G.; Zachar, I.; Szathmáry, E. The dynamics of the RNA world: Insights and challenges. Ann. N. Y. Acad. Sci. 2015, 1341, 75–95.
Szilágyi, A.; Zachar, I.; Scheuring, I.; Kun, Á.; Könnyű, B.; Czárán, T. Ecology and Evolution in the RNA World Dynamics and Stability of Prebiotic Replicator Systems. Life 2017, 7, 48.
Scheuring, I.; Szilágyi, A. Diversity, stability, and evolvability in models of early evolution. Curr. Opin. Syst. Biol. 2019, 13, 115–121.
Takeuchi, N.; Hogeweg, P. Evolutionary dynamics of RNA-like replicator systems: A bioinformatic approach to the origin of life. Phys. Life Rev. 2012, 9, 219–263.
Orgel, L. A Simpler Nucleic Acid. Science 2000, 290, 1306.
Maury, C.P.J. Origin of life. Primordial genetics: Information transfer in a pre-RNA world based on self-replicating beta-sheet amyloid conformers. J. Theor. Biol. 2015, 382, 292–297.
Ehrenfreund, P.; Rasmussen, S.; Cleaves, J.; Chen, L. Experimentally Tracing the Key Steps in the Origin of Life: The Aromatic World. Astrobiology 2006, 6, 490–520.
Kunin, V. A System of Two Polymerases—A Model for the Origin of Life. Orig. Life Evol. Biosph. 2000, 30, 459–466.
Wright, S. Evolution and the Genetics of Populations, Volume 1; The University of Chicago Press: Chicago, IL, USA, 1984.
Feng, X.; Pechen, A.; Jha, A.; Wu, R.; Rabitz, H. Global optimality of fitness landscapes in evolution. Chem. Sci. 2012, 3, 900–906.
Davidson-Pilon, C. lifelines: Survival analysis in Python. J. Open Source Softw. 2019, 4, 1317.
Bornholt, J.; Lopez, R.; Carmean, D.M.; Ceze, L.; Seelig, G.; Strauss, K. A DNA-Based Archival Storage System. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, Association for Computing Machinery, New York, NY, USA, 19–23 April 2016; pp. 637–649.
Kitadai, N.; Maruyama, S. Origins of building blocks of life: A review. Geosci. Front. 2018, 9, 1117–1153.
Li, G.; Rabitz, H. General formulation of HDMR component functions with independent and correlated variables. J. Math. Chem. 2012, 50, 99–130.

Figure 1. Measures of system population $X_{t}$ until hitting time $τ_{v}$ for high-fidelity replicator volume fraction $v = 0.25$ with sequence dimension $n = 3$, fitness/similarity curvature $l = k = - \log (0.01) / n$, initial population size $I = | X_{0} | = 10$, singleton high-fidelity replicator $R = \{x\}$, with “tent” fitness and similarity functions. (a) Concentration of RNA sequences by Hamming distance to high-fidelity replicator; (b) population size of RNA sequences by Hamming distance to high-fidelity replicator; (c) polymerase RNA sequence output by Hamming distance to high-fidelity replicator.

Figure 2. Measures of system population $X_{t}$ until hitting time $τ_{v}$ for high-fidelity replicator volume fraction $v = 0.25$ with sequence dimension $n = 3$, fitness/similarity curvature $l = k = - \log (0.1) / n$, initial population size $I = | X_{0} | = 10$, singleton high-fidelity replicator $R = \{x\}$, with “tent” fitness and similarity functions. (a) Concentration of RNA sequences by Hamming distance to high-fidelity replicator; (b) population size of RNA sequences by Hamming distance to high-fidelity replicator; (c) polymerase RNA sequence output by Hamming distance to high-fidelity replicator.

Table 1. Weibull reliability model.

Name	Baseline quantity	Baseline expression	Proportional-hazards quantity	Proportional-hazards expression
Failure density	$f (t \mid α, β)$	$\frac{α}{t} {(\frac{t}{β})}^{α} e^{- {(\frac{t}{β})}^{α}}$	$f (t, θ \mid α, β, γ)$	$\frac{α}{t} {(\frac{t}{β})}^{α} e^{γ \cdot θ} e^{- {(\frac{t}{β})}^{α} e^{γ \cdot θ}}$
Failure distribution	$F (t \mid α, β)$	$1 - e^{- {(\frac{t}{β})}^{α}}$	$F (t, θ \mid α, β, γ)$	$1 - e^{- {(\frac{t}{β})}^{α} e^{γ \cdot θ}}$
Reliability distribution	$R (t \mid α, β)$	$e^{- {(\frac{t}{β})}^{α}}$	$R (t, θ \mid α, β, γ)$	$e^{- {(\frac{t}{β})}^{α} e^{γ \cdot θ}}$
Cumulative hazard	$H (t \mid α, β)$	${(\frac{t}{β})}^{α}$	$H (t, θ \mid α, β, γ)$	${(\frac{t}{β})}^{α} e^{γ \cdot θ}$
Hazard rate	$h (t \mid α, β)$	$\frac{α}{t} {(\frac{t}{β})}^{α}$	$h (t, θ \mid α, β, γ)$	$\frac{α}{t} {(\frac{t}{β})}^{α} e^{γ \cdot θ}$
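
The expressions in Table 1 can be evaluated directly. The following is a minimal sketch, not the authors' code: the function name `weibull_ph` and its argument layout are assumptions made for illustration. Passing empty `gamma` and `theta` gives the baseline column, because the factor $e^{γ \cdot θ}$ is then 1.

```python
import math

def weibull_ph(t, alpha, beta, gamma=(), theta=()):
    """Evaluate the Table 1 quantities at time t > 0.

    Hypothetical helper: gamma and theta are equal-length sequences of
    coefficients and covariates; empty sequences give the baseline model.
    """
    # Proportional-hazards factor e^{gamma . theta} (equals 1 for the baseline).
    scale = math.exp(sum(g * x for g, x in zip(gamma, theta)))
    H = (t / beta) ** alpha * scale                # cumulative hazard
    h = (alpha / t) * (t / beta) ** alpha * scale  # hazard rate
    R = math.exp(-H)                               # reliability (survival)
    F = 1.0 - R                                    # failure distribution
    f = h * R                                      # failure density
    return {"f": f, "F": F, "R": R, "H": H, "h": h}

baseline = weibull_ph(2.0, alpha=1.5, beta=3.0)
shifted = weibull_ph(2.0, alpha=1.5, beta=3.0, gamma=(0.5,), theta=(2.0,))
print(baseline["R"], shifted["R"])
```

The sketch relies on the identities $R = e^{-H}$, $F = 1 - R$, and $f = h R$, which hold row by row in the table.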

Table 2. Reactions and orders.

Reaction	Order
RNA double strand formation	2
RNA double strand dissociation	1
RNA polymerization	2
RNA decay	1
Clay polymerization	1
Clay oligomerization	0
Compartmentalization	1
Metabolism to replication	1
Metabolism & replication to metabolism	2
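
To illustrate what the orders in Table 2 mean for a stochastic simulation, the sketch below computes a generic mass-action propensity in which the order equals the number of reactant counts multiplied into the rate. This is an assumption made for illustration only: the model's actual reaction rates are defined in the paper's Reaction Rates section and may differ, and the names `REACTION_ORDER` and `propensity` are hypothetical.

```python
# Reaction orders copied from Table 2.
REACTION_ORDER = {
    "RNA double strand formation": 2,
    "RNA double strand dissociation": 1,
    "RNA polymerization": 2,
    "RNA decay": 1,
    "Clay polymerization": 1,
    "Clay oligomerization": 0,
    "Compartmentalization": 1,
    "Metabolism to replication": 1,
    "Metabolism & replication to metabolism": 2,
}

def propensity(reaction, rate_constant, counts=()):
    """Generic mass-action propensity: rate constant times each reactant count.

    Assumes distinct reactant species; one count per unit of reaction order.
    """
    if len(counts) != REACTION_ORDER[reaction]:
        raise ValueError("number of reactant counts must equal the reaction order")
    value = rate_constant
    for count in counts:
        value *= count
    return value

# Zeroth order: independent of the population; second order: product of two counts.
print(propensity("Clay oligomerization", 1.0))
print(propensity("RNA polymerization", 10.0, (3, 4)))
```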

Table 3. Model parameters.

$θ$	Name	Domain	Value(s)
$n$	sequence dimension	$\mathbb{N}_{> 0}$	$\{3, 4, 5\}$
$k$	RNA fitness parameter	$\mathbb{R}_{\geq 0}$	$\{- \log (i) / n : i = 0.1, 0.05, 0.01, 0.005, 0.001\}$
$l$	RNA similarity parameter	$\mathbb{R}_{\geq 0}$	$\{- \log (i) / n : i = 0.1, 0.05, 0.01, 0.005, 0.001\}$
$m$	RNA fidelity parameter	$\mathbb{R}_{\geq 0}$	$- \log (0.25) / n$
$p$	clay fidelity probability	$(0, 1]$	0.9
$k_{ss}$	double-strand dissociation rate	$\mathbb{R}_{\geq 0}$	1
$k_{ds}$	double-strand formation rate	$\mathbb{R}_{\geq 0}$	1
$k_{rep}$	RNA replication rate	$\mathbb{R}_{\geq 0}$	10
$k_{⌀}$	RNA decay rate	$\mathbb{R}_{\geq 0}$	(0, 1)
$k_{clay-o}$	clay RNA oligomerization rate	$\mathbb{R}_{\geq 0}$	1
$k_{clay-p}$	clay RNA polymerization rate	$\mathbb{R}_{\geq 0}$	(0, 20)
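
The curvature values in Table 3 depend on the sequence dimension, so the grid of $(n, k)$ pairs is easiest to see when enumerated. The sketch below is illustrative only: it assumes the logarithm is natural and that $k = l$ as in the figure captions, and the names `SEQUENCE_DIMENSIONS`, `LEVELS`, and `curvature` are hypothetical.

```python
import math
from itertools import product

SEQUENCE_DIMENSIONS = (3, 4, 5)            # values of n from Table 3
LEVELS = (0.1, 0.05, 0.01, 0.005, 0.001)   # values of i from Table 3

def curvature(i, n):
    """Curvature -log(i) / n (natural logarithm assumed)."""
    return -math.log(i) / n

# One (n, k) pair per combination of sequence dimension and level.
grid = [(n, curvature(i, n)) for n, i in product(SEQUENCE_DIMENSIONS, LEVELS)]

for n, k in grid:
    print(f"n = {n}, k = l = {k:.4f}")
```

With this parameterization $e^{- k n} = i$ by construction, so each level $i$ fixes the same value of $e^{-kn}$ across all sequence dimensions.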

Table 4. HDMR sensitivity indices of hitting probability $P (τ_{v} (θ) < \infty)$ for the core model.

$θ$	$S_{θ}$
Sequence length n	0.06
Curvature k	0.69
∑	0.75

Table 5. HDMR sensitivity indices of hitting time $τ_{v} (θ)$ for the core model.

$θ$	$S_{θ}$
Sequence length n	0.57
Curvature k	0.04
∑	0.61

Table 6. Weibull–Cox model parameters for hitting times of clay and decay model.

$θ$	Name	Coefficient $γ_{θ}$	p-Value
n	sequence dimension	0.54	<0.005
$k = l$	RNA fitness parameter	27.87	<0.005
p	clay fidelity probability	−0.08	0.35
$k_{⌀}$	RNA decay rate	0.80	<0.005
$f_{clay}$	fraction clay RNA polymerization rate	0.96	<0.005
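
Under the proportional-hazards form of Table 1, a change $Δθ$ in one covariate multiplies the hazard by $e^{γ_{θ} Δθ}$. The sketch below applies that arithmetic to the coefficients of Table 6. It is an illustration under that assumption, not output from the authors' analysis: the dictionary keys and the function name `hazard_ratio` are hypothetical, and the sign convention of the fitted coefficients should be checked against the fitting software before interpreting the direction of an effect.

```python
import math

# Coefficients gamma_theta copied from Table 6 (key names are illustrative).
COEFFICIENTS = {
    "n": 0.54,
    "k = l": 27.87,
    "p": -0.08,
    "k_decay": 0.80,
    "f_clay": 0.96,
}

def hazard_ratio(gamma, delta=1.0):
    """Multiplicative change in hazard for a covariate change of size delta."""
    return math.exp(gamma * delta)

for name, gamma in COEFFICIENTS.items():
    # A unit step is large relative to the range of k, so a smaller step is used there.
    delta = 0.1 if name == "k = l" else 1.0
    print(f"{name}: delta = {delta}, hazard ratio = {hazard_ratio(gamma, delta):.3f}")
```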

Table 7. HDMR sensitivity indices of hitting probability $P (τ_{v} (θ) < \infty)$ for expanded model (clay and decay).

$θ$	$S_{θ}$
Sequence length n	0.0213
Curvature k	0.6732
Decay rate $k_{⌀}$	0.0120
Clay fidelity p	0.0114
Fraction clay RNA polymerization rate $f_{clay}$	0.0219
∑	0.7399

Table 8. HDMR sensitivity indices of hitting time $τ_{v} (θ) < \infty$ for expanded model (clay and decay).

$θ$	$S_{θ}$
Sequence dimension n	0.0013
Curvature k	0.0030
Decay $k_{⌀}$	0.1129
Clay fidelity p	0.0152
Fraction clay RNA polymerization rate $f_{clay}$	0.2014
∑	0.3339

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
