## 2. Autocatalytic Sets

Autocatalytic cycles and sets seem to play an important role in more than one of the steps in the above OoL scenario, and are a necessary, although not sufficient, condition for life.

Autocatalytic sets are defined more formally in [37] as follows. Given a network of catalyzed chemical reactions, a (sub)set $\mathcal{R}$ of such reactions is called:

- Reflexively autocatalytic (RA) if every reaction in $\mathcal{R}$ is catalyzed by at least one molecule involved in any of the reactions in $\mathcal{R}$;
- F-generated (F) if every reactant in $\mathcal{R}$ can be constructed from a small “food set” F by successive applications of reactions from $\mathcal{R}$;
- Reflexively autocatalytic and F-generated (RAF) if it is both RA and F.
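The three conditions above translate almost directly into set operations. The following is a minimal sketch, not code from [37]: the `Reaction` record and the function names are hypothetical, and the two-reaction example is an assumed toy system chosen only to exercise the definitions.

```python
from typing import FrozenSet, Iterable, List, NamedTuple, Set


class Reaction(NamedTuple):
    """Hypothetical record for one catalyzed reaction."""
    name: str
    reactants: FrozenSet[str]
    products: FrozenSet[str]
    catalysts: FrozenSet[str]  # any one of these suffices to catalyze it


def closure(food: Iterable[str], reactions: Iterable[Reaction]) -> Set[str]:
    """Molecules constructible from the food set by successive reactions.

    Catalysis is deliberately ignored here; it is checked separately.
    """
    reactions = list(reactions)
    available = set(food)
    changed = True
    while changed:
        changed = False
        for r in reactions:
            if r.reactants <= available and not r.products <= available:
                available |= r.products
                changed = True
    return available


def is_ra(reactions: Iterable[Reaction]) -> bool:
    """RA: every reaction has a catalyst among the molecules involved in the set."""
    reactions = list(reactions)
    involved: Set[str] = set()
    for r in reactions:
        involved |= r.reactants | r.products
    return all(r.catalysts & involved for r in reactions)


def is_f_generated(food: Iterable[str], reactions: Iterable[Reaction]) -> bool:
    """F: every reactant can be built from the food set using only these reactions."""
    reactions = list(reactions)
    reachable = closure(food, reactions)
    return all(r.reactants <= reachable for r in reactions)


def is_raf(food: Iterable[str], reactions: Iterable[Reaction]) -> bool:
    """RAF: a non-empty set that is both RA and F-generated."""
    reactions = list(reactions)
    return bool(reactions) and is_ra(reactions) and is_f_generated(food, reactions)


# Assumed toy system: a + b -> c (catalyzed by d), b + c -> d (catalyzed by a).
r1 = Reaction("r1", frozenset("ab"), frozenset("c"), frozenset("d"))
r2 = Reaction("r2", frozenset("bc"), frozenset("d"), frozenset("a"))
print(is_raf({"a", "b"}, [r1, r2]))  # both conditions hold together
print(is_raf({"a", "b"}, [r1]))      # r1 alone has no catalyst within the set
```

Neither reaction is RAF on its own: `r1` lacks its catalyst, and `r2` lacks a reactant that only `r1` can supply. This mirrors the "collective" character of the definition.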

The food set F contains molecules that are assumed to be freely available in the environment. Thus, an RAF set formally captures the notion of “catalytic closure”, i.e., a self-sustaining set supported by a steady supply of (simple) molecules from some food set. Note that this notion of an autocatalytic set is slightly different from the (chemical) use of the term autocatalytic reaction, in which some molecule directly catalyzes its own production. By an autocatalytic set we do not mean a set of autocatalytic reactions, but rather a set of (arbitrary) molecules and reactions which is “collectively autocatalytic” in the sense that all molecules help in producing each other (through mutual catalysis) in some closed and self-sustained manner (supported by a food set). Thus, autocatalytic cycles, hypercycles, and collectively autocatalytic sets can all be seen as particular instances of RAF sets.

Figure 1 shows a simple example.

This formal RAF framework was introduced and analyzed (both theoretically and computationally) in our previous work [37,38,39]. In particular, we:

- Formalized the notion of a catalytic reaction system (CRS) and autocatalytic (RAF) sets;
- Introduced a polynomial-time algorithm for determining if any CRS has within it an RAF set, and if so, finding such self-sustaining subsystems (including ones that are minimal);
- Showed that only a linear growth rate (with system size) in catalytic activity is sufficient for RAF sets to appear with high probability in random instances of a simple catalytic reaction system based on polymer cleavage and ligation reactions (the original model of [26]).

**Figure 1.**
A simple example of a catalytic reaction system (CRS) with seven molecule types $\{a,b,c,d,e,f,g\}$ (solid nodes) and four reactions $\{{r}_{1},{r}_{2},{r}_{3},{r}_{4}\}$ (open nodes). The food set is $F=\{a,b\}$. Solid arrows indicate reactants going into and products coming out of a reaction, dashed arrows indicate catalysis. The subset $\mathcal{R}=\{{r}_{1},{r}_{2}\}$ (shown with bold arrows) is RAF: (1) The molecules involved in $\mathcal{R}$ are $\{a,b,c,d\}$, reaction ${r}_{1}$ is catalyzed by d, and ${r}_{2}$ by a, so $\mathcal{R}$ is reflexively autocatalytic (RA). (2) The reactants in $\mathcal{R}$ are either already in the food set (a and b) or can be created from it through ${r}_{1}$ (c), so $\mathcal{R}$ is also F-generated (F). The RAF algorithm applied to this CRS would return the set $\mathcal{R}$.
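One way to search a CRS for an RAF set is by iterative pruning: compute what the food set can generate, discard every reaction that lacks a reactant or a catalyst in that generated set, and repeat until nothing changes. The sketch below illustrates this idea and should not be read as the exact procedure of [37]. The `Reaction` record and function names are hypothetical. The wiring of ${r}_{1}$ and ${r}_{2}$ is assumed so as to agree with the caption, while ${r}_{3}$ and ${r}_{4}$ are invented for illustration, since the caption does not specify them.

```python
from typing import FrozenSet, Iterable, List, NamedTuple, Set


class Reaction(NamedTuple):
    """Hypothetical record for one catalyzed reaction."""
    name: str
    reactants: FrozenSet[str]
    products: FrozenSet[str]
    catalysts: FrozenSet[str]


def closure(food: Iterable[str], reactions: Iterable[Reaction]) -> Set[str]:
    """Molecules constructible from the food set (catalysis ignored)."""
    reactions = list(reactions)
    available = set(food)
    changed = True
    while changed:
        changed = False
        for r in reactions:
            if r.reactants <= available and not r.products <= available:
                available |= r.products
                changed = True
    return available


def max_raf(food: Iterable[str], reactions: Iterable[Reaction]) -> List[Reaction]:
    """Prune reactions until the remainder is RAF; returns [] if none exists."""
    food = set(food)
    current = list(reactions)
    while True:
        reachable = closure(food, current)
        kept = [r for r in current
                if r.reactants <= reachable and r.catalysts & reachable]
        if len(kept) == len(current):  # fixed point: nothing was removed
            return kept
        current = kept


# r1 and r2: assumed wiring consistent with the Figure 1 caption.
# r3 and r4: invented for this sketch (not taken from the figure).
crs = [
    Reaction("r1", frozenset("ab"), frozenset("c"), frozenset("d")),
    Reaction("r2", frozenset("bc"), frozenset("d"), frozenset("a")),
    Reaction("r3", frozenset("cd"), frozenset("e"), frozenset("g")),
    Reaction("r4", frozenset("ef"), frozenset("g"), frozenset("c")),
]
print([r.name for r in max_raf({"a", "b"}, crs)])
```

Each pass either removes at least one reaction or stops, so the outer loop runs at most one more time than there are reactions, which keeps this naive version polynomial in the size of the network.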


Next to the above basic three requirements in an RAF we can easily build in additional assumptions to incorporate greater biochemical realism and to exclude trivial situations. Some such assumptions are:

Fortunately, it is straightforward to add some of these constraints to the definition of an RAF without seriously affecting either the algorithm for finding an RAF or the mathematical analysis. Indeed, regarding the first assumption, one can formally dispense with catalysis altogether by regarding each catalyst as both a reactant and a product of a reaction, and introducing a ‘dummy’ molecule that catalyzes all reactions. However, we have found it useful to separate out catalysis explicitly for our mathematical analysis of catalytic reaction networks.
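The rewriting that absorbs catalysts into the reactions can be sketched as follows. This is an illustration under assumptions not stated in the text: the `Reaction` record and `absorb_catalysts` are hypothetical names, a reaction with several alternative catalysts is split into one copy per catalyst, and the dummy molecule (written `"*"` here) is presumed to be made freely available, for example by adding it to the food set.

```python
from typing import FrozenSet, Iterable, List, NamedTuple


class Reaction(NamedTuple):
    """Hypothetical record for one catalyzed reaction."""
    name: str
    reactants: FrozenSet[str]
    products: FrozenSet[str]
    catalysts: FrozenSet[str]


def absorb_catalysts(reactions: Iterable[Reaction], dummy: str = "*") -> List[Reaction]:
    """Rewrite each catalyst as both a reactant and a product.

    Assumption of this sketch: one copy of the reaction is emitted per
    alternative catalyst, and every copy is nominally catalyzed by `dummy`.
    """
    rewritten = []
    for r in reactions:
        for i, cat in enumerate(sorted(r.catalysts), start=1):
            rewritten.append(Reaction(
                name=f"{r.name}.{i}",
                reactants=r.reactants | {cat},  # catalyst is consumed ...
                products=r.products | {cat},    # ... and released again
                catalysts=frozenset({dummy}),
            ))
    return rewritten


# Hypothetical reaction: a + b -> c, catalyzed by either d or e.
original = Reaction("r1", frozenset("ab"), frozenset("c"), frozenset("de"))
for r in absorb_catalysts([original]):
    print(r.name, sorted(r.reactants), "->", sorted(r.products))
```

After the rewriting, a reaction can only proceed once its former catalyst is available as a reactant, so the catalysis requirement is carried by the reactant structure alone.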

It is also possible to express the RAF model within other mathematical frameworks; in particular, it can be rephrased within the context of Petri nets, a model that Sharov has used to study self-reproducing systems in biology in an early paper [40] (see also [41]). The concept of an RAF also shares some similarities with Rosen’s category-theoretic approach to metabolic closure [42,43]. Other mathematically-based models of autocatalysis have been proposed in, e.g., [44,45,46].

## 3. Autocatalytic Sets and the Origin of Life

The RAF framework is important and relevant to the origin of life in several ways. First, it places the notion of autocatalytic sets in a formal framework which can be (and has been) used and studied both analytically and computationally. This is a necessary first step when attempting to say anything about the plausibility of their appearance. As mentioned, other formal models have been introduced and studied previously, but many of these either already assumed the existence of one or more autocatalytic sets or cycles (such as the hypercycle [20,21,22] or the Chemoton model [44]), or their claims were based on flawed arguments (such as Kauffman’s original claim [26], as pointed out in [29]). Some claims (especially those arguing against the plausibility of autocatalytic sets) seem to lack any mathematical support at all [16]. So, having a mathematically sound model and computationally efficient method available to study the probability of the emergence of autocatalytic sets is an important step forward in itself.

Second, our RAF framework includes an efficient algorithm for finding autocatalytic sets in general reaction networks, which allows us to study their appearance both in model systems and in real (bio)chemical networks (such an algorithm was not provided with any of the other mentioned mathematical models of autocatalytic sets). In [37], we introduced a polynomial time algorithm for finding RAF sets, and applied it to Kauffman’s model of binary polymers with ligation and cleavage reactions [26,27]. The average running time of the algorithm was shown to be sub-quadratic (in the size of the reaction network). Therefore, it provides an efficient tool to study real (bio)chemical networks as well.

For example, recently the Beilstein database [47] of all known organic compounds and reactions was studied [48,49], showing that there is a core set (a strongly connected component) that contains only 4% of the compounds, but which together give rise to 78% of all known organic molecules in just a few reactions (three steps on average). These studies, however, did not take catalysis into account. It would be useful to apply the RAF algorithm to this same database of organic molecules and reactions, including the catalysis, and perform a similar analysis for the occurrence of RAF sets. In [49] the size of the reaction set studied was about 7 million reactions. The largest networks analyzed with the RAF algorithm in [37] were about 5 million catalyzed reactions, i.e., of the same order of magnitude as the Beilstein database. So, it is clearly possible to analyze this largest known real chemical reaction set with the RAF algorithm (which we hope to undertake in future work). Such an analysis could, for example, result in the identification of other possible candidates for prebiotic metabolism (like the reverse citric acid cycle), or provide useful directions for setting up chemical experiments to create autocatalytic sets in vitro, which is currently still one of the major challenges in systems chemistry [50,51].

As another example, complete metabolic networks of several organisms (mostly bacteria) were analyzed recently to find autocatalytically replicating molecules [52]. However, although highly original in its setup, the algorithmic method used in that study is not as mathematically complete as the RAF framework, and mainly finds individual molecules (as opposed to complete RAF sets). Therefore, this provides another setting where the RAF algorithm can be applied to analyze the occurrence of autocatalytic sets, which we indeed expect to do in the future. This could help in answering many questions about the appearance, size distribution, and structure of autocatalytic sets in real (evolved) biochemical networks.

Finally, and perhaps most importantly, the RAF framework has provided strong support for the claim that autocatalytic sets indeed have a high probability of occurrence, even with very moderate levels of catalysis. Our computational results in [37] indicate that only a linear growth in catalytic activity (with system size) is necessary for RAF sets to appear with high likelihood in Kauffman’s binary polymer model. This was subsequently verified analytically [39]. The level of catalysis necessary for RAF sets to occur in our simulations is between 1 and 2 reactions per molecule (on average), a number which is (bio)chemically quite realistic, especially for proteins [53,54,55]. This is in stark contrast to the exponential growth required in Kauffman’s original argument, and therefore re-instates his claim that in “sufficiently complex chemical reaction systems” autocatalytic sets will arise almost inevitably. Moreover, we have provided a formal way of quantifying “sufficiently complex”, in terms of the level of catalysis required. These results, combined with existing experimental evidence, make autocatalytic sets a serious and plausible candidate for consideration in origin of life scenarios.

It is important, though, to stress that the existence of an RAF, while necessary for the emergence of self-sustaining life, is far from sufficient for it, for at least three reasons. Firstly, the approach ignores the concentration and stoichiometry of reactants, the possibility of degrading side reactions or the presence of molecules that inhibit other reactions, or complex catalysis in which a catalyst may remain bound at the end of the reaction to some of the products. However, as with the Petri net formalism [41], some of these extensions can be built into the model; for example, inhibition in an RAF has already been modeled by a slightly more general definition of RAF in [39]. Secondly, the approach does not address (let alone solve) the ‘containment problem’ that requires reactions to be physically contained by some boundary or membrane so the reactants do not diffuse and reduce their concentrations. This is, of course, a problem faced by nearly all attempts to explain the origin of early life, and is not specific to the autocatalytic set, or metabolism first, point of view. Moreover, several proposed theories already exist to try to solve this ‘containment problem’, for example by considering reactions that take place on the surface of some (inorganic, possibly catalytic) substance [13,14,41], or reactions contained inside tiny, naturally occurring compartmental structures in e.g., ocean-floor thermal vents [35,36], or even through the formation of self-organizing lipid membranes [56]. And thirdly, the RAF approach does not directly address the problem of heredity in prebiotic systems, and the role of natural selection. But here too, this issue arises in all approaches to early-life models, and is obviously directly related to the containment problem.

While it will be interesting and worthwhile to extend the RAF approach to handle some of these complexities, we see the basic necessity (rather than sufficiency) of RAFs in early life research as the justification for searching for such subsets in biochemical systems: The RAF concept is a simple, testable and searchable combinatorial criterion, and while few RAFs would furnish a viable basis for early life, restricting attention to them severely limits the possible candidates one needs to examine in trying to identify the origins of primitive biochemistry.