Exploring Plant Sesquiterpene Diversity by Generating Chemical Networks

Plants produce a diverse portfolio of sesquiterpenes that are important in their response to herbivores and the interaction with other plants. Their biosynthesis from farnesyl diphosphate depends on the sesquiterpene synthases that admit different cyclizations and rearrangements to yield a blend of sesquiterpenes. Here, we investigate to what extent sesquiterpene biosynthesis metabolic pathways can be reconstructed just from the knowledge of the final product and the reaction mechanisms catalyzed by sesquiterpene synthases. We use the software package MedØlDatschgerl (MØD) to generate chemical networks and to elucidate pathways contained in them. As examples, we successfully consider the reachability of the important plant sesquiterpenes β -caryophyllene, α -humulene, and β -farnesene. We also introduce a graph database to integrate the simulation results with experimental biological evidence for the selected predicted sesquiterpenes biosynthesis.


Introduction
Terpenes form a large and diverse class of natural products appearing particularly in the essential oils of many plants. They have commercial uses in medicine and as fragrances in perfumery. Synthetic derivatives of natural terpenes are also used as aromas and foodadditives [1]. Ecologically, they perform key functions both in direct plant defense and in indirect mechanisms involving herbivores and their natural enemies [2].
Terpenes are produced throughout the tree of life from the C 5 compounds isopentenyl pyrophosphate (IPP) and dimethylallyl pyrophosphate (DMAPP). Leopold Ružička [3] formulated the are applicable to a given molecule or not. The representation of molecules as graphs and of reactions as graph transformations enables the efficient expansion of large networks of logically possible reactions. The integer flows in these networks correspond to feasible synthesis pathways, which can be identified efficiently as solutions of Integer Linear Programs (ILP). In Section 4, we describe in detail this approach and its computational efficiency as a key advantage, as well as adhering to FAIR Guiding Principles [24]. The software package MØD [25,26] makes it possible to explore the diversity of potential sesquiterpene synthesis pathways leading up to a user-defined set of reaction mechanisms (here, the types of rearrangements catalyzed by TPSs). Along with this, using a database of Scenarios taken from the literature, it is possible to filter from the universe of generated potential pathways those that are most plausibly expected in a given scenario on the basis of the available evidence. We have shown that such an explorative approach is indeed feasible and yields reasonable results without the need first to mine an extensive knowledge base. As examples, we consider the biosynthesis common plant sesquiterpenes as β-caryophyllene, α-humulene, and β-farnesene.

Results
In order to propose a potential biosynthesis pathway, we have solved the corresponding formal reachability problem: Given known starting material(s), a target molecule of which the biosynthesis is unknown, and a set of reaction rules modeling the class(es) of enzymes suspected to be involved in the potential pathway, we ask whether the target is reachable from the starting material. The graph grammar rules implemented here accurately emulate the natural chemical mechanisms for multiproduct sesquiterpene enzymes and their cyclization cascades.
It has been shown by a reduction to the word problem for (semi) groups [27] that this type of reachability question for chemical transformation systems is Turing undecidable. Hence, in practice, we only solve a restricted problem: Is the target reachable from the starting material within a bounded number of steps. This question can be solved efficiently within the MØD framework by expanding a network of potential reactions. Within this network, the reachability question reduces to the existence of a hyperpath connecting prescribed sources and targets -a problem that is solved efficiently by ILP. The reachability question for molecules of interest can be investigated under varying conditions by modulating the sets of reactions and/or additional molecules, e.g., nucleophiles, present during a reaction networks expansion and pathway search.
Terpene cyclization involves highly reactive cationic intermediates vulnerable to a nucleophilic attack or elimination reactions. This characteristic allows us to tie the changing mixture of reachable sesquiterpene products to variations in the external constraints which can, in many cases, be mapped to changes in the environmental conditions. A particular set of external conditions previously described in the literature, will, in the following, be called a Scenario. We also consider how some selected generated pathways fit into a Scenario, acknowledging their advantages and limitations.

Protonation-Dependent Diphosphate Cleavage
Natural class I terpene synthases precisely pre-orient their acyclic diphosphate substrates in their active site pockets in a reaction-ready conformation. Then, the complex reaction cascade is initiated by cleaving off the diphosphate group from the pre-oriented substrate molecule, a pH-dependent [4] process that requires divalent metal ions as cofactor [28,29]. It has been shown that this ionization step of the allylic diphosphate is the rate-limiting step in a class I terpene synthetase catalysis [30]. Then, the enzyme guides the resulting highly reactive carbocation via rearrangements of the carbon skeleton towards various polycyclic product molecules. The highly reactive carbocation intermediates can scavenge nucleophiles, present in the enzyme's active site, or can saturate the positive charge by deprotonation, terminating the rearrangement cascade in an early stage. A fair example of a rearrangement cascade is the isomerization of FPP to Nerolidyl Diphosphate (NPP) [31], where the diphosphate anion terminates the reaction cascade by reattaching to the carbocationic intermediate immediately after an allylic rearrangement has occurred. The reachability of NPP from FPP is trivial to ascertain in a constraint-independent simulation ( Figure 1) using only the cleavage of the diphosphate group and the addition of a nucleophile to a cation in the reaction set ( Figure 7).

Synthesis of β-Caryophyllene, α-Humulene, β-Farnesene, and their Side Products
(E)-β-caryophyllene is produced from FPP and is emitted by different plant tissues, often in response to a herbivore attack [16,32,33]. There is ample evidence that TPS catalysis produces a mixture of sesquiterpenes rather than a single sesquiterpene product [2,34,35]. The same mechanism that potentially produces β-caryophyllene, α-humulene, and β-farnesene also can produce additional compounds. Simulation 02 shows the computational reachability of these compounds ( Figure 2) together with P0, 0 as an intermediate compound, and the predicted side compounds P0, 1 and P0, 2. For this simulation, the applied set of rules are presented in Figure 8. E, E Humulyl cation C11+ Figure 2. The plotted result of Simulation 02 for β-caryophyllene, α-humulene, β-farnesene, and the side-predicted compounds from FPP.

Large-Scale Exploration of Terpene Space
Simulation 03 uses a more explorative derivation graph strategy. In this example, a vast diversity of the feasible compounds is generated starting with an FPP and a water molecule, an extended set of rules ( Figure 9) iteratively applied seven times ( Figure 3).

2Path-Sesquiterpenes Database
A reasonable question dealing with the predicted data is how to explain them. Another matter is how to express them properly. Graph databases are a suitable tool for this purpose that has been demonstrated to be an efficient and convenient way to store and explore metabolic networks [36,37]. Here, we used a proposed graph database (2Path-Sesquiterpenes) to store and enrich the simulation data with experimental evidence related to the predicted sesquiterpenes, the Scenarios. As an example of this use, any simulation using the provided set of rules can be stored using the complementary code Processes_Store, also available on GitHub.  Once stored, the generated chemical mechanisms level metabolic network can be traversed and handled through graph database query languages such as Cypher (https://neo4j.com/developer/ cypher-query-language) or Gremlin (https://tinkerpop.apache.org/gremlin.html) and visualized using tools such as Cytoscape [38]. Figure 4 shows a selected example of a query result for a stored simulation into Neo4J graph database (https://neo4j.com). This example shows how a particular generated pathway can be retrieved using Cypher queries, as well as its integration with Scenarios.  Figure 4. An example of a generated pathway stored in the Neo4J graph database. In this example, yellow nodes (label "Rules") denote the rules "diphosphate loss" and "H + loss". The green nodes (label "Compound") denote compounds from FPP to α-humulene and other generated compounds. The red nodes (label "Scenario") denote the Scenarios for the α-humulene biosynthesis. The details inside each Scenario node, as the experimental conditions, the plant tissue, EC (Enzyme Commission) numbers for the reactions, and cross-references, can be reached using the web interface, which is native for the Neo4J graph database. In this example, we expose these details for the particular Scenario S14.

Discussion
The metabolic networks have been abstracted by various data structures, including substrate graphs, bipartite graphs, directed hypergraphs, reaction graphs, stoichiometric matrix, and Petri-net [39][40][41]. Directed hypergraphs can overcome the conceptual limitations of the graph modeling of biological processes such as multilateral relationships, which are not compatible with graph edges [39]. These hypergraphs properties allow for multilateral relationships between the nodes resulting in a suitable description of biological processes [39]. For example, in a metabolic reaction such as (Compound 1 + Compound 2 → Compound 3 + Compound 4 ), a hypergraph allows the edges to connect more than two nodes.
A directed hypergraph is the result of a rule-based graph grammar that transforms compounds (abstracted as undirected graphs). Thus, the graph transformation framework MedØlDatschgerl granted a very intuitive way to explore the sesquiterpene biosynthesis through simulations of its cascade rearrangements. In this work, the proposed graph grammar rules emulate a set of natural cascade rearrangements. The resulting chemical network varies depending on the combination of rules, iteration steps, derivation strategy, and initial compounds. By exploring the possible chemical mechanisms of these reactions, it is reasonable to establish connections with the experimental results to draw from this universe of possibilities, those that can occur naturally or synthetically under certain circumstances.
Constraint-based models (CBM) have enormous potential in enhancing the understanding of reconstructed/predicted metabolic networks and predictive computational models by integrating biological evidence [42][43][44]. Omics studies can address important questions, as biosynthetic pathways, a selection of plants of interest, the environmental influence on the gene expression and the metabolome profile, and many further questions. 2Path-Sesquiterpenes is an optional feature of this work. It helps to address such questions focusing on plant sesquiterpene biosynthesis by aggregating the putative biological meaning to the predicted chemical network.
There are some related works as AFIR [23]/GRRM, RetroRules [22], and Biotransformer [21]. Isegawa et al. [45] had made computational simulations predicting pathways for terpene formation from a humulyl cation and other intermediate molecules using AFIR [23]/GRRM of which the chemical results align with ours, which allowed for a cross-confirmation. Compared to this work, the AFIR [23]/GRRM approach is a fundamentally distinct approach because it uses both a different data structure to design molecules and applies artificial forces between two or more reacting molecules. Both the Biotransformer and RetroRules [22] uses chemical reaction descriptions and rules encoded by SMARTS [46] and SMIRKS [47], and they are quite distinct from this work regarding the method. Different from these three related works, in 2Path-Sesquiterpenes, the geometry of chemical bonds are indistinguishable due to the data structure (undirected graphs) that abstracts the compound molecules.
Another matter is making the simulation results findable, accessible, interoperable, and reusable (FAIR). For this purpose, 2Path-Sesquiterpenes offers the option of storing the simulation results in a graph database. Metabolic network databases have been constructed since 1989 [48] through distinct methods, and many of them have been made available over time mostly due both to advances in metabolic network reconstruction methods and to the expansion of omic data. Despite their extensive range, such as KEGG (Kyoto Encyclopedia of Genes and Genomes) [49], Metacyc [50], and Reactome [51], to name a few, most of them do not provide information at the level of chemical mechanisms and intermediate compounds of a reaction, except for MACiE [52], which still offers rare data on biosynthetic sesquiterpene reactions. Also, graph databases can bring significant query performance improvements for selected problems including metabolic networks [51], confirming their suitability for this purpose. Table 1 shows an overview of some features of the 2Path-Sesquiterpenes and its related works.  [45] Internal AFIR/GRRM Sesquiterpenes Internal -RetroRules [22] SMART SMIRKS General -RetroRules BioTransformer [21] SMART SMIRKS General -MetXBioDB

Molecules as Undirected Graphs, Graph Transformation, Hypergraphs, and Integer Hyperflows
Labeled graphs are commonly used in the chemical literature to represent molecules. Undirected graphs are a convenient and natural way to model chemical molecules where the vertices denote atom types and the number of edges between the vertices denotes bond types. An undirected graph G can be defined as an ordered pair (V, E) consisting of a nonempty set of vertices V and a set of edges E disjoint of V [53]. Figure 5 shows an example of this abstraction for representing molecules as undirected graphs where Figure 5a shows the set of vertices V composed of v 1 and v 2 and Figure 5b shows the set of vertices V composed by v 1 and v 2 .   Graph grammars are formal systems describing rule-based graph transformation that generalize the much more commonly used term-rewriting systems [54]. In this context, the rules of a graph grammar transform undirected graphs (molecules) in other undirected graphs (molecules) through rules of a given graph grammar formalism. Thus, chemical reactions correspond to the transformations of graphs with particular features: Reactions may change the number of molecules; hence, both input (substrate) and output (product) graphs are not necessarily connected. (ii) All atoms are preserved, i.e., a chemical reaction defines a bijection between the vertex sets of input and output graphs.
Electrons are preserved as well, implying restrictive conditions on the way edges (bonds) can change, corresponding to chemical reaction mechanisms.
We favor the so-called double pushout (DPO) formalism [55] as a model of chemistry because it guarantees the structural reversibility of reactions [56] and it conveniently exposes the representation of the chemical transition state as part of the rule. In DPO graph rewriting, a transformation rule is of the form p = (L l ← − K r − → R) where L, R, and K are the left graph, right graph, and context graph, respectively. These three graphs are connected by graph morphisms l : K → L and r : K → R, describing the embedding of the context into the L and R. The application of the rule p to a graph G requires that L "matches" a part of G. The existence of another graph morphism captures this, the matching morphism m : L → G. Together, the rule p and the matching morphism m uniquely define the transformation G p,m =⇒ H of the substrate G to the product H by requiring that all morphisms in the following commutative diagram exist: In the context of modeling chemistry, we consider only injective graph morphisms, and we require that the restrictions of r and l to the vertex sets are bijective, ensuring a preservation of the atoms. Since each chemical reaction transforms a set of substrate molecules into a set of product molecules, chemical networks are directed hypergraphs, with molecules as vertices and concrete reactions as hyperedges. The iterated application of reaction rules to a set of starting molecules generates the network (directed hypergraph) of reachable molecules, i.e., the chemical space defined by the given starting molecules and reaction rules. Figure 6 shows a concrete example of a rule denoting the OPP loss and its application as a matching morphism in a molecule of FPP according to Diagram (1). Figure 6. An example of a graph grammar rule for the OPP loss applied to the FPP molecule. The resulting pathway is a directed hypergraph with the reachable molecules.
Chemical reactions preserve mass, atom types, and charges. Chemical reaction pathways, therefore, form flows in the reaction hypergraph that connect a set of input molecules with a set of output molecules [39,57]. As an immediate consequence, reachability questions in chemical reaction networks translate into the existence of integer flows [26], which is efficiently evaluated by means of Integer Linear Programming (ILP).

Simulations
Simulations were performed using MedØlDatschgerl (MØD) [25], a software package that combines a DPO graph rewriting engine and an ILP solver to generate and analyze large-scale reaction networks. MØD provides a Python interface (the Python 3 module PyMØD comprising bindings to the underlying library libMØD) as well as a system of generic exploration strategies [58] to guide and restrict the generation process. Graph transformation rules are specified manually in GML format [59]; substrate molecules can be provided either in GML or SMILES [60] format.
We have designed DPO graph transformation rules representing chemical mechanisms involved in the production of the plant sesquiterpenes β-caryophyllene, α-humulene, and β-farnesene from its precursor FPP, exploring the generation of distinct compounds through simulations with varying sets of rules. The presented simulations are available on GitHub. Figure 7 shows the rules employed in Simulation 01, Figure 8 shows the rules employed in Simulation 02, and Figure 9 shows the rules that, together with rules of Figures 7 and 8, were employed in Simulation 03.

Database Storage
NoSQL Graph Databases are Database Management Systems (DBMS) that can store graphs natively. They have been used in research with biological data, especially in cases where data integration is a determining factor [37]. In this work, we have used the Neo4J graph database to optionally store both the generated sesquiterpenes biosynthesis pathways and the previous data knowledge from Scenarios. The database was named 2Path-Sesquiterpenes, and its schema was modeled using the graphical diagram description called GRAPHED [61]. The modeled schema showed in Figure 10 minimizes the transition from the generated hypergraph and links it to the data from Scenarios. It makes it possible to associate the metadata to the predicted pathways to establish a relationship between the predictions and biological evidence.
(a) The OPP loss from FPP.
(c) C 1 to C 3 hidrid shift.  In the 2Path-Sesquiterpenes database, the nodes labeled as Compound represent the compounds generated during the simulation. The nodes labeled as Reaction represent the directed hyperedges, i.e., a chemical mechanism meant by an application of a graph grammar rule. All relationships between a Compound and a Reaction are directed and labeled as TO. Figure 10 shows all data inside each Node.
The 2Path-Sesquiterpenes database nodes labeled as Scenarios provide a kernel of manually curated biological experimental evidence about constraints under which the compound are produced. The Scenarios provide an NCBI accession number for the enzymes; a PUBMED accession number for the associated publication with the experimental results; the experimental conditions; the plant tissue using EMBL-EBI Plant Ontology (https://www.ebi.ac.uk/ols/ontologies/po); the compound yield; the EC numbers for the reactions; and the cross references to KEGG [49], Rhea [62] and ExploEnz (IUBMB) [63] in a taxonomic range of species. A relationship labeled as OCCURS is created between a node Scenarios and a node Compound when there is a biological scenario supporting the biosynthesis of this predicted compound.