Genomic Phylogeny Using the Maxwell TM Classiﬁer Based on Burrows–Wheeler Transform

: Background: In present genomes, current relics of a circular RNA appear which could have played a central role as a primitive catalyst of the peptide genesis. Methods: Using a proximity measure to this circular RNA and the distance, a new unsupervised classiﬁer called Maxwell TM has been constructed based on the Burrows–Wheeler transform algorithm. Results: By applying the classiﬁer to numerous genomes from various realms (Bacteria, Archaea, Vegetables and Animals), we obtain phylogenetic trees that are coherent with biological trees based on pure evolutionary arguments. Discussion: We discuss the role of the combinatorial operators responsible for the evolution of the genome of many species. Conclusions: We opened up possibilities for understanding the mechanisms of a primitive factory of peptides represented by an RNA ring. We showed that this ring was able to transmit some of its sub-sequences in the sequences of genes involved in the mechanisms of the current ribosomal production of proteins.


Introduction
Among the molecules that have possibly played an important role in the origin of life on Earth, the first RNAs and peptides were formed by chance through a concatenation process among the nucleotides and amino acids pools, respectively, synthesized from the atoms (C, O, H, and N) of the primitive atmosphere due to sufficient electrical discharge [1,2]. They combined in same favorable sites (volcanic hot spring pools [3], clays like montmorillonite [4], alkaline hydrothermal vent/serpentinization [5], etc.) giving rise to large polymers, e.g., circular RNAs and proteins, whose interactions allowed their reproduction and isolation from the external environment. RNA core was made of rings or chains with catalytic properties helping amino acids to bind together. Peptides created via this peptidebonding was later combined with lipids synthesized in the primitive atmosphere [6]. They could also assist with the synthesis of new RNA rings or chains that could serve further as ribozymes catalyzing the protein synthesis [7,8] as demonstrated in short segments of RNA [9,10]. By looking for the minimal circular RNA that first facilitated these interactions, we have previously identified an RNA structure [11,12], called AL (Archetypal Loop), capable of catalyzing peptide bonds between amino acids in its ring form ( Figure 1A) and resisting denaturing environmental conditions in its hairpin form ( Figure 1B) [13]. The AL sequence can be considered as the consensus sequence of tRNA loops of many species (only 4 species on Figure 1C and 242 others from GtRNAdB (see [14] and Supplementary Materials Table S1): ATGGTACTGCCATTCAAGATGA [15]. The RNA AL has interesting combinatorial properties: it comprises 22 nucleotides and offers 20 successive codons capable of binding transiently to the 20 amino acids of which they are the representatives in the genetic code  [14]).
It has been proven that amino acids have an affinity with their cognate codons and anti-codons involving weak electromagnetic or van der Waals forces [16,17], which causes transient binding between amino acids and the AL ring containing the corresponding cognate triplets of nucleotides, and after being spatially close together, amino acids can bind to each other or to a neighboring peptide, with the mechanism being analog to that of the present protein synthesis in current cells. This mechanism was proposed by Katchalsky [18] and Eigen [19], which showed that RNA, in particular the ancestors of current transfer RNA, could have been involved in a primitive matrix capable of catalyzing the synthesis of both peptides and new RNAs, favoring the emergence of an RNA world made of RNA molecules with catalytic and replicative properties [20,21].

Calculation of the Archetypal Loop-Proximity
The methodology chosen starts from the calculation of a proximity called the AL proximity, which estimates the degree of possible heritability from the AL of an RNA sequence. The sequences are obtained from the RefSeq database of NCBI (National Center for Biotechnology Information) [22], which contains the genomes of many species. On 5 May 2023, the RefSeq Release 218 included the genome of 133,740 organisms, with 52,503,423 of mRNA transcripts of 260,776,371 proteins of which the gene contains 24,000 nucleotides and the mRNA transcript 1300 nucleotides on average in humans. The method used to compare an RNA sequence to the AL involves counting the number of It has been proven that amino acids have an affinity with their cognate codons and anti-codons involving weak electromagnetic or van der Waals forces [16,17], which causes transient binding between amino acids and the AL ring containing the corresponding cognate triplets of nucleotides, and after being spatially close together, amino acids can bind to each other or to a neighboring peptide, with the mechanism being analog to that of the present protein synthesis in current cells. This mechanism was proposed by Katchalsky [18] and Eigen [19], which showed that RNA, in particular the ancestors of current transfer RNA, could have been involved in a primitive matrix capable of catalyzing the synthesis of both peptides and new RNAs, favoring the emergence of an RNA world made of RNA molecules with catalytic and replicative properties [20,21].

Calculation of the Archetypal Loop-Proximity
The methodology chosen starts from the calculation of a proximity called the AL proximity, which estimates the degree of possible heritability from the AL of an RNA sequence. The sequences are obtained from the RefSeq database of NCBI (National Center for Biotechnology Information) [22], which contains the genomes of many species. On 5 May 2023, the RefSeq Release 218 included the genome of 133,740 organisms, with 52,503,423 of mRNA transcripts of 260,776,371 proteins of which the gene contains 24,000 nucleotides and the mRNA transcript 1300 nucleotides on average in humans. The method used to compare an RNA sequence to the AL involves counting the number of common pentamers between those of the sequence and those located at the upper extremity of the hairpin form of the AL, which belongs to the following set, P, of 9 pentamers: P = {AUUCA, UUCAA, UCAAG, CAAGA, AAGAU, AGAUG, GAUGA, AUGAA, UGAAU}.
The 9 elements of P are called P-pentamers. They are extracted from an AL sequence located near the head of the hairpin form of the AL. We use P for defining a criterion of proximity to the AL for any RNA sequence, that is, the number of standard deviations (SDs) between calculated and expected numbers of P-pentamers in the chosen sequence. For example, let us consider the nucleotide sequence of length n = 2697 observed for the mRNA of the nucleolin of Camelus dromedarius ( Figure 2). Then, because the probability of ob-serving a pentamer by chance is p = 9/1024, the average number of expected P-pentamers is np = 2720 × (9/1024) = 23.9, with a standard deviation σ = np(1 − p)~23.9 1/2~4 .9. The number of calculated P-pentamers in the sequence is equal to 95; then, the difference between calculated and expected numbers is 95 − 23.9 = 71.1, corresponding to 71.1/4.9 = 14.5σ. The number of P-pentamers calculated in Figure 2 is 95 and the expected number is 23.9, with a difference equal to 14.5σ, where σ is equal to the standard deviation of the Bernoulli empirical distribution corresponding to the P-pentamers observed by chance. Then, the probability to observe these 95 P-pentamers equals to about 10 −14 . It is possible to search for relics such as the P-pentamers common to the AL and to rRNAs and mRNAs whose function is considered identical in the ribosomes of multiple species, like the mRNA of proteins nucleolin and NOL11 (see Supplementary Materials Tables S2 and S3). After calculating their P-proximity, we classify the corresponding mRNAs of various species using the classifier Maxwell TM , which is able to compare sequences of symbols [24], here the sequences of nucleotides, and conclude if the obtained clusters are coherent with the P-proximity values of their elements.

The Burrows-Wheeler Transform
The Burrows-Wheeler transform [25] is an algorithm used in lossless compression procedure which rearranges strings into runs of similar characters in a reversible way. Associated with a run-length algorithm, we obtain a function we use in "Normalized Compression Distance" (NCD) or Vitányi distance, in order to find similarities between them, like same repetition of motifs, same deletion or insertions, etc. The reason for the implementation of this "simplified" compression algorithm was to retrieve the symmetry of NCD. It is particularly convenient to compare genomic sequences independently of their length if they have coevolved under the action of the same operators. In evolution, there are 11 different genomic operators: Crossing-over, Mutation, Translocation, Insertion, Deletion, Transposition, Inversion, Repetition, Symmetrization, Palindromization, Because the Bernoulli distribution of the P-pentamers checks the approximation conditions by the Gaussian distribution, n = 2720 ≥ 30, np~23.9 ≥ 5, and n(1 − p)~2696 ≥ 5, so the probability of observing such a difference is less than 1 − F(14.5) < Proba({X ≥ 14.5}), where F is the Gaussian distribution function. Then, using the Gaussian distribution function approximation proposed in [23], we get: Because the value of the difference between the calculated and expected numbers of pentamers expressed as a number of standard deviations σ is directly linked to the probability of observing this difference, we retain this quantity as a measure of any RNA sequence's proximity to the AL, called P-proximity.
If the ring AL has played a role in building the first peptides, it is reasonable to search for the remnants of its nucleotidic sequence inside RNAs playing the same role in building the current proteins, e.g., ribosomal RNAs and mRNA of ribosomal proteins or of proteins favoring the accretion of the ribosomal components.
The number of P-pentamers calculated in Figure 2 is 95 and the expected number is 23.9, with a difference equal to 14.5σ, where σ is equal to the standard deviation of the Bernoulli empirical distribution corresponding to the P-pentamers observed by chance. Then, the probability to observe these 95 P-pentamers equals to about 10 −14 . It is possible to search for relics such as the P-pentamers common to the AL and to rRNAs and mRNAs whose function is considered identical in the ribosomes of multiple species, like the mRNA of proteins nucleolin and NOL11 (see Supplementary Materials Tables S2 and S3). After calculating their P-proximity, we classify the corresponding mRNAs of various species using the classifier Maxwell TM , which is able to compare sequences of symbols [24], here the sequences of nucleotides, and conclude if the obtained clusters are coherent with the P-proximity values of their elements.

The Burrows-Wheeler Transform
The Burrows-Wheeler transform [25] is an algorithm used in lossless compression procedure which rearranges strings into runs of similar characters in a reversible way. Associated with a run-length algorithm, we obtain a function we use in "Normalized Compression Distance" (NCD) or Vitányi distance, in order to find similarities between them, like same repetition of motifs, same deletion or insertions, etc. The reason for the implementation of this "simplified" compression algorithm was to retrieve the symmetry of NCD. It is particularly convenient to compare genomic sequences independently of their length if they have coevolved under the action of the same operators. In evolution, there are 11 different genomic operators: Crossing-over, Mutation, Translocation, Insertion, Deletion, Transposition, Inversion, Repetition, Symmetrization, Palindromization, and Permutation. When these operators are used with the same frequency during evolution, Burrows-Wheeler transform serves to compress the sequences of the same origin which have similar evolutionary history.
First, Burrows-Wheeler transform involves organizing the circular permutation of a word following the lexicographic order, then taking the last letter of these permuted words and calculate the run-length encoding (RLE) of this new word formed by the rank of the permutation identical to the initial word followed by the sequence of the last letters of permuted words, by indicating before the number of repeated letters (Figure 3). This coding constitutes a lossless compression method and during decompression, the initial word can be reconstructed exactly from this information in a reversible (or adiabatic) way. and Permutation. When these operators are used with the same frequency during evolution, Burrows-Wheeler transform serves to compress the sequences of the same origin which have similar evolutionary history. First, Burrows-Wheeler transform involves organizing the circular permutation of a word following the lexicographic order, then taking the last letter of these permuted words and calculate the run-length encoding (RLE) of this new word formed by the rank of the permutation identical to the initial word followed by the sequence of the last letters of permuted words, by indicating before the number of repeated letters (Figure 3). This coding constitutes a lossless compression method and during decompression, the initial word can be reconstructed exactly from this information in a reversible (or adiabatic) way.

The Vitányi Distance
The Vitányi distance between two sequences x and y [26,27] involves calculating the length of the RLE version of their Burrows-Wheeler transform (BWT), that is, the values of the coefficients Cx = Length[RLE(BWT(x))] and Cy = Length[RLE(BWT(y))], respectively, and then the value of the coefficient for the concatenated word xy, Cxy =

The Vitányi Distance
The Vitányi distance between two sequences x and y [26,27] involves calculating the length of the RLE version of their Burrows-Wheeler transform (BWT), that is, the values of the coefficients C x = Length[RLE(BWT(x))] and C y = Length[RLE(BWT(y))], respectively, and then the value of the coefficient for the concatenated word xy, C xy = Length[RLE(BWT(xy))] and calculating the ratio (Figure 3): d(x,y) = [C xy − min(C x ,C y )]/max(C x ,C y ). Vitányi distance using Burrows-Wheeler transform and run-length between words BANANA and CANADA is equal to 0,57 (Figure 3). Vitányi distance is a real mathematical distance, with d(x,x) = 0, d(x,y) = d(y,x) (symmetry) and d(x,z) ≤ d(x,y) + d(y,z) (triangular inequality).

The Maxwell TM Classifier
The principle of the Maxwell TM classifier [26] is to constitute clusters of words belonging to the set {x i } i=1,n , from which the distance matrix D ij = d(x i ,x j ) has been calculated. Then, each triplet of words constitutes a triangle in the graph associated with D and the area of this triangle is calculated using the classical Héron formula, and the original algorithm of Maxwell TM has the following steps: - Calculating the mean and standard deviation on histograms of triangle areas for filtering "large and deformed triangles" considered as outliers of the empirical distribution according to the number of standard deviations retained; -Examining sub-graphs whose "useless" (respectively "best") representative edges are identified as attached to the least (respectively the most) connected nodes and removing them (respectively keep them as cluster central node); -Processing sub-graphs with several local minima (i.e., nodes whose neighborhood does not contain another node that is closer to the sub-graph than the node itself) using Voronoï networking with the software Graphviz [28] for detecting internal boundaries; -Testing at the end for sub-graphs whose mean and standard deviation are varied until Graphviz no longer detects any boundaries; -Storing elements rejected by this statistics calculation in the form of "singleton clusters"; -Final recalling by clustering the population of singletons to detect new clusters. Table 1 shows that RNA classes whose content is homogeneous in AL-proximity, i.e., in evolutionary age (if the hypothesis on the primitivity of AL is true), are marked both by a large AL-proximity and by an upstream position in the Maxwell TM classification tree ( Figure 4). They correspond to ancient species in purely biological phylogenetic trees, calculated without reference to an ancestral RNA, and resulting only from comparisons between the genomic sequences of the compared species ( Figure 5). The Maxwell TM classification tree proposes a series of clusters organized from the root of the tree until its leaves and the content of each cluster is as presented in Table 1, which shows that one of the rules explaining the grouping in a class is the proximity to the AL of its members. It should be noted that at the root of the tree, where the hypothetical LUCA (the Last Universal Common Ancestor, defined first by C. Woese and G. Fox as the first living system [29,30]) is often placed, the ancient species of Salinarchaeum appears, which belongs to the very ancient classes of Archaea and Halobacteria from the Euryarchaeota branch ( Figure 5).      The first four clusters of the classification tree successively represent Archaea (class 1), Archaea with ancient Bacteria and Fungi (class 2), Bacteria with ancient Archaea (class 3) and Mammals (class 4). This classification respects the known hierarchies of successive clades, obtained by comparing genomes of the same nature (see the Supplementary Materials Table S4 for the whole clustering) and the Maxwell TM clustering with cladistic ranking can be described as a list, whose four first clusters are:

Discussion
In the classification obtained using the classifier Maxwell TM , there exists no information about the species, except the succession of nucleotides of some of their RNAs or mRNAS (5S ribosomal RNAs, nucleolin (NCL) and nucleolar protein (NOL11) mRNAs).
In Figure 5, the Archaea phylogeny [31] shows an organization compatible with the Maxwell TM classification tree in Figure 4. In particular, all the classes marked with a red star correspond to classes of the Maxwell TM tree, even though all their contents have not been systematically explored in the present study. This consistency between the classes discovered using the Maxwell TM algorithm, only from the nucleotide sequence of some RNAs and the classes of an Archaea phylogeny, is an important argument validating the new Maxwell TM classification method.

Conclusions and Perspectives
The challenging problem of finding an ancestor to RNAs related to the ribosomal protein factory can be partially solved by looking at the nucleotide sequence of some ribosomal RNAs and mRNAs of proteins involved in the building of the ribosome itself. Some invariant parts of these nucleotide sequences are detected via Maxwell TM and future work will be dedicated to the classification of random sequences, using Maxwell TM , respecting some evolutionary rules based on precise operators among the eleven acting in genome evolution and used via the genetic algorithms: Crossing-over, Mutation, Translocation, Insertion, Deletion, Transposition, Inversion, Repetition, Symmetrization, Palindromization, and Permutation [32,33]. This will allow us to extensively understand the hidden mechanisms of the Maxwell TM algorithm in detecting common motifs in the nucleotide sequences of ribosomal and messenger RNAs. As the Maxwell TM classifier mainly detects repeats, insertions, mutations and palindromizations common to multiple genomes that we wish to compare, the clustering trees obtained via it will have biological significance. These trees will complement the classical phylogenetic trees from the primitive molecular structures of the current species in order to refine our current knowledge on evolution.