The Maximal C3 Self-Complementary Trinucleotide Circular Code X in Genes of Bacteria, Archaea, Eukaryotes, Plasmids and Viruses

In 1996, a set X of 20 trinucleotides was identified in genes of both prokaryotes and eukaryotes which has on average the highest occurrence in reading frame compared to its two shifted frames. Furthermore, this set X has an interesting mathematical property as X is a maximal C3 self-complementary trinucleotide circular code. In 2015, by quantifying the inspection approach used in 1996, the circular code X was confirmed in the genes of bacteria and eukaryotes and was also identified in the genes of plasmids and viruses. The method was based on the preferential occurrence of trinucleotides among the three frames at the gene population level. We extend here this definition at the gene level. This new statistical approach considers all the genes, i.e., of large and small lengths, with the same weight for searching the circular code X. As a consequence, the concept of circular code, in particular the reading frame retrieval, is directly associated to each gene. At the gene level, the circular code X is strengthened in the genes of bacteria, eukaryotes, plasmids, and viruses, and is now also identified in the genes of archaea. The genes of mitochondria and chloroplasts contain a subset of the circular code X. Finally, by studying viral genes, the circular code X was found in DNA genomes, RNA genomes, double-stranded genomes, and single-stranded genomes.


Introduction
Circular code is a mathematical structure of genes and genomes. This concept, initially found for genes, has been extended to genomes (non-coding regions of eukaryotes) according to recent results. A circular code X is a set of words such that any motif from X, called an X motif, allows the original (construction) frame to be retrieved, maintained, and synchronized.
The circular code X identified in the genes of bacteria, eukaryotes, plasmids, and viruses [1,2] contains the following 20 trinucleotides

$$X = \{AAC, AAT, ACC, ATC, ATT, CAG, CTC, CTG, GAA, GAC, GAG, GAT, GCC, GGC, GGT, GTA, GTC, GTT, TAC, TTC\} \qquad (1)$$

which allow both the reading frame to be retrieved with a window of 13 nucleotides (Figure 3 in [3]) and the following 12 amino acids to be coded

$$\{Ala, Asn, Asp, Gln, Glu, Gly, Ile, Leu, Phe, Thr, Tyr, Val\}. \qquad (2)$$

The current genetic code is not circular. Thus, it cannot retrieve the reading frame. The loss during evolution of this circular code property on the 4-letter alphabet {A, C, G, T} required a complex translation mechanism using 20 amino acids and proteins in current genomes. X motifs from Equation (1) are identified in (i) genes "universally" [1,4]; (ii) tRNAs of prokaryotes and eukaryotes [3,5]; (iii) rRNAs of prokaryotes (16S) and eukaryotes (18S), in particular in the ribosome decoding center where the universally conserved nucleotides G530, A1492, and A1493 are included in the X motifs [3,6,7]; and (iv) genomes (non-coding regions of eukaryotes) [4,8].
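As a quick check of the coding statement above, the following minimal sketch translates the 20 trinucleotides of X with the standard genetic code and collects the distinct amino acids. The set `X` is written out as published in [1,2]; the codon table string uses the conventional T, C, A, G ordering; the helper name `amino_acids_coded` is a hypothetical name chosen for illustration.

```python
from itertools import product

# The 20 trinucleotides of the circular code X (Equation (1)).
X = {"AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
     "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC"}

# Standard genetic code, codons enumerated in T, C, A, G order
# (TTT, TTC, TTA, TTG, TCT, ...); '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def amino_acids_coded(code):
    """Return the set of one-letter amino acids coded by a trinucleotide set."""
    return {CODON_TABLE[t] for t in code}


coded = amino_acids_coded(X)
print(sorted(coded), len(coded))
```

The one-letter symbols correspond to Ala (A), Asn (N), Asp (D), Gln (Q), Glu (E), Gly (G), Ile (I), Leu (L), Phe (F), Thr (T), Tyr (Y), and Val (V).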
The X motifs of maximal cardinality 20 (composition) in genes, with the properties of the circular code, $C^3$ and complementarity, allow the two reading frames and the four shifted frames to be retrieved by pairing between DNAs-DNAs, DNAs-mRNAs, mRNAs-rRNAs, mRNAs-tRNAs, and rRNAs-tRNAs, as shown with a 3D visualization of the X motifs in the ribosome [3,6,7].
The X motifs in genomes have a different structure compared to the X motifs in genes [8]. Indeed, their cardinality is not maximal (fewer than 10, as an order of magnitude), their size is longer, and their structure contains repeated trinucleotides. Furthermore, the X motifs of minimal cardinality 1 generated with the 20 repeated trinucleotides $t^n$ where $t \in X$ (Equation (1)) are very common in the genomes of eukaryotes (e.g., [8][9][10]). Their length n can be very large (e.g., n > 6000, see Figure 1). The repeated trinucleotides are very unstable, with mutation rates up to 100,000 times higher than the genomic average mutation rate. Mutations in repeats increase their evolutionary stability.
Life 2017, 7, 20

Figure 1. Evolution of X circular code motifs (Equation (1)) by increasing their cardinality (composition) and decreasing their length. Evolution begins with X motifs of minimal cardinality 1 (long repeated trinucleotides) in genomes (the examples given are extracted from Table 2 in [8]). Then, the mutations in repeated trinucleotides lead to X motifs of low cardinality <10 (different and short repeated trinucleotides) in genomes (the examples given are extracted from Table 4 in [8]), up to X motifs of high (≥10) and maximal cardinality 20 coding the 12 amino acids (Equation (2)).
A model of evolution of the X motifs in genes and genomes can be proposed according to the previous works and the recent results [8]. It proposes that the X motifs of maximal cardinality 20 in genes have evolved from the X motifs of minimal cardinality 1 (repeated trinucleotides) in genomes (Figure 1). An X motif of minimal cardinality 1, which is unstable, mutates into an X motif of low cardinality <10, thus containing different repeated trinucleotides of short lengths. This evolutionary process continues by increasing the cardinality and decreasing the length of the X motifs until it generates the X motifs of high (≥10) and maximal cardinality 20 coding the 12 amino acids (Equation (2)) in genes.
The X motifs of high cardinality have acquired the protein coding function in addition to the reading frame retrieval. This model suggests that the property of reading frame retrieval has preceded the protein coding function.
Since 1996, all the statistical analyses studying the preferential occurrence of trinucleotides among the three frames were done at the gene population level (kingdoms, taxonomic groups, genomes). We extend here the method from [1] at the gene level. This new approach is important as all the genes, i.e., of large and small lengths, are now considered with the same weight in the statistical definition for searching the circular code X. As a consequence, the concept of circular code, in particular the reading frame retrieval, is directly associated to each gene. Thus, at the gene level, the circular code X is searched here in the genes of bacteria, archaea, eukaryotes, plasmids, viruses, and eukaryotic organelles, i.e., mitochondria and chloroplasts. Finally, genes of double-stranded DNA and RNA viruses, and single-stranded DNA and RNA viruses are also analysed with this approach in order to assign a genetic information unit (DNA or RNA, double-stranded or single-stranded) to the circular code X.

Definitions
We recall a few definitions without detailed explanations (i.e., without figures and examples) for understanding the main properties of the trinucleotide circular code X identified in genes [1,2].

Notation 1.
Let us denote the nucleotide 4-letter alphabet $B = \{A, C, G, T\}$ where A stands for Adenine, C stands for Cytosine, G stands for Guanine, and T stands for Thymine. The trinucleotide set over B is denoted by $B^3 = \{AAA, \ldots, TTT\}$. The set of non-empty words (words, respectively) over B is denoted by $B^+$ ($B^*$, respectively).

Notation 2.
Genes have three frames f. By convention here, the reading frame f = 0 is set up by a start trinucleotide {ATG, CTG, GTG, TTG}, and the frames f = 1 and f = 2 are the reading frame f = 0 shifted by one and two nucleotides in the 5′-3′ direction (to the right), respectively.
Two biological maps are involved in gene coding.

Definition 1.
According to the complementary property of the DNA double helix, the nucleotide complementarity map $C : B \to B$ is defined by C(A) = T, C(C) = G, C(G) = C, and C(T) = A. According to the complementary and antiparallel properties of the DNA double helix, the trinucleotide complementarity map $C : B^3 \to B^3$ is defined by $C(l_0 l_1 l_2) = C(l_2)C(l_1)C(l_0)$ for all $l_0, l_1, l_2 \in B$. By extension to a trinucleotide set S, the set complementarity map $C : \mathbb{P}(B^3) \to \mathbb{P}(B^3)$, $\mathbb{P}$ being the set of all subsets of $B^3$, is defined by $C(S) = \{v : u, v \in B^3, u \in S, v = C(u)\}$.

Definition 2.
The trinucleotide circular permutation map $\mathcal{P} : B^3 \to B^3$ is defined by $\mathcal{P}(l_0 l_1 l_2) = l_1 l_2 l_0$ for all $l_0, l_1, l_2 \in B$. $\mathcal{P}^2$ denotes the 2nd iterate of $\mathcal{P}$. By extension to a trinucleotide set S, the set circular permutation map $\mathcal{P} : \mathbb{P}(B^3) \to \mathbb{P}(B^3)$ is defined by $\mathcal{P}(S) = \{v : u, v \in B^3, u \in S, v = \mathcal{P}(u)\}$, e.g., $\mathcal{P}(\{CGA, GAT\}) = \{ATG, GAC\}$ and $\mathcal{P}^2(\{CGA, GAT\}) = \{ACG, TGA\}$.
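The two maps of Definitions 1 and 2 can be sketched in a few lines. The function names (`complement`, `permute`, `complement_set`, `permute_set`) are hypothetical and chosen only for illustration; the permutation example reproduces the one given in Definition 2.

```python
# Nucleotide complementarity map C on the alphabet B = {A, C, G, T}.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(t):
    """Trinucleotide complementarity map: C(l0 l1 l2) = C(l2) C(l1) C(l0)."""
    return "".join(COMPLEMENT[l] for l in reversed(t))


def permute(t, k=1):
    """k-th iterate of the circular permutation map: P(l0 l1 l2) = l1 l2 l0."""
    k %= len(t)
    return t[k:] + t[:k]


def complement_set(s):
    """Set complementarity map C(S)."""
    return {complement(t) for t in s}


def permute_set(s, k=1):
    """Set circular permutation map P^k(S)."""
    return {permute(t, k) for t in s}


print(permute_set({"CGA", "GAT"}))      # P({CGA, GAT})
print(permute_set({"CGA", "GAT"}, 2))   # P^2({CGA, GAT})
print(complement_set({"CGA", "GAT"}))   # C({CGA, GAT})
```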

Definition 4.
Any non-empty subset of the code $B^3$ is a code, called a trinucleotide code C.
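The trinucleotide set $X = X_0$ (Equation (1)) coding the reading frame (f = 0) in genes is a maximal (20 trinucleotides) $C^3$ self-complementary trinucleotide circular code [2] where the circular code $X_1 = \mathcal{P}(X)$ coding the frame f = 1 contains the following 20 trinucleotides

$$X_1 = \{AAG, ACA, ACG, ACT, AGC, AGG, ATA, ATG, CCA, CCG, GCG, GTG, TAG, TCA, TCC, TCG, TCT, TGC, TTA, TTG\} \qquad (3)$$

and the circular code $X_2 = \mathcal{P}^2(X)$ coding the frame f = 2 contains the following 20 trinucleotides

$$X_2 = \{AGA, AGT, CAA, CAC, CAT, CCT, CGA, CGC, CGG, CGT, CTA, CTT, GCA, GCT, GGA, TAA, TAT, TGA, TGG, TGT\}. \qquad (4)$$

The trinucleotide circular codes $X_1$ and $X_2$ are related by the permutation map, i.e., $X_2 = \mathcal{P}(X_1)$ and $X_1 = \mathcal{P}^2(X_2)$, and by the complementary map, i.e., $X_1 = C(X_2)$ and $X_2 = C(X_1)$ [14].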
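The relations stated above between X, $X_1$, and $X_2$ can be checked mechanically. The sketch below derives $X_1$ and $X_2$ from X by circular permutation and tests the self-complementarity of X, the relation $X_1 = C(X_2)$, and the fact that the three codes partition the 60 non-periodic trinucleotides. It does not test circularity itself, which requires a separate argument [2]. All function and variable names are hypothetical.

```python
from itertools import product

X = {"AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
     "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(t):
    """Reverse complement of a trinucleotide."""
    return "".join(COMPLEMENT[l] for l in reversed(t))


def permute(t, k=1):
    """Circular permutation of a trinucleotide by k letters."""
    return t[k:] + t[:k]


X1 = {permute(t, 1) for t in X}   # code of frame 1
X2 = {permute(t, 2) for t in X}   # code of frame 2

all_trinucleotides = {"".join(p) for p in product("ACGT", repeat=3)}
periodic = {"AAA", "CCC", "GGG", "TTT"}

checks = {
    "X is self-complementary": {complement(t) for t in X} == X,
    "X1 = C(X2)": {complement(t) for t in X2} == X1,
    "X2 = C(X1)": {complement(t) for t in X1} == X2,
    "X, X1, X2 are pairwise disjoint": not (X & X1 or X & X2 or X1 & X2),
    "X, X1, X2 cover the non-periodic trinucleotides":
        X | X1 | X2 == all_trinucleotides - periodic,
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```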
Several classes of methods were developed for identifying the circular code X in genes over the last 20 years: frequency methods [2,15,16], correlation function [17], covering capability function [18], and occurrence probability of a complementary/permutation (CP) trinucleotide set at the gene population level [1].
The class of the 216 $C^3$ self-complementary trinucleotide circular codes (Definition 7; [2]; list given in Tables 4a, 5a, and 6a in [19]; [20]) is included in a larger class of codes C, obtained by relaxing the circularity property, which was defined in [1]. The statistical approach developed here analyses the $C^3$ self-complementary codes (Definition 8) for searching the particular circular code X.

Gene Kingdoms
Gene kingdoms K of bacteria B, archaea A, plasmids P, eukaryotes E, chromosomes of eukaryotes $E_{chr}$, mitochondria M, chloroplasts C, viruses V, and its five taxonomic groups (double-stranded DNA viruses $V_{dsDNA}$, double-stranded RNA viruses $V_{dsRNA}$, single-stranded DNA viruses $V_{ssDNA}$, single-stranded RNA viruses $V_{ssRNA}$, and retro-transcribing viruses $V_{rt}$) are obtained from the GenBank database (http://www.ncbi.nlm.nih.gov/genome/browse/, May 2016) (Table 1). Computer tests exclude genes when (i) their nucleotides do not belong to the alphabet B; (ii) they do not begin with a start trinucleotide {ATG, CTG, GTG, TTG}; (iii) they do not end with a stop trinucleotide {TAA, TAG, TGA}; and (iv) their lengths are not multiples of 3. In order to have an order of magnitude of data acquisition (details in
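The four exclusion tests (i) to (iv) can be expressed as a single predicate. This is a minimal sketch, assuming each gene is available as a plain string; the function name `is_valid_gene` is hypothetical, and no GenBank parsing is shown.

```python
ALPHABET = set("ACGT")
START = {"ATG", "CTG", "GTG", "TTG"}
STOP = {"TAA", "TAG", "TGA"}


def is_valid_gene(seq):
    """Return True if a gene passes the four tests (i)-(iv)."""
    seq = seq.upper()
    return (
        set(seq) <= ALPHABET        # (i) only letters of the alphabet B
        and seq[:3] in START        # (ii) begins with a start trinucleotide
        and seq[-3:] in STOP        # (iii) ends with a stop trinucleotide
        and len(seq) % 3 == 0       # (iv) length is a multiple of 3
    )


genes = ["ATGAAATAA", "ATGAANTAA", "AAAAAATAA", "ATGAATAA"]
print([g for g in genes if is_valid_gene(g)])
```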

Preferential Frame of a Trinucleotide in a Gene
The method developed in [1] for identifying the circular code X in genes determined the preferential frame of trinucleotides at the gene population level (kingdoms, taxonomic groups, genomes), i.e., after summing the trinucleotide frequencies of all genes in a kingdom. We extend this method at the gene level, i.e., the preferential frame of trinucleotides among the three frames is determined for each gene. There is no sum of trinucleotide frequencies of all genes in a kingdom. Thus, all the genes, i.e., of large and small lengths, have the same weight in respect to the preferential frame.
Consider a gene kingdom K listed in Table 1. Let $Pr_f(t, g)$ be the occurrence frequency of a trinucleotide $t \in B^3$ in a frame $f \in \{0, 1, 2\}$ of a gene g belonging to a kingdom K. Thus, there are 3 × 64 = 192 trinucleotide occurrence frequencies $Pr_f(t, g)$ in the three frames f of a gene g. Then, the preferential frame $F(t, g) \in \{0, 1, 2\}$ of a trinucleotide t in a gene g is the frame of maximal occurrence frequency $Pr_f(t, g)$ among the three frames f of g

$$F(t, g) = \underset{f \in \{0, 1, 2\}}{\arg\max}\; Pr_f(t, g). \qquad (5)$$

In other words, the three frequencies of a given trinucleotide are computed in the three frames 0, 1, and 2 of a gene. Then, the preferential frame of the trinucleotide in this gene is the frame associated to its highest trinucleotide frequency.

Remark 1.
In [1], the three occurrence frequencies $Pr_f(t, K)$ of a trinucleotide t in the three frames f computed in a gene kingdom K always have different values; thus a unique preferential frame can be assigned to the trinucleotide. At the gene level, particularly for genes g of small lengths, a trinucleotide t may have an identical occurrence frequency $Pr_f(t, g)$ in two or three frames f. In this case, two or three preferential frames $F(t, g)$ are assigned to the trinucleotide t. If a trinucleotide t is absent in a gene g, mainly for genes g of very small lengths, then no preferential frame is attributed to t.
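The indicator function $\delta_f(F(t, g)) \in \{0, 1\}$ is 1 if the preferential frame $F(t, g)$ of a trinucleotide t is equal to the frame f of a gene g, and 0 otherwise

$$\delta_f(F(t, g)) = \begin{cases} 1 & \text{if } F(t, g) = f \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$

where $F(t, g)$ is defined in Equation (5).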
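The preferential frame, the tie handling of Remark 1, and the indicator function can be sketched as follows. Two points are assumptions rather than statements from the text: each frame f is read as non-overlapping trinucleotides starting at offset f (an incomplete trailing trinucleotide is dropped), and frequencies are compared exactly with `Fraction` so that ties are detected reliably. Function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def frame_frequencies(gene, f):
    """Occurrence frequencies Pr_f(t, g) of the trinucleotides read in frame f."""
    tris = [gene[i:i + 3] for i in range(f, len(gene) - 2, 3)]
    counts = Counter(tris)
    return {t: Fraction(c, len(tris)) for t, c in counts.items()}


def preferential_frames(gene, t):
    """Set of preferential frames F(t, g); empty if t is absent from the gene.

    Ties yield two or three frames, as described in Remark 1.
    """
    freqs = [frame_frequencies(gene, f).get(t, Fraction(0)) for f in range(3)]
    best = max(freqs)
    if best == 0:
        return set()
    return {f for f in range(3) if freqs[f] == best}


def indicator(f, frames):
    """Indicator delta_f: 1 if f is a preferential frame, 0 otherwise."""
    return 1 if f in frames else 0


gene = "ATGAACAACTAA"
frames = preferential_frames(gene, "AAC")
print(frames, [indicator(f, frames) for f in range(3)])
```

In the toy gene above, AAC occurs only in the reading frame, so the indicator is 1 for f = 0 and 0 for the two shifted frames.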

Number of Preferential Frames of a Trinucleotide in a Gene Kingdom
The number $Nb_f(t, K) \in \mathbb{N}$ of preferential frames of a trinucleotide $t \in B^3$ for each frame $f \in \{0, 1, 2\}$ in a gene kingdom K is simply obtained by summing for all genes in K

$$Nb_f(t, K) = \sum_{g \in K} \delta_f(F(t, g)) \qquad (7)$$

where $\delta_f(F(t, g))$ is defined in Equation (6).
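The summation over the genes of a kingdom can be sketched as a simple accumulation. The input format is an assumption made for illustration: one dictionary per gene, mapping each trinucleotide to its set of preferential frames (possibly several in case of ties, or empty if the trinucleotide is absent). The function name is hypothetical.

```python
from collections import defaultdict


def count_preferential_frames(per_gene):
    """Accumulate Nb_f(t, K) over all genes of a kingdom.

    per_gene: iterable of dicts {trinucleotide: set of preferential frames}.
    Returns {trinucleotide: [Nb_0, Nb_1, Nb_2]}.
    """
    nb = defaultdict(lambda: [0, 0, 0])
    for frames_by_trinucleotide in per_gene:
        for t, frames in frames_by_trinucleotide.items():
            for f in frames:
                nb[t][f] += 1   # every gene contributes with the same weight
    return dict(nb)


# Three toy genes: a unique frame, a tie between two frames, and an absent trinucleotide.
kingdom = [
    {"AAC": {0}},
    {"AAC": {0, 1}},
    {"AAC": set(), "GTT": {2}},
]
print(count_preferential_frames(kingdom))
```

Because each gene adds at most 1 to a given counter, long and short genes weigh the same, which is the point of the gene-level definition.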

Occurrence Probability of a Complementary/Permutation Trinucleotide Set in a Gene Kingdom
In order to study the $C^3$ self-complementary codes C (Definition 8) including the class of circular codes, and in particular the circular code X, Equation (7) for a trinucleotide t is expanded to a set T of six trinucleotides involving the complementarity map C and the permutation map $\mathcal{P}$ simultaneously, $T = (T^0, T^1, T^2)$ with $T^0 = \{t, C(t)\}$ in frame 0, $T^1 = \mathcal{P}(T^0) = \{\mathcal{P}(t), \mathcal{P}(C(t))\}$ in frame 1, $T^2 = \mathcal{P}^2(T^0) = \{\mathcal{P}^2(t), \mathcal{P}^2(C(t))\}$ in frame 2, and $t \in B^3 \setminus \{AAA, CCC, GGG, TTT\}$. T is called a complementary and permutation (CP) trinucleotide set and is completely defined by the trinucleotide t.
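The construction of a CP trinucleotide set from a trinucleotide t, and the enumeration of the 30 distinct CP sets, can be sketched as follows. Since t and C(t) generate the same set, each CP set is identified here by its frame-0 pair {t, C(t)}. The names `cp_set` and `all_cp_sets` are hypothetical; the set `X` is written out as in Equation (1).

```python
from itertools import product

X = {"AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
     "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
PERIODIC = {"AAA", "CCC", "GGG", "TTT"}


def complement(t):
    return "".join(COMPLEMENT[l] for l in reversed(t))


def permute(t, k=1):
    return t[k:] + t[:k]


def cp_set(t):
    """CP set T = (T0, T1, T2) generated by a non-periodic trinucleotide t."""
    t0 = frozenset({t, complement(t)})
    t1 = frozenset(permute(u, 1) for u in t0)
    t2 = frozenset(permute(u, 2) for u in t0)
    return (t0, t1, t2)


def all_cp_sets():
    """Enumerate the distinct CP sets, one per complementary pair {t, C(t)}."""
    seen, result = set(), []
    for letters in product("ACGT", repeat=3):
        t = "".join(letters)
        if t in PERIODIC:
            continue
        T = cp_set(t)
        if T[0] not in seen:
            seen.add(T[0])
            result.append(T)
    return result


sets = all_cp_sets()
in_x = [T for T in sets if T[0] <= X]
print(len(sets), len(in_x))
print(cp_set("AAC"))
```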

Notation 3.
A CP trinucleotide set $T = (T^0, T^1, T^2)$ belongs to the $C^3$ self-complementary trinucleotide circular code X, i.e., $T \in X$, if $T^0 \cap X \neq \emptyset$, i.e., if the trinucleotide t and its complementary trinucleotide C(t) belong to X. Ten CP trinucleotide sets T among 30 belong to the $C^3$ circular code X, i.e., there are 10 sets with $T^0 \in X$, $T^1 = \mathcal{P}(T^0) \in \mathcal{P}(X) = X_1$, and $T^2 = \mathcal{P}^2(T^0) \in \mathcal{P}^2(X) = X_2$.

Notation 4.
In order to facilitate the reading of Table 2, the 30 CP trinucleotide sets $T = (T^0, T^1, T^2)$ are presented in the following way: (i) the first 10 sets $T_1, \ldots, T_{10}$ belong to the circular code X (with $T^0 = \{t, C(t)\} \in X$, $T^1 \in X_1$ and $T^2 \in X_2$) and are in lexicographical order with respect to the trinucleotide $t \in X$ (in bold), and (ii) the 20 remaining sets $T_{11}, \ldots, T_{30}$ are in lexicographical order with respect to the trinucleotide $t \in X_1$ (in italics).
The occurrence number $Nb(T, K)$ of a CP trinucleotide set $T = (T^0, T^1, T^2)$ in a gene kingdom K is given by Equation (8), where $Nb_f(t, K)$ is defined in Equation (7).
In order to normalize the numbers $Nb(T, K)$, which depend on the numbers of genes in a kingdom K, we simply define the occurrence probability $Pb(T, K)$ of a CP trinucleotide set $T = (T^0, T^1, T^2)$ in a gene kingdom K by Equation (9), where $Nb(T, K)$ is defined in Equation (8).
The parameter Rk(T, K) ∈ {1, . . . , 30} gives the rank of the values Pb(T, K) among the 30 CP trinucleotide sets T, the 1st rank being associated to the highest value of Pb(T, K) and the 30th rank, to the lowest value of Pb(T, K).
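The following sketch shows one plausible reading of the occurrence number, the occurrence probability, and the rank. Equations (8) and (9) are not reproduced in this section, so two assumptions are made and should be checked against the original: $Nb(T, K)$ sums $Nb_f(t, K)$ over the trinucleotides t of $T^f$ for each frame f, and $Pb(T, K)$ divides $Nb(T, K)$ by the total over the 30 CP sets (which matches the sample size n used in the statistical test). Names and input formats are hypothetical.

```python
def occurrence_number(T, nb):
    """Assumed form of Nb(T, K): sum of Nb_f(t, K) for t in T^f, f = 0, 1, 2.

    T:  tuple (T0, T1, T2) of trinucleotide sets.
    nb: dict {trinucleotide: [Nb_0, Nb_1, Nb_2]}.
    """
    return sum(nb.get(t, (0, 0, 0))[f] for f, tf in enumerate(T) for t in tf)


def probabilities_and_ranks(cp_sets, nb):
    """Assumed form of Pb(T, K) and Rk(T, K); rank 1 is the highest probability."""
    numbers = [occurrence_number(T, nb) for T in cp_sets]
    total = sum(numbers)
    pb = [n / total for n in numbers]
    order = sorted(range(len(pb)), key=lambda i: -pb[i])  # ties keep input order
    rank = [0] * len(pb)
    for r, i in enumerate(order, start=1):
        rank[i] = r
    return pb, rank


# Toy input: two CP sets and the counts reported for AAC in bacteria.
cp_sets = [
    ({"AAC", "GTT"}, {"ACA", "TTG"}, {"CAA", "TGT"}),   # AAC read in frame 0
    ({"ACA", "TGT"}, {"CAA", "GTT"}, {"AAC", "TTG"}),   # AAC read in frame 2
]
nb = {"AAC": [3486, 1742, 1819]}
pb, rank = probabilities_and_ranks(cp_sets, nb)
print(pb, rank)
```

A real computation would use all 30 CP sets and the counts of all 60 non-periodic trinucleotides; the two-set input only illustrates the bookkeeping.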

A Statistical Test to Evaluate the Significance of the Obtained Ranks
In order to evaluate the statistical significance of the ranks $Rk(T, K)$ of the probabilities $Pb(T, K)$ (Equation (9)) of the 30 CP trinucleotide sets T in a given kingdom K, we derive confidence intervals for $Pb(T, K)$. If the confidence intervals for two probabilities $Pb(T, K)$ do not overlap, then their associated ranks $Rk(T, K)$ are assumed to be valid (in the population). The confidence interval for two probabilities $Pb(T, K)$ is evaluated by using the classical 2-sample z-test, which is briefly recalled here.
Let P(T) and P(T′) be the populations associated to the CP trinucleotide sets T and T′ of probabilities Pb(T, P) and Pb(T′, P), respectively. The probabilities $Pb(T, K)$ and $Pb(T', K)$ of T and T′ are observed in a given gene kingdom K (sample) of size $n = \sum_{i=1}^{30} Nb(T_i, K)$ (defined from Equation (8)). The z-value and the p-value are given for each statistical test carried out in Section 3.
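The test statistic itself is not reproduced in this section. As a hedged sketch, a pooled two-proportion z-test with both samples of size n and a one-sided p-value gives values close to those reported in Section 3; this choice of formula is an assumption, not a statement of the authors' exact procedure. The function name is hypothetical.

```python
from math import erfc, sqrt


def two_proportion_z_test(p1, p2, n):
    """One-sided two-sample z-test for p1 > p2, both observed on samples of size n.

    Assumes a pooled standard error; returns (z, p_value).
    """
    pooled = (p1 + p2) / 2
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    z = (p1 - p2) / se
    p_value = 0.5 * erfc(z / sqrt(2))   # upper tail of the standard normal
    return z, p_value


# Probabilities and sample sizes quoted in Section 3 (rounded values).
for label, p1, p2, n in [
    ("archaea, T5 vs T15", 0.0366, 0.0339, 10921),
    ("plasmids, T5 vs T21", 0.0393, 0.0343, 144366),
    ("eukaryotes, T5 vs T22", 0.0423, 0.0382, 11401),
]:
    z, p = two_proportion_z_test(p1, p2, n)
    print(f"{label}: z = {z:.2f}, p = {p:.2g}")
```

Because the inputs are rounded percentages, the computed z-values can differ from the published ones in the second decimal.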

Explained Example of the Statistical Approach Developed
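As an example, we explain the definition of the occurrence probability $Pb(T, K)$ (Equation (9)) which takes the value of 6.1% (see Table 2) with the CP trinucleotide set $T_1 = (T_1^0, T_1^1, T_1^2)$ where $T_1^0 = \{t, C(t)\} = \{AAC, GTT\}$ in frame 0, $T_1^1 = \mathcal{P}(T_1^0) = \{\mathcal{P}(t), \mathcal{P}(C(t))\} = \{ACA, TTG\}$ in frame 1 and $T_1^2 = \mathcal{P}^2(T_1^0) = \{\mathcal{P}^2(t), \mathcal{P}^2(C(t))\} = \{CAA, TGT\}$ in frame 2 in the gene kingdom of bacteria K = B (Table 1).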
The 3 × 64 = 192 occurrence frequencies $Pr_f(t, g)$ of the 64 trinucleotides t are computed in the three frames f of each gene g belonging to B. Then, the preferential frame $F(t, g)$ of each trinucleotide t for each gene g in B is determined according to Equation (5). For example, with the trinucleotide t = AAC in a gene $g_1$ of B, if the frequency $Pr_0(AAC, g_1)$ of AAC in frame f = 0 (reading frame) is greater than the two frequencies $Pr_1(AAC, g_1)$ and $Pr_2(AAC, g_1)$ of AAC in frames f = 1 and f = 2, i.e., $Pr_0(AAC, g_1) > \max\{Pr_1(AAC, g_1), Pr_2(AAC, g_1)\}$, then the preferential frame of AAC in $g_1$ is 0, i.e., $F(AAC, g_1) = 0$.
The indicator function $\delta_f(F(t, g))$ of each trinucleotide t for each gene g in B is obtained from Equation (6). With the previous example of AAC in the gene $g_1$ of B, the indicator function is equal to $\delta_0(F(AAC, g_1)) = 1$ for the frame f = 0 and $\delta_1(F(AAC, g_1)) = \delta_2(F(AAC, g_1)) = 0$ for the frames f = 1 and f = 2.
The number $Nb_f(t, B)$ of preferential frames of each trinucleotide t for each frame f in B is computed according to Equation (7). With the previous example of AAC in B, the following numbers are obtained: $Nb_0(AAC, B) = 3486$ for the frame f = 0, $Nb_1(AAC, B) = 1742$ for the frame f = 1, and $Nb_2(AAC, B) = 1819$ for the frame f = 2. Thus, the preferential frame of AAC in B is 0.
The occurrence number Nb(T, B) of the 30 CP trinucleotide sets

Maximal C3 Self-Complementary Circular Code X in Genes
This new statistical approach will show that the same set X of 20 trinucleotides among $\binom{30}{10} = 30{,}045{,}015$ sets occurs preferentially in genes (reading frame) of bacteria B, archaea A, plasmids P, eukaryotes E, and viruses V. This set X is the maximal $C^3$ self-complementary circular code defined in Equation (1).

Circular Code X in Genes of Archaea
In the genes of archaea A, the eight CP trinucleotide sets $T_1, T_2, T_4, T_6, \ldots, T_{10} \in X$ (except $T_3$ and $T_5$) have occurrence probabilities $Pb(T, A)$ with the eight highest ranks $Rk(T, A)$ among 30 (Table 2). The highest rank with $Pb(T_8, A) = 9.7\%$ is also related to the complementary pair $\{GAC, GTC\} \in X$. The CP set $T_{22} \notin X$ with $Rk(T_{22}, A) = 9$ explains that the two complementary trinucleotides $\{t, C(t)\} = \{ACC, GGT\} \in X$ ($T_3$) do not occur preferentially in A. As the CP set $T_5 \in X$ has a rank $Rk(T_5, A) = 13$ with $Pb(T_5, A) = 3.66\%$, higher than $Rk(T_{15}, A) = 14$ with $Pb(T_{15}, A) = 3.39\%$ and $Rk(T_{28}, A) = 15$ with $Pb(T_{28}, A) = 2.95\%$, the two complementary trinucleotides $\{t, C(t)\} = \{CAG, CTG\} \in X$ occur preferentially in A compared to {AGC, GCT} ($T_{15}$) and {GCA, TGC} ($T_{28}$). However, the statistical significance between the ranks $Rk(T_5, A)$ and $Rk(T_{15}, A)$ is not confirmed due to the lack of archaeal gene data (see Section 2.2) ($n = \sum_{i=1}^{30} Nb(T_i, A) = 10921$, z-value = 1.08, p-value = 0.14). Thus, a subset of X of 18 trinucleotides (a non-maximal $C^3$ self-complementary circular code) is identified in the genes of archaea: $X_A = X \setminus \{ACC, GGT\}$. Note that the code $X_A \cup \{CAC, GTG\}$ ($T_{22}$) is the variant X code observed in Deinococcus [1]. The circular code X retrieved in the genes of archaea is a new result which was not found in a study of variant X codes in archaeal genomes [15].

Circular Code X in Genes of Plasmids
In the genes of plasmids P, the 10 CP trinucleotide sets $T_1, \ldots, T_{10} \in X$ have occurrence probabilities $Pb(T, P)$ with the 10 highest ranks $Rk(T, P)$ among 30 (Table 2). The highest rank with $Pb(T_8, P) = 7.8\%$ is again related to the complementary pair $\{GAC, GTC\} \in X$. The 10th rank with $Pb(T_5, P) = 3.93\%$ is very significantly greater than the 11th rank with $Pb(T_{21}, P) = 3.43\%$ ($n = \sum_{i=1}^{30} Nb(T_i, P) = 144366$, z-value = 7.14, p-value = $10^{-13}$). The 20 trinucleotides of the circular code X are identified in the genes of plasmids: $X_P = X$. The same result is obtained at the gene level and the gene population level [1].

Circular Code X in Genes of Eukaryotes
In the genes of eukaryotes E, the 10 CP trinucleotide sets $T_1, \ldots, T_{10} \in X$ have occurrence probabilities $Pb(T, E)$ with the 10 highest ranks $Rk(T, E)$ among 30 (Table 2). The highest rank with $Pb(T_8, E) = 9.0\%$ is again related to the complementary pair $\{GAC, GTC\} \in X$. The 10th rank with $Pb(T_5, E) = 4.23\%$ is greater than the 11th rank with $Pb(T_{22}, E) = 3.82\%$, with marginal statistical significance ($n = \sum_{i=1}^{30} Nb(T_i, E) = 11401$, z-value = 1.57, p-value = 0.06). The 20 trinucleotides of the circular code X are identified in the genes of eukaryotes: $X_E = X$. The same result is obtained at the gene level and the gene population level [1]. The subset $X_{E\,Homo\ sapiens} = X \setminus \{ACC, GCC, GGC, GGT\}$ of X of 16 trinucleotides in the genes of Homo sapiens identified at the gene level is also identical to the subset found at the gene population level [1].

Circular Code X in Genes of Eukaryotic Chromosomes
The statistical analysis in Section 3.1.4 takes the eukaryotic genome as the genetic information unit. Indeed, Equation (7) with g ∈ E is achieved with Card(E) = 190 eukaryotic genomes (see Table 1). We complete this classical approach by choosing the eukaryotic chromosome as the genetic information unit. Thus, Equation (7) with g ∈ E chr is performed with Card(E chr ) = 2979 eukaryotic chromosomes of Card(E) = 190 genomes (see Table 1).
In the genes of eukaryotic chromosomes $E_{chr}$, the 10 CP trinucleotide sets $T_1, \ldots, T_{10} \in X$ have occurrence probabilities $Pb(T, E_{chr})$ with the 10 highest ranks $Rk(T, E_{chr})$ among 30 (Table 2). The highest rank with $Pb(T_8, E_{chr}) = 9.1\%$ is again related to the complementary pair $\{GAC, GTC\} \in X$. The 10th rank with $Pb(T_3, E_{chr}) = 4.74\%$ is very significantly greater than the 11th rank with $Pb(T_{22}, E_{chr}) = 4.47\%$ ($n = \sum_{i=1}^{30} Nb(T_i, E_{chr}) = 179136$, z-value = 3.86, p-value = $10^{-5}$). The 20 trinucleotides of the circular code X are identified in the genes of eukaryotic chromosomes: $X_{E_{chr}} = X$. It is a new result which completes the statistical analysis of genes in eukaryotic genomes (Section 3.1.4).

Non-Maximal Circular Code X in Genes of Eukaryotic Organelles
The genes of eukaryotic organelles, i.e., mitochondria and chloroplasts, are investigated with this statistical approach. It should also be stressed that the available data have an order of magnitude very significantly lower than the other gene kingdoms studied (less than 1 million trinucleotides for each class of organelles, see Table 1). However, we can already observe some statistical trends with the trinucleotides in the preferential frame.

Non-Maximal Circular Code X in Genes of Mitochondria
Surprisingly, in the genes of mitochondria M, the four CP trinucleotide sets T 9 , T 7 , T 8 , T 3 ∈ X have occurrence probabilities Pb(T, M) with the four highest ranks Rk(T, M) among 30 ( Table 2). The CP set

Non-Maximal Circular Code X in Genes of Chloroplasts
In the genes of chloroplasts C, the highest occurrences of CP trinucleotide sets again belong to the circular code X. The three CP trinucleotide sets $T_2, T_9, T_3 \in X$ have occurrence probabilities $Pb(T, C)$ with the three highest ranks $Rk(T, C)$ among 30 (Table 2). The CP set $T_{13} \notin X$ with $Rk(T_{13}, C) = 4$ explains that the two complementary trinucleotides $\{GAC, GTC\} \in X$ ($T_8$) do not occur preferentially in C. The CP set $T_{28} \notin X$ with $Rk(T_{28}, C) = 5$ states that the two complementary trinucleotides $\{CAG, CTG\} \in X$ ($T_5$) do not occur preferentially in C. The CP set $T_{14} \notin X$ with $Rk(T_{14}, C) = 8$ implies that the two complementary trinucleotides $\{GTA, TAC\} \in X$ ($T_{10}$) do not occur preferentially in C. The CP set $T_{18} \notin X$ with $Rk(T_{18}, C) = 10$ explains that the two complementary trinucleotides $\{ATC, GAT\} \in X$ ($T_4$) do not occur preferentially in C. The CP set $T_{25} \notin X$ with $Rk(T_{25}, C) = 12$ implies that the two complementary trinucleotides $\{CTC, GAG\} \in X$ ($T_6$) do not occur preferentially in C. Thus, a subset of X of 10 trinucleotides (a non-maximal $C^3$ self-complementary circular code) is identified in the genes of chloroplasts C: $X_C = X \setminus \{ATC, CAG, CTC, CTG, GAC, GAG, GAT, GTA, GTC, TAC\}$.

Circular Code X in Genes of Viruses
In the genes of viruses V, the nine CP trinucleotide sets $T_1, \ldots, T_4, T_6, \ldots, T_{10} \in X$ (except $T_5$) have occurrence probabilities $Pb(T, V)$ with the nine highest ranks $Rk(T, V)$ among 30 (Table 2). The highest rank with $Pb(T_8, V) = 7.2\%$ is again related to the complementary pair $\{GAC, GTC\} \in X$. The CP set $T_{15} \notin X$ with $Rk(T_{15}, V) = 10$ explains that the two complementary trinucleotides $\{CAG, CTG\} \in X$ ($T_5$) do not occur preferentially in V. Thus, a subset of X of 18 trinucleotides (a non-maximal $C^3$ self-complementary circular code) is identified in the genes of viruses: $X_V = X \setminus \{CAG, CTG\}$. The statistical method of viral genes at the gene population level [1] could not decide between the two codes $X_{18} = X \setminus \{CAG, CTG\}$ and $X_{16} = X \setminus \{CAG, CTG, GTA, TAC\}$. The statistical analysis at the gene level confirms the code $X_V = X_{18}$ of 18 trinucleotides in the genes of viruses.

Circular Code X Found in DNA and RNA Genomes and in Double-Stranded and Single-Stranded Genomes
The self-complementary property of the circular code X has been related since 1996 to the complementary property of the DNA double helix. In order to deepen this idea, we searched with this statistical approach the circular code X in five important sub-classes of viral genes using either DNA genome or RNA genome, and either double-stranded genome or single-stranded genome, i.e., in the genes of double-stranded DNA viruses V dsDNA , double-stranded RNA viruses V dsRNA , single-stranded DNA viruses V ssDNA , single-stranded RNA viruses V ssRNA , and retro-transcribing viruses V rt .
In the genes of double-stranded DNA viruses $V_{dsDNA}$, the 10 CP trinucleotide sets $T_1, \ldots, T_{10} \in X$ have occurrence probabilities $Pb(T, V_{dsDNA})$ with the 10 highest ranks $Rk(T, V_{dsDNA})$ among 30 (Table 2). Thus, the circular code X is found in $V_{dsDNA}$: $X_{V_{dsDNA}} = X$.
In the genes of double-stranded RNA viruses $V_{dsRNA}$, single-stranded RNA viruses $V_{ssRNA}$, and retro-transcribing viruses $V_{rt}$, respectively, the nine CP trinucleotide sets $T_1, \ldots, T_4, T_6, \ldots, T_{10} \in X$ (except $T_5$) have occurrence probabilities $Pb(T, V_{dsRNA})$, $Pb(T, V_{ssRNA})$, and $Pb(T, V_{rt})$, respectively, with the nine highest ranks $Rk(T, V_{dsRNA})$, $Rk(T, V_{ssRNA})$, and $Rk(T, V_{rt})$, respectively, among 30 (Table 2). Note that the ranks $Rk(T, V_{dsRNA})$, $Rk(T, V_{ssRNA})$, and $Rk(T, V_{rt})$ for a given CP trinucleotide set are not identical (Table 2). Thus, by using the reasoning mentioned previously ($T_{15} \notin X$ with $Rk(T_{15}, V) > Rk(T_5, V)$ for V in $V_{dsRNA}$, $V_{ssRNA}$, and $V_{rt}$), a subset of X of 18 trinucleotides is observed in $V_{dsRNA}$, $V_{ssRNA}$, and $V_{rt}$: $X_{V_{dsRNA}} = X_{V_{ssRNA}} = X_{V_{rt}} = X \setminus \{CAG, CTG\}$.
In the genes of single-stranded DNA viruses $V_{ssDNA}$, the eight CP trinucleotide sets $T_1, \ldots, T_4, T_6, \ldots, T_9 \in X$ (except $T_5$ and $T_{10}$) have occurrence probabilities $Pb(T, V_{ssDNA})$ with the eight highest ranks $Rk(T, V_{ssDNA})$ among 30 (Table 2). Thus, by using the reasoning as previously mentioned ($T_{15} \notin X$ with $Rk(T_{15}, V_{ssDNA}) > Rk(T_5, V_{ssDNA})$ and $T_{14} \notin X$ with $Rk(T_{14}, V_{ssDNA}) > Rk(T_{10}, V_{ssDNA})$), a subset of X of 16 trinucleotides is observed in $V_{ssDNA}$: $X_{V_{ssDNA}} = X \setminus \{CAG, CTG, GTA, TAC\}$.
All these results show that the circular code X is found almost perfectly in DNA genomes, RNA genomes, double-stranded genomes, and single-stranded genomes. The very few exceptions, either the two trinucleotides {CAG, CTG} or the four trinucleotides {CAG, CTG, GTA, TAC} for one case, are related to the CP set or the two CP sets having the lowest occurrence among the 10 CP sets $T_1, \ldots, T_{10} \in X$.

Table 2. Identification of the maximal $C^3$ self-complementary trinucleotide circular code X in gene kingdoms K of bacteria B, archaea A, plasmids P, eukaryotes E, chromosomes of eukaryotes $E_{chr}$, mitochondria M, chloroplasts C, viruses V, and its five taxonomic groups: double-stranded DNA viruses $V_{dsDNA}$, double-stranded RNA viruses $V_{dsRNA}$, single-stranded DNA viruses $V_{ssDNA}$, single-stranded RNA viruses $V_{ssRNA}$, and retro-transcribing viruses $V_{rt}$ (Table 1). Occurrence probability $Pb(T, K)$ (%) of the 30 complementary and permutation (CP) trinucleotide sets $T = (T^0, T^1, T^2)$ with $T^0 = \{t, C(t)\}$ in frame 0, $T^1 = \mathcal{P}(T^0) = \{\mathcal{P}(t), \mathcal{P}(C(t))\}$ in frame 1, $T^2 = \mathcal{P}^2(T^0) = \{\mathcal{P}^2(t), \mathcal{P}^2(C(t))\}$ in frame 2, in a gene kingdom K computed according to Equation (9) and its rank $Rk(T, K)$, the 1st rank being associated to the highest value of $Pb(T, K)$ and the 30th rank, to the lowest value of $Pb(T, K)$. The 20 trinucleotides of the $C^3$ self-complementary circular code X are in bold, the 20 trinucleotides of the circular code $X_1 = \mathcal{P}(X)$ are in italics, and the 20 trinucleotides of the circular code $X_2 = \mathcal{P}^2(X)$ are both in bold and italics. The first 10 CP sets $T_1, \ldots, T_{10}$ belong to the circular code X ($T^0 = \{t, C(t)\} \in X$ with $T^1 = \mathcal{P}(T^0) \in \mathcal{P}(X) = X_1$ and $T^2 = \mathcal{P}^2(T^0) \in \mathcal{P}^2(X) = X_2$) and are in lexicographical order with respect to the trinucleotide $t \in X$ in bold, and the 20 remaining CP sets $T_{11}, \ldots, T_{30}$ are in lexicographical order with respect to the trinucleotide $t \in X_1$ in italics. The numbers in italics occurring with the CP sets $T_1, \ldots, T_{10}$ are associated with the two trinucleotides $T^0 = \{t, C(t)\}$ of X which do not occur preferentially in the gene kingdom.

Conclusions
The "universal" occurrence in genes of the same set X of 20 trinucleotides, which in addition has the mathematical property of being a circular code, must be confirmed by several statistical approaches and various gene data analyses at different levels: kingdom, taxonomic group, genome, and gene. All the previous approaches have studied and identified the circular code X at the gene population level (kingdom, taxonomic group, and genome) [1,2,15-17,21]. The statistical approach at the gene level developed here, for the first time since 1996, analyses the preferential occurrence of trinucleotides among the three frames of each gene. This new methodology allows all genes, i.e., of large and small lengths, to be considered with the same weight. As a consequence, the concept of circular code, in particular the reading frame retrieval, is directly associated with each gene. Thus, X motifs from the circular code X at different locations in a gene may assist the ribosome in maintaining and synchronizing the reading frame. The number, the cardinality, and the length of X motifs in genes may be associated with the length, the function, and the ancestry of genes. This line of research is currently under investigation.
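As an illustration of what "preferential occurrence among the three frames of each gene" can mean in practice, the following minimal sketch counts trinucleotides in the reading frame and in its two shifted frames for a single sequence. It is an assumption-laden toy and not the statistic of Equation (9): the function names `frame_counts` and `preferred_frame` are hypothetical, ties are simply reported as `None`, and the input is a short artificial repeat rather than a real gene.

```python
from collections import Counter
from typing import List, Optional


def frame_counts(gene: str) -> List[Counter]:
    """Count trinucleotides in frame 0 (reading frame) and in frames 1 and 2
    (reading frame shifted by one and two nucleotides in the 5'-3' direction)."""
    counts = []
    for shift in range(3):
        codons = (gene[i:i + 3] for i in range(shift, len(gene) - 2, 3))
        counts.append(Counter(codons))
    return counts


def preferred_frame(counts: List[Counter], trinucleotide: str) -> Optional[int]:
    """Return the frame in which the trinucleotide occurs most often in this
    sequence, or None when the maximum is shared by several frames."""
    occurrences = [c[trinucleotide] for c in counts]
    best = max(occurrences)
    if occurrences.count(best) > 1:
        return None
    return occurrences.index(best)


# Toy sequence (hypothetical, not a real gene): four copies of GAA.
toy = "GAAGAAGAAGAA"
toy_counts = frame_counts(toy)
for t in ("GAA", "AAG", "AGA", "CCC"):
    print(t, preferred_frame(toy_counts, t))
```

Because the decision is taken per sequence, a short gene and a long gene each contribute one observation, which is the sense in which all genes carry the same weight.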
At the gene level, the circular code X is strengthened in the genes of bacteria, eukaryotes, plasmids, and viruses, and is now also identified in the genes of archaea. In addition to eukaryotic genomes, it is also found in the genes of eukaryotic chromosomes. The genes of mitochondria and chloroplasts contain a subset of the circular code X. It should be stressed that some mitochondrial and chloroplast genes lack the stop codon and are excluded from this data acquisition. Such a statistical bias may prevent a proper detection of preferential frames for some trinucleotides in the genes of eukaryotic organelles. The circular code X is searched in the large class of $\binom{30}{10} = 30{,}045{,}015$ C3 self-complementary trinucleotide codes, which contains in particular the 216 maximal C3 self-complementary circular codes. Thus, for a basic order of magnitude, the probability of retrieving the same circular code X in four independent gene kingdoms (bacteria B, plasmids P, eukaryotes E, double-stranded DNA viruses $V_{dsDNA}$) is equal to $1/\binom{30}{10}^4 \approx 10^{-30}$.
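The order of magnitude quoted above can be checked with a few lines of arithmetic. The sketch assumes, as the text does, that a code is obtained by choosing 10 of the 30 CP sets and that the four kingdoms are independent; the variable names are illustrative.

```python
from math import comb

# Number of ways to choose 10 CP sets among 30.
n_codes = comb(30, 10)

# Probability of retrieving one given code in four independent kingdoms.
p_four_kingdoms = 1 / n_codes**4

print(f"{n_codes:,}")
print(f"{p_four_kingdoms:.2e}")
```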
In the genes of the bacterial, eukaryotic, and plasmid kingdoms, 14 of the 47 studied gene taxonomic groups (about 30%) have variant X codes [1], i.e., trinucleotide codes which differ from X. Seven variant X codes are identified. However, all have at least 16 trinucleotides of X. Two variant X codes, $X_A$ (according to the notation in [1]) in cyanobacteria and plasmids of cyanobacteria, and $X_D$ in birds, are self-complementary, without permuted trinucleotides, but are non-circular. Five variant X codes, $X_B$ in Deinococcus, plasmids of chloroflexi and Deinococcus, mammals, and kinetoplasts, $X_C$ in elusimicrobia and apicomplexans, $X_E$ in fishes, $X_F$ in insects, and $X_G$ in basidiomycetes and plasmids of spirochaetes, are C3 self-complementary circular. Thus, the two variant X codes $X_A$ and $X_D$ are not circular and do not belong to the set of the 216 maximal C3 self-complementary circular codes [2] having the strong mathematical structure of the dihedral group [20]. The reason could be related to the gene data or to a biological property which remains to be identified. All these variant X codes in the genes are identified at the taxonomic group level. However, as the circular code X is now also identified at the gene level, variant X codes may also be associated with genes belonging to the same genome but with different protein coding functions. This interesting and open problem should be investigated in the future.
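Whether a given trinucleotide code is circular can be tested computationally. The sketch below uses a criterion from the graph-theoretic literature on circular codes, which is not a method described in this paper: a trinucleotide code is circular if and only if the directed graph with edges b1 → b2b3 and b1b2 → b3, for every trinucleotide b1b2b3 of the code, has no cycle. The function name `is_circular` is illustrative, and the small non-circular examples are toy codes, not the variant codes $X_A$ or $X_D$, whose trinucleotides are not listed here.

```python
from typing import Dict, Iterable, Set

# The 20 trinucleotides of the circular code X.
X = {
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
}


def is_circular(code: Iterable[str]) -> bool:
    """Return True when the graph associated with the trinucleotide code is acyclic."""
    graph: Dict[str, Set[str]] = {}
    for t in code:
        graph.setdefault(t[0], set()).add(t[1:])    # edge b1 -> b2b3
        graph.setdefault(t[:2], set()).add(t[2])    # edge b1b2 -> b3

    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state: Dict[str, int] = {}

    def has_cycle(vertex: str) -> bool:
        """Depth-first search that reports a cycle when it meets an in-progress vertex."""
        state[vertex] = IN_PROGRESS
        for successor in graph.get(vertex, ()):
            s = state.get(successor, UNSEEN)
            if s == IN_PROGRESS:
                return True
            if s == UNSEEN and has_cycle(successor):
                return True
        state[vertex] = DONE
        return False

    return not any(
        state.get(v, UNSEEN) == UNSEEN and has_cycle(v) for v in list(graph)
    )


print(is_circular(X))                 # the code X
print(is_circular({"ACG", "CGA"}))    # toy code with two permuted trinucleotides
print(is_circular({"AAA"}))           # toy code with a periodic trinucleotide
```

Removing trinucleotides from a circular code only removes edges from the graph, so the subsets of X observed in some gene groups remain circular under this criterion.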
A probability measure of the reading frame retrieval (RFR) of each trinucleotide of X has been introduced in [22] and [23] (Section 2.2 and 1st row of Table 1). The RFR probability PrRFR of the circular code X, i.e., the average RFR probability of the 20 trinucleotides of X, is equal to PrRFR(X) = 82.5% (Result 5 in [22]; 1st row of Table 1 in [23]). This RFR measure can be applied to the non-maximal C3 self-complementary circular codes, precisely to the excluded trinucleotides $Y_A$ = {ACC, GGT} of archaea (Equation (11)) and to the corresponding excluded sets of mitochondria $Y_M$, chloroplasts $Y_C$, viruses $Y_V$, and single-stranded DNA viruses $Y_{V_{ssDNA}}$ (Equation (22)). The computation leads to PrRFR($Y_A$) = 69.0%, PrRFR($Y_M$) = 88.5%, PrRFR($Y_C$) = 87.1%, PrRFR($Y_V$) = 100.0%, and PrRFR($Y_{V_{ssDNA}}$) = 85.7%. Archaeal genes lack two trinucleotides of X which have the lowest RFR values. In contrast, mitochondrial, chloroplast, and viral genes lack trinucleotides of X with high RFR values. Thus, the genes in reduced genomes are more flexible in translation, allowing overlap coding by frameshifting, in agreement with [24] (and the references cited therein). However, it should be stressed that this result may vary as more gene data of eukaryotic organelles become available. The circular code X (20 trinucleotides), with the functions of reading frame retrieval and maintenance in regular RNA transcription, may also have, through its bijective transformation codes, the same functions in nucleotide exchanging RNA transcription in mitochondrial genes [23]. Indeed, as the mitochondrial gamma polymerase has bacterial origins (e.g., [25]), mitochondrial polymerization and its associated bijective transformations might use the circular code X. However, at the translational level, the ribosome might follow the non-maximal C3 self-complementary circular code $X_M$ observed in mitochondrial genes (Equation (15)). A similar explanation could be applied to the chloroplast genes, which also have bacterial origins (cyanobacteria).
By studying viral genes, the circular code X is found in DNA genomes, RNA genomes, double-stranded genomes, and single-stranded genomes. Thus, the reading frame retrieval property of X could operate for translating DNA and RNA genes, in particular for the "primitive" RNA genes. The C3 property of X could be involved in translating the two shifted frames in DNA and RNA genes, in particular for optimizing genomes of small size. The complementarity property of X is naturally associated with the double-stranded DNA and RNA genomes. It could also be used to pair single-stranded DNA genomes with each other and single-stranded RNA genomes with each other. Thus, the C3 and complementarity properties of X could be involved in translating the three frames (reading frame and its two shifted frames) in one strand and the three frames in the complementary strand of DNA and RNA genes.
In summary, this new statistical approach at the gene level, applied to massive gene data, identifies the maximal C3 self-complementary trinucleotide circular code X in the genes of bacteria, archaea, eukaryotes, plasmids, and viruses, a code which may be involved in translation coding [3].