Enrichment of Circular Code Motifs in the Genes of the Yeast Saccharomyces cerevisiae

A set X of 20 trinucleotides has been found to have the highest average occurrence in the reading frame, compared to the two shifted frames, of genes of bacteria, archaea, eukaryotes, plasmids and viruses. This set X has an interesting mathematical property, since X is a maximal C3 self-complementary trinucleotide circular code. Furthermore, any motif obtained from this circular code X has the capacity to retrieve, maintain and synchronize the original (reading) frame. Since 1996, the theory of circular codes in genes has mainly been developed by analysing the properties of the 20 trinucleotides of X, using combinatorics and statistical approaches. For the first time, we test this theory by analysing the X motifs, i.e., motifs from the circular code X, in the complete genome of the yeast Saccharomyces cerevisiae. Several properties of X motifs are identified by basic statistics (at the frequency level), and evaluated by comparison to R motifs, i.e., random motifs generated from 30 different random codes R. We first show that the frequency of X motifs is significantly greater than that of R motifs in the genome of S. cerevisiae. We then verify that no significant difference is observed between the frequencies of X and R motifs in the non-coding regions of S. cerevisiae, but that the occurrence number of X motifs is significantly higher than R motifs in the genes (protein-coding regions). This property is true for all cardinalities of X motifs (from 4 to 20) and for all 16 chromosomes. We further investigate the distribution of X motifs in the three frames of S. cerevisiae genes and show that they occur more frequently in the reading frame, regardless of their cardinality or their length. Finally, the ratio of X genes, i.e., genes with at least one X motif, to non-X genes, in the set of verified genes is significantly different to that observed in the set of putative or dubious genes with no experimental evidence. 
These results, taken together, represent the first evidence for a significant enrichment of X motifs in the genes of an extant organism. They raise two hypotheses: the X motifs may be evolutionary relics of the primitive codes used for translation, or they may continue to play a functional role in the complex processes of genome decoding and protein synthesis.

Example 5. The genetic code is (obviously) not circular.
We briefly recall the method used to determine whether a code is circular, following the most recent and powerful approach, which associates an oriented (directed) graph with a trinucleotide code.
Definition 7. [8]. Let $X \subseteq B^3$ be a trinucleotide code. The directed graph $G(X) = (V(X), E(X))$ associated with X has a finite set of vertices (nodes) V(X) and a finite set of oriented edges E(X) (ordered pairs [v, w] where v, w ∈ V(X)) defined as follows: $V(X) = \{N_1, N_3, N_1N_2, N_2N_3 : N_1N_2N_3 \in X\}$ and $E(X) = \{[N_1, N_2N_3], [N_1N_2, N_3] : N_1N_2N_3 \in X\}$.
The theorem below relates the circularity of a trinucleotide code to its associated graph.
Theorem 1. [8]. Let $X \subseteq B^3$ be a trinucleotide code. The following statements are equivalent:
(i) The code X is circular.
(ii) The graph G(X) is acyclic.
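Theorem 1 turns the circularity test into a cycle search on a small graph. The Python sketch below is illustrative only. It assumes the usual construction from [8], in which each trinucleotide N1N2N3 contributes the edges N1 → N2N3 and N1N2 → N3, and it assumes that the circular code X (1) is the 20-trinucleotide set reported by Arquès and Michel in 1996. The names `code_graph` and `is_circular` are hypothetical and are not taken from the authors' software.

```python
from itertools import product

# Assumed content of the circular code X (1): the 20 trinucleotides
# reported by Arques and Michel (1996).
X = {
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
}

# The genetic code of Example 5, taken here as all 64 trinucleotides.
GENETIC_CODE = {"".join(p) for p in product("ACGT", repeat=3)}


def code_graph(code):
    """Adjacency dict of G(code): edges N1 -> N2N3 and N1N2 -> N3."""
    edges = {}
    for t in code:
        edges.setdefault(t[0], set()).add(t[1:])
        edges.setdefault(t[:2], set()).add(t[2])
    return edges


def is_circular(code):
    """Return True if G(code) is acyclic (statement (ii) of Theorem 1)."""
    edges = code_graph(code)
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {}

    def has_cycle(v):
        colour[v] = GREY
        for w in edges.get(v, ()):
            state = colour.get(w, WHITE)
            if state == GREY:  # back edge: a cycle is closed
                return True
            if state == WHITE and has_cycle(w):
                return True
        colour[v] = BLACK
        return False

    return not any(
        colour.get(v, WHITE) == WHITE and has_cycle(v) for v in list(edges)
    )


print(is_circular(X), is_circular(GENETIC_CODE))
```

Under these assumptions the graph of X is acyclic, whereas the genetic code of Example 5 gives a cyclic graph (for instance, AAA alone yields the cycle A → AA → A).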

Definition 8.
A trinucleotide circular code $X \subseteq B^3$ is $C^3$ self-complementary if $X$, $X_1 = P(X)$ and $X_2 = P^2(X)$ are trinucleotide circular codes such that $X = C(X)$ (self-complementary), $C(X_1) = X_2$ and $C(X_2) = X_1$ ($X_1$ and $X_2$ are complementary).
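The set identities of Definition 8 can be checked mechanically. The sketch below assumes the standard maps used in the circular code literature, namely the circular permutation P(N1N2N3) = N2N3N1 and the complementarity map C taken as the reverse complement, since neither is defined in this section. It also assumes the same 20-trinucleotide set for X (1) as reported by Arquès and Michel (1996). It checks only the three set equalities and the disjointness of X, X1 and X2, not the circularity of X1 and X2.

```python
# Assumed content of the circular code X (1).
X = {
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def permute(t):
    """Circular permutation map P: N1N2N3 -> N2N3N1 (assumed definition)."""
    return t[1:] + t[0]


def complement(t):
    """Complementarity map C: reverse complement of a trinucleotide."""
    return t.translate(COMPLEMENT)[::-1]


def image(mapping, code):
    """Apply a trinucleotide map to every element of a code."""
    return {mapping(t) for t in code}


X1 = image(permute, X)   # P(X)
X2 = image(permute, X1)  # P^2(X)

checks = {
    "X = C(X)": image(complement, X) == X,
    "C(X1) = X2": image(complement, X1) == X2,
    "C(X2) = X1": image(complement, X2) == X1,
    "X, X1, X2 are disjoint and cover 60 trinucleotides": len(X | X1 | X2) == 60,
}
for label, ok in checks.items():
    print(label, ok)
```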

Definition of X Motifs and Random Motifs
Let an X motif m(X) be a sequence (word) constructed from the circular code X (1). Similarly, we define an R motif m(R) constructed from one of the random codes R given in Appendix A. In order to obtain a statistically significant distribution, a set of |R| = 30 random codes R is generated according to the properties of X, except its circularity property: (i) R has a cardinality equal to 20 trinucleotides; (ii) the total number of each nucleotide A, C, G and T in R is equal to 15 (note that 20 × 3 = 15 × 4); (iii) R has no stop trinucleotides {TAA, TAG, TGA} and no periodic trinucleotides {AAA, CCC, GGG, TTT}; (iv) R is not a circular code, i.e., its associated graph G(R) is cyclic (G(R) not shown).
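Properties (i) to (iv) can be satisfied by simple rejection sampling. The sketch below is one possible way to draw such codes and is not the authors' procedure, which is not described here. The function names, the seed and the choice of three codes are illustrative assumptions, and the circularity test assumes the graph construction of [8] (edges N1 → N2N3 and N1N2 → N3).

```python
import random
from itertools import product

STOPS = {"TAA", "TAG", "TGA"}
PERIODIC = {"AAA", "CCC", "GGG", "TTT"}
# The 57 trinucleotides allowed by property (iii), sorted for reproducibility.
ALLOWED = sorted(
    {"".join(p) for p in product("ACGT", repeat=3)} - STOPS - PERIODIC
)


def is_circular(code):
    """Acyclicity test: repeatedly delete edges that end in a sink."""
    edges = {(t[0], t[1:]) for t in code} | {(t[:2], t[2]) for t in code}
    while edges:
        sources = {v for v, _ in edges}
        kept = {(v, w) for v, w in edges if w in sources}
        if kept == edges:  # no edge ends in a sink, so a cycle remains
            return False
        edges = kept
    return True


def random_code(rng, max_tries=1_000_000):
    """Draw one code satisfying properties (i) to (iv) by rejection."""
    for _ in range(max_tries):
        code = rng.sample(ALLOWED, 20)            # (i) and (iii)
        letters = "".join(code)
        if any(letters.count(n) != 15 for n in "ACGT"):
            continue                              # (ii) not satisfied
        if is_circular(code):
            continue                              # (iv) not satisfied
        return set(code)
    raise RuntimeError("no code found; increase max_tries")


rng = random.Random(2017)  # arbitrary seed, for reproducibility only
codes = [random_code(rng) for _ in range(3)]
for code in codes:
    print(sorted(code))
```

Most draws are rejected by the composition constraint (ii), so the loop typically needs many iterations per accepted code. Each iteration is cheap.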
Each motif, m(X) or m(R), is characterized by its cardinality c in trinucleotides and its length l in trinucleotides.

Example 6.
For the convenience of the reader, we give an example of a motif m(X) = $m_1$ from the circular code X (1) in a sequence s: ...AAAGGTGCCGAAGCCCTGGAGGAAAAG... In s, there is an X motif $m_1$ = GGTGCCGAAGCCCTGGAGGAA of cardinality c = 5 trinucleotides {CTG, GAA, GAG, GCC, GGT} and length l = 7 trinucleotides. Note that this motif $m_1$ cannot be extended to the left or to the right in s because of the periodic trinucleotide AAA (left) and the trinucleotide AAG (right), neither of which belongs to X.
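Example 6 can be reproduced with a short scan that collects maximal runs of consecutive X trinucleotides in a given frame. This is a minimal sketch, not the Java program used in the study. The function name `find_motifs`, its parameters and the tuple layout are hypothetical, and the set X is assumed to be the 20 trinucleotides reported by Arquès and Michel (1996).

```python
# Assumed content of the circular code X (1).
X = {
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
}


def find_motifs(seq, code, frame=0, min_cardinality=4):
    """Return (start, motif, cardinality, length) for each maximal run of
    code trinucleotides read in the given frame of seq.

    start is a 0-based nucleotide position; cardinality and length are
    counted in trinucleotides.
    """
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    codons.append(None)  # sentinel closing a run that reaches the end
    motifs, run, run_start = [], [], 0
    for index, t in enumerate(codons):
        if t in code:
            if not run:
                run_start = frame + 3 * index
            run.append(t)
            continue
        if len(set(run)) >= min_cardinality:
            motifs.append((run_start, "".join(run), len(set(run)), len(run)))
        run = []
    return motifs


example = "AAAGGTGCCGAAGCCCTGGAGGAAAAG"
print(find_motifs(example, X))
# [(3, 'GGTGCCGAAGCCCTGGAGGAA', 5, 7)]
```

The run stops at AAA on the left and AAG on the right, which matches the cardinality c = 5 and length l = 7 given in the example. The two shifted frames of this short sequence contain no X trinucleotide.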
The fundamental property of a motif m(X) is the ability to retrieve, synchronize and maintain the reading frame. Indeed, a window of 13 nucleotides located anywhere in a sequence generated from the circular code X (1) is sufficient to retrieve the reading (correct, construction) frame of the sequence.
It is important to stress again that this window for retrieving the reading frame in a sequence can be located anywhere in the sequence, i.e., no other frame signal, including start and stop trinucleotides, is required to identify the reading frame.
Since a huge number of X motifs m(X) can be identified in a complete genome, we selected specific classes of X motifs, denoted m(X, c), where c = 4, ..., 20 is the cardinality in trinucleotides, with any length l ≥ c ≥ 4 in trinucleotides. Thus, we analyzed 17 classes of motifs m(X, c): m(X, 4), ..., m(X, 20). The minimal length l = 4 trinucleotides was chosen based on the requirement for 13 nucleotides in order to retrieve the reading frame. The motifs m(X, c) with cardinality c < 4 trinucleotides are excluded here because they are mostly associated with the "pure" trinucleotide repeats often found in non-coding regions of the genome [13].

Statistical Analysis of X Motifs in the Genome of S. cerevisiae
Let $N(X, c; K)$ be the occurrence number of the X motifs $m(X, c; K)$ in a sequence population $K \in \{C, CH, C_g, \overline{C_g}\}$, where K can be the entire genome of S. cerevisiae ($K = C$), one of its 16 chromosomes ($K = CH$), their genes ($K = C_g$) or their non-coding regions ($K = \overline{C_g}$). Similarly, we define $N(R, c; K)$ as the occurrence number of the R motifs $m(R, c; K)$ in K and $\overline{N}(R, c; K) = N(R, c; K)/|R|$ as the mean occurrence number of R motifs $m(R, c; K)$ of the |R| = 30 random codes R in K. An X motif or an R motif is considered to belong to a gene $C_g$ if at least one trinucleotide of the motif is located within the gene.

Statistical Analysis of X Motifs in the Three Frames of S. cerevisiae Genes
The X motifs in the three frames of genes $C_g$ of S. cerevisiae were analyzed according to two properties p: their cardinality c and their length l. Let $N(X, p, f; C_g)$ be the occurrence number of the X motifs $m(X, p, f; C_g)$ in the frame f = 0, 1, 2 of genes $C_g$. Note that for p = c, $\sum_{f} N(X, c, f; C_g) = N(X, c; C_g)$, $N(X, c; C_g)$ being defined in Section 2.3. We define the proportion $P(X, p, f; C_g)$ of the X motifs $m(X, p, f; C_g)$ in a frame f = 0, 1, 2 of genes $C_g$ as $P(X, p, f; C_g) = N(X, p, f; C_g)/N(X, p; C_g)$. Let $\overline{N}(R, p, f; C_g) = N(R, p, f; C_g)/|R|$ be the mean occurrence number of the R motifs $m(R, p, f; C_g)$ in a frame f = 0, 1, 2 of genes $C_g$. Similarly, we define the mean proportion $\overline{P}(R, p, f; C_g)$ of the R motifs $m(R, p, f; C_g)$ in a frame f = 0, 1, 2 of genes $C_g$ as $\overline{P}(R, p, f; C_g) = \overline{N}(R, p, f; C_g)/\overline{N}(R, p; C_g)$.

Statistical Analysis of S. cerevisiae Genes with X Motifs m(X)
A gene, called an X gene, is considered to have an X motif if at least one trinucleotide of the gene belongs to an X motif. Let $N(C_g; X, c)$ be the occurrence number of X genes $C_g$ of S. cerevisiae with X motifs $m(X, c; C_g)$. Similarly, we define $N(C_g; R, c)$ as the occurrence number of genes $C_g$ with R motifs $m(R, c; C_g)$ and $\overline{N}(C_g; R, c) = N(C_g; R, c)/|R|$ as the mean occurrence number of genes $C_g$ with R motifs $m(R, c; C_g)$ from the |R| = 30 random codes R.
As previously, we define the proportion $P(C_g; X, c)$ of X genes $C_g$ with X motifs $m(X, c; C_g)$ as $P(C_g; X, c) = N(C_g; X, c)/N(C_g)$, where $N(C_g; X, c)$ is the number of X genes (see above) and $N(C_g)$ is the total number of genes $C_g$ in C (given in Section 2.7). Similarly, we define the mean proportion $\overline{P}(C_g; R, c)$ of genes $C_g$ with R motifs $m(R, c; C_g)$ as $\overline{P}(C_g; R, c) = \overline{N}(C_g; R, c)/N(C_g)$.

Software Development
A program was developed in the Java language to identify X and R motifs in all 3 frames of an input nucleotide sequence [13]. The program takes optional parameters that define the minimum cardinality c (in trinucleotides) and the length l (in trinucleotides) of the X motifs searched, as well as the trinucleotides making up the X or R code. It returns a list of all X or R motifs identified within the sequence, including the motif sequence, length, cardinality and frame.

Genome S. cerevisiae
The reference genome C of S. cerevisiae strain S288C (version R64-2-1) and gene annotations were downloaded from Ensembl (http://www.ensembl.org/, June 2017). The genome contains 13,986,094 nucleotides and a total number of $N(C_g)$ = 6691 genes, whose coding regions represent 8,997,548 nucleotides (64.3% of the genome).
Gene annotations included the positions of all protein coding regions (or CDS for CoDing Sequence), with exons, introns, start codons and stop codons identified. Of the 6691 genes, 6407 genes have a single exon, while 284 genes have a more complex structure with multiple exons separated by

Results
The results presented below are based on basic statistics (elementary frequencies) and their biological significance is clear. In order to evaluate the statistical significance of the different results presented below, we chose an approach that involved comparing the results obtained for the X motifs with those obtained for random R motifs generated by 30 different random codes R (see Section 2.2 and Appendix A). This approach avoids the problems associated with defining statistical hypotheses about the nucleotide composition, the length and the random model of the different regions of the genome. The main disadvantage of our approach is the additional computational resources required to obtain the results for 30 different random codes.

Occurrence Number of X Motifs in the Genome of S. cerevisiae
In the genome of S. cerevisiae, 70,204 X motifs (from the circular code X (1)) and a mean number of 52,183 R motifs (from the 30 random codes R) are observed. The distributions of these X and R motifs according to their cardinality (trinucleotide composition) c are shown in Figure 2. The highest cardinality of the X motifs observed is c = 12 trinucleotides. Regardless of the cardinality c, Figure 2 shows that the occurrence number of X motifs is very significantly larger than the number of R motifs in S. cerevisiae. The distribution of the values obtained for the R motifs is indicated by boxplots representing the mean, the standard deviation and the Minimum-Maximum occurrence numbers. Very similar boxplots were obtained using the median and Q1-Q3 quartiles (statistical results not shown).
Based on this preliminary study, we then wanted to know whether the X motifs are uniformly distributed along the genome or enriched in functional regions, such as the genes.

Occurrence Number of X Motifs in the Non-Coding Regions of S. cerevisiae
In the non-coding regions of S. cerevisiae, 13,309 (19.0%) of the X motifs out of 70,204 and 12,936 (mean number) (24.8%) of the R motifs out of 52,183 are observed. The distributions of these X and R motifs according to the trinucleotide cardinality c are given in Figure 3. Regardless of the cardinality c, Figure 3 shows that there is no significant difference between the distributions of the X and R motifs in the non-coding regions of S. cerevisiae.
We conclude that the X motifs located in the non-coding regions are random occurrences and are probably not functional. Thus, the differences we observed at the genome level are undoubtedly due to differences in the genes. In the remaining sections of this article, we will concentrate on these important functional regions.

Occurrence Number of X Motifs in the Genes of S. cerevisiae
In the coding regions of the genes of S. cerevisiae, 56,895 (81.0%) of the X motifs out of 70,204 and 39,247 (mean number) (75.2%) of the R motifs out of 52,183 are identified. The distributions of these X and R motifs according to the trinucleotide cardinality c are given in Figure 4. As expected, important differences are observed in the occurrence numbers of X and R motifs, and this is true for all cardinalities from 4 to 12.
Life 2017, 7, 52

Figure 4. Occurrence numbers of X motifs and R motifs in the genes $C_g$ of S. cerevisiae. The abscissa shows the cardinality c = 4, ..., 12 in trinucleotides. The ordinate gives the occurrence numbers $N(X, c; C_g)$ and $\overline{N}(R, c; C_g)$ on a logarithmic scale.
Figure 4 suggests two properties of the X motifs affecting the retrieval of the reading frame in genes, which are represented in more detail in Figure 5. First, the ratio of X motifs to R motifs, i.e., $r(X, c; C_g) = N(X, c; C_g)/\overline{N}(R, c; C_g)$, increases with the trinucleotide cardinality (red curve in Figure 5). At first sight, this might suggest that the X motifs with large cardinalities are more important for retrieving the reading frame in genes. However, it should be noted that these X motifs are relatively rare (131 X motifs of cardinality c = 9, 10, 11, 12 trinucleotides) compared to the low cardinality X motifs (49,265 X motifs of cardinality c = 4 trinucleotides) (blue curve in Figure 5). Indeed, the second property shows that low cardinality X motifs are highly abundant, with about 10,000 more X motifs of cardinality c = 4 trinucleotides, for example, than expected by chance. It is important to remember that an X motif of cardinality c = 4 trinucleotides, i.e., of length l ≥ 4 trinucleotides, is sufficient to retrieve the reading frame (by definition of a circular code).
Furthermore, as shown in Figure 6, a significantly larger number of X motifs than R motifs is observed in the genes $CH_g$ of each of the 16 chromosomes CH of S. cerevisiae. This result is statistically significant. Indeed, under the null hypothesis that X motifs and R motifs are equally frequent, the probability that a point in the curve of Figure 6 associated with the X motifs is higher than the point associated with the R motifs is equal to 1/2. Then, the probability that the X motifs are more numerous than the R motifs in each of the 16 independent chromosomes is equal to $(1/2)^{16} \approx 1.5 \times 10^{-5}$. Finally, this result is independent of the length or coding gene density of the chromosomes.
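The chromosome argument is a one-sided sign test: each of the 16 chromosomes is treated as an independent trial in which X motifs outnumber R motifs with probability 1/2 under the null hypothesis. The sketch below computes that tail probability exactly. The function name `sign_test_p` is hypothetical.

```python
from fractions import Fraction
from math import comb


def sign_test_p(successes, trials):
    """Probability of at least `successes` wins in `trials` fair coin tosses."""
    favourable = sum(comb(trials, k) for k in range(successes, trials + 1))
    return Fraction(favourable, 2 ** trials)


# X motifs outnumber R motifs in all 16 chromosomes.
p = sign_test_p(16, 16)
print(p, float(p))
```

For 16 successes out of 16 the result is exactly 1/65536, i.e., about 1.5 × 10^-5, in line with the order of magnitude quoted in the text.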
Figure 5. Difference $\delta(X, c; C_g) = N(X, c; C_g) - \overline{N}(R, c; C_g)$ (blue, left) and ratio $r(X, c; C_g) = N(X, c; C_g)/\overline{N}(R, c; C_g)$ (red, right) of X motifs $m(X, c; C_g)$ and R motifs $m(R, c; C_g)$ in the genes $C_g$ of S. cerevisiae (deduced from Figure 4). The abscissa shows the cardinality c = 4, ..., 12 in trinucleotides. The ordinate gives the values of δ and r.
Table 1 lists the longest X motifs in the genes of S. cerevisiae, i.e., those of length greater than 100 nucleotides. Surprisingly, these X motifs exhibit two fundamentally different structures. The first class consists of X motifs containing a sequence of a repeated trinucleotide $(N_1N_2N_3)^n$, e.g., $m_6$ with a trinucleotide repeated 20 times, precisely $(ATC)^{20}$. The second class includes X motifs with no repeated trinucleotide (n = 1), e.g., $m_8$ with 34 trinucleotides not repeated. An intermediary class is composed of X motifs between these two extremes, e.g., $m_1$ is composed of a series of different short trinucleotide repeats.
Table 1. Longest X motifs in the genes $C_g$ of S. cerevisiae. The 1st column gives the chromosome number; the 2nd, 3rd, 4th and 5th columns give the name, the start position, the end position and the nucleotide length, respectively, of the genes containing the longest X motifs; the 6th, 7th and 8th columns give the start position, the end position and the nucleotide length, respectively, of the longest X motifs; and the 9th column gives the sequence of the longest X motifs.
In the next section, we describe a more in-depth statistical analysis of X motifs in genes relative to their frames: the reading frame 0 and its two shifted frames 1 and 2.

Occurrence Number of X Motifs in the Three Frames of S. cerevisiae Genes
The 56,895 X motifs and the 39,247 R motifs (mean number) in the S. cerevisiae genes $C_g$ are analyzed according to their three frames (Figure 7).
First, if we consider the case of the R motifs, as expected their frequency is close to the random case of 1/3 in each frame of genes (one chance out of 3 to retrieve the reading frame). The observed frequency of R motifs in frame 2 is less than 1/3, which is related to two facts: (i) there are more stop trinucleotides in frame 2 than in frame 1 (Table 2); and (ii) the R motifs do not contain stop trinucleotides by construction (see Section 2.2). Indeed, among the 430,286 stop trinucleotides in the S. cerevisiae genes, 185,800 are located in frame 1 and 244,486 are located in frame 2.
In contrast, the X motifs present a non-random distribution, with 63% located in frame 0 (reading frame) of the genes (63% being also the average frequency of X motifs for all cardinalities in frame 0 in Figure 7). Again, we found the same correlation as that described in Section 3.1.2 (see Figure 5), namely that the effect is more pronounced for X motifs with large cardinalities. However, it is important to remember that the X motifs of low cardinalities are much more abundant. Again in contrast to the R motifs, the X motifs occur preferentially in frame 2 compared to frame 1, with a significant difference of about 10%. Indeed, the observed average probability difference between the X motifs in frame 2 and the X motifs in frame 1 is equal to
$$P(X, 2; C_g) - P(X, 1; C_g) = \frac{\sum_{c \geq 4} \left[ \left( P(X, c, 2; C_g) - P(X, c, 1; C_g) \right) \left( N(X, c, 1; C_g) + N(X, c, 2; C_g) \right) \right]}{\sum_{c \geq 4} \left( N(X, c, 1; C_g) + N(X, c, 2; C_g) \right)} = 10.0\%$$
where $P(X, c, f; C_g)$ and $N(X, c, f; C_g)$ with the frame f = 1, 2 are defined in Section 2.4.
This result is in agreement with the circular code theory. Indeed, a simple probabilistic model based on the independent occurrence of trinucleotides in reading frame 0 can estimate the real probabilities of the three circular codes $X$, $X_1$ and $X_2$ (Definition 8) observed in the shifted frames 1 and 2. The estimated probabilities of X in frames 2 and 1 of eukaryotic genes, equal to 29.4% and 25.5%, respectively, agree to within 0.1 percentage point with the corresponding probabilities in real sequences, which are equal to 29.4% and 25.6%, respectively (Table 5b in [2]). This frequency asymmetry of the circular code X in frames 1 and 2 has been related to the frequency asymmetry of the circular codes $X_1$ and $X_2$ in frame 0. Indeed, in frame 0 of eukaryotic genes, the frequencies of the circular codes $X_1$ and $X_2$ are equal to 39.0% and 28.9%, respectively (Table 5b in [2]).
Since the frame 0 has no stop trinucleotides, the theoretical occurrence probability of the circular code $X$, with 20 trinucleotides, is equal to 20/64 = 31.25%. Similarly, the occurrence probability of the circular code $X_1$ (20 trinucleotides with one stop trinucleotide, TAG) is equal to 19/64 = 29.69%, and the occurrence probability of the circular code $X_2$ (20 trinucleotides with two stop trinucleotides, TAA and TGA) is equal to 18/64 = 28.13%. Thus, the probability difference between the two circular codes $X_1$ and $X_2$ is equal to 1/64 = 1.56%. We conclude that the frequency asymmetry of $X_1$ and $X_2$ in frame 0 cannot be explained solely by the presence of stop trinucleotides.
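The three theoretical probabilities follow from counting the stop trinucleotides in each code. The sketch below reproduces this count under two assumptions that are not stated in this section: X (1) is the 20-trinucleotide set reported by Arquès and Michel (1996), and the permutation map is P(N1N2N3) = N2N3N1. The helper names are hypothetical.

```python
from fractions import Fraction

STOPS = {"TAA", "TAG", "TGA"}

# Assumed content of the circular code X (1).
X = {
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
}


def shift(t, k):
    """Circular permutation P^k; for k = 1, N1N2N3 -> N2N3N1."""
    return t[k:] + t[:k]


X1 = {shift(t, 1) for t in X}  # P(X)
X2 = {shift(t, 2) for t in X}  # P^2(X)


def frame0_probability(code):
    """Share of the 64 trinucleotides that are in `code` and are not stops."""
    return Fraction(len(code - STOPS), 64)


for name, code in (("X", X), ("X1", X1), ("X2", X2)):
    p = frame0_probability(code)
    print(name, sorted(code & STOPS), p, float(p))
```

The difference between the values obtained for X1 and X2 is 1/64, far smaller than the gap between the observed frequencies of 39.0% and 28.9% quoted above, which is the point made in the text.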
Although this frequency asymmetry of $X_1$ and $X_2$ has been identified in eukaryotic genes ([14], Figure 2 and Section 2.2; [15], Section 1.2.2) and prokaryotic genes ([16], Section 3.1.2), it has no biological explanation so far. However, it can explain the frequency asymmetry of the code X in frames 1 and 2. Thus, there is a strong correlation between the theoretical results of the three circular codes $X$, $X_1$ and $X_2$ in genes, i.e., three sets of 20 trinucleotides, described in the previous work and the results observed here with the circular code motifs. In the same way that the frequency asymmetry of $X_1$ and $X_2$ in frame 0 of genes is not explained from a biological point of view, the frequency asymmetry of X in frames 1 and 2 of genes is also not explained.
The same results are observed by analyzing the distribution of the 56,895 X motifs and the 39,247 R motifs in the S. cerevisiae genes as a function of their lengths (Figure 8). Note that we did not observe R motifs of length strictly greater than 10 trinucleotides.
A similar average probability difference between the X motifs in frames 2 and 1 is obtained as a function of their length l:
$$P(X, 2; C_g) - P(X, 1; C_g) = \frac{\sum_{l \geq 4} \left[ \left( P(X, l, 2; C_g) - P(X, l, 1; C_g) \right) \left( N(X, l, 1; C_g) + N(X, l, 2; C_g) \right) \right]}{\sum_{l \geq 4} \left( N(X, l, 1; C_g) + N(X, l, 2; C_g) \right)} = 10.5\%$$
where $P(X, l, f; C_g)$ and $N(X, l, f; C_g)$ with the frame f = 1, 2 are defined in Section 2.4.
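The average difference used for both the cardinality and the length analyses is a mean of the per-class differences P(X, p, 2) − P(X, p, 1), weighted by the number of motifs of each class found in frames 1 and 2. The sketch below implements that weighting. The function name and the input numbers are purely illustrative and are not the values measured in S. cerevisiae.

```python
def weighted_frame_difference(p1, p2, n1, n2):
    """Weighted mean of p2[k] - p1[k] with weights n1[k] + n2[k].

    p1, p2: proportions of motifs of class k (a cardinality or a length)
            in frames 1 and 2.
    n1, n2: occurrence numbers of motifs of class k in frames 1 and 2.
    """
    keys = list(p1)
    numerator = sum((p2[k] - p1[k]) * (n1[k] + n2[k]) for k in keys)
    denominator = sum(n1[k] + n2[k] for k in keys)
    return numerator / denominator


# Hypothetical inputs for two classes (k = 4 and k = 5), for illustration only.
p1 = {4: 0.20, 5: 0.10}
p2 = {4: 0.30, 5: 0.30}
n1 = {4: 100, 5: 10}
n2 = {4: 150, 5: 30}
print(weighted_frame_difference(p1, p2, n1, n2))
```

Because the weights are occurrence numbers, abundant low-cardinality classes dominate the average, which is why the rare high-cardinality motifs have little influence on the reported 10.0% and 10.5% values.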

Identification of S. cerevisiae X Genes
In the following, we define an X gene to be a gene containing at least one X motif of cardinality c ≥ 4 trinucleotides in any frame. A non-X gene is a gene with no X motif of cardinality c ≥ 4 trinucleotides in any frame. In the genome of S. cerevisiae, 6175 genes out of 6691 contain X motifs (92.3%), while 516 genes do not contain X motifs (7.7%). The number of X motifs per gene varies from a single X motif, up to the gene "huge dynein-related AAA-type ATPase (midasin)" of length 14,732 nucleotides containing a series of 67 X motifs. Figure 9 shows the distributions of the X genes and non-X genes according to their lengths. The proportion of X genes increases with their length. Indeed, more than 50% of the genes of length >200 nucleotides and more than 90% of the genes of length >500 nucleotides are X genes. Nevertheless, an anomaly is observed for genes of length 1300-1399, where 27 out of the 266 genes (i.e., 10.2%) are not X genes. A functional analysis showed that these 27 non-X genes are in fact retrotransposons of viral origin.
This observation led us to perform a more detailed study of the functional annotations associated with the S. cerevisiae genes, as shown in Table 3. In the SGD database, 5383 genes have a status of "Verified", meaning that experimental evidence exists and that a gene product is produced in S. cerevisiae; 546 genes have a status of "Uncharacterized", implying that they are likely to encode expressed proteins, as suggested by the existence of orthologs in one or more other species, but for which there are no specific experimental data demonstrating that a gene product is produced in S. cerevisiae; 673 genes have a "Dubious" status, meaning that they are unlikely to encode an expressed protein. Dubious genes may meet some or all of the following criteria: (i) the gene is not conserved in other Saccharomyces species; (ii) there is no well-controlled, small-scale, published experimental evidence that a gene product is produced; (iii) a phenotype caused by disruption of the gene can be ascribed to mutation of an overlapping gene; and (iv) the gene does not contain an intron. Finally, 89 genes are transposons, including any of the five classes (TY1 through TY5) of mobile genetic elements in yeast that contain long terminal repeats flanking a central epsilon element that encodes two gene products.

Figure 9. Proportion of X genes (blue) and non-X genes (brown) according to their nucleotide length in S. cerevisiae. An X gene is a gene containing at least one X motif of cardinality c ≥ 4 trinucleotides in any frame. A non-X gene is a gene with no X motif of cardinality c ≥ 4 trinucleotides in any frame. The abscissa shows the gene length in intervals of 100 nucleotides. The ordinate gives the percentage of genes.

Table 3. Numbers of X genes and non-X genes depending on the status of S. cerevisiae genes according to the SGD database. An X gene is a gene containing at least one X motif of cardinality c ≥ 4 trinucleotides in any frame. A non-X gene is a gene with no X motifs of cardinality c ≥ 4 trinucleotides in any frame. The columns ≥ 1 to ≥ 5 give the numbers of X genes with at least that number of X motifs. The total column represents the sum of X genes with ≥ 1 X motifs and the non-X genes, i.e., the number of S. cerevisiae genes in each category.

| Status | ≥ 1 | ≥ 2 | ≥ 3 | ≥ 4 | ≥ 5 | Non-X genes | Total |
|---|---|---|---|---|---|---|---|
| Verified genes | 5262 | 5082 | 4758 | 4388 | 4013 | 121 | 5383 |
| Uncharacterized genes | 449 | 348 | 266 | 221 | 174 | 97 | 546 |
| Dubious genes | 404 | 247 | 133 | 61 | 32 | 269 | 673 |
| Transposable elements | 60 | 60 | 60 | 59 | 59 | 29 | 89 |
| Total | 6175 | 5737 | 5217 | 4729 | 4278 | 516 | 6691 |

The proportion of X genes and non-X genes strongly depends on their status. For example, 97.8% of verified genes are X genes and 82.2% of uncharacterized genes are X genes, while only 60.0% of dubious genes are X genes, in agreement with the experimental evidence available.

Thus, the presence or absence of X motifs in a gene is an important and new factor in the classification of genes as functional or not, as shown by the following conditional probabilities deduced from Table 3:

P(Non-verified genes | Non-X genes) = (97 + 269 + 29)/516 = 395/516 = 76.6%
P(Verified genes | Non-X genes) = 121/516 = 23.4%
P(Verified genes | X genes with ≥ 1 X motifs) = 5262/6175 = 85.2%
P(Verified genes | X genes with ≥ 2 X motifs) = 5082/5737 = 88.6%
P(Verified genes | X genes with ≥ 3 X motifs) = 4758/5217 = 91.2%
P(Verified genes | X genes with ≥ 4 X motifs) = 4388/4729 = 92.8%
P(Verified genes | X genes with ≥ 5 X motifs) = 4013/4278 = 93.8%

the non-verified genes being the uncharacterized and dubious genes, and the transposable elements.

Clearly, the probability of verified genes in the set of genes with ≥ n X motifs increases as n increases. However, the biggest difference in conditional probabilities of verified genes is observed for genes with no X motifs compared to genes with ≥ 1 X motifs, and therefore we retain our definition of an X gene as a gene containing at least one X motif in the remainder of this article.
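The conditional probabilities deduced from Table 3 can be recomputed directly from the counts in the table. The following Python sketch is only an arithmetic check; the dictionary layout and the function name are hypothetical.

```python
# Counts from Table 3: ([genes with >= 1..5 X motifs], non-X genes, total).
table3 = {
    "Verified": ([5262, 5082, 4758, 4388, 4013], 121, 5383),
    "Uncharacterized": ([449, 348, 266, 221, 174], 97, 546),
    "Dubious": ([404, 247, 133, 61, 32], 269, 673),
    "Transposable elements": ([60, 60, 60, 59, 59], 29, 89),
}

def p_verified_given_at_least(n):
    """P(Verified genes | genes with >= n X motifs), for n = 1..5."""
    column_total = sum(row[0][n - 1] for row in table3.values())
    return table3["Verified"][0][n - 1] / column_total

# Probabilities conditioned on the absence of X motifs.
non_x_total = sum(row[1] for row in table3.values())
p_verified_given_non_x = table3["Verified"][1] / non_x_total
p_non_verified_given_non_x = 1 - p_verified_given_non_x

# Proportion of X genes within each status (>= 1 X motif over the row total).
x_gene_share = {status: row[0][0] / row[2] for status, row in table3.items()}

print(f"P(Non-verified | Non-X) = {100 * p_non_verified_given_non_x:.1f}%")
print(f"P(Verified | Non-X) = {100 * p_verified_given_non_x:.1f}%")
for n in range(1, 6):
    print(f"P(Verified | >= {n} X motifs) = {100 * p_verified_given_at_least(n):.1f}%")
```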

Trinucleotide Composition in the X Motifs of S. cerevisiae Genes
We compared the trinucleotide composition of the 5262 S. cerevisiae verified X genes with the composition of the X motifs in frame 0 of these genes (Table 4) and found that they are highly similar (correlation coefficient r = 0.99). As the length of the 5262 S. cerevisiae verified X genes is 2,719,966 trinucleotides, the coverage of X genes by the X motifs is equal to 154,635/2,719,966 = 5.7%.

Conclusions
The theory of the circular code X in genes has been developed using a combinatorial approach since 1996. For the first time, we tested this theory by analysing the X motifs, i.e., motifs from this circular code X, in the complete genome of the yeast S. cerevisiae. This organism was chosen because it has been a "model" organism for many years, the genome is relatively small and compact, and the genes generally have a simple intron/exon structure.
The main result demonstrated is a significant enrichment of X motifs in the reading frame of genes of S. cerevisiae (see results in Sections 3.1 and 3.2). Furthermore, the statistical distribution of X motifs in the three frames of S. cerevisiae genes, in particular the preferential occurrence of X motifs in frame 2 compared to frame 1 (see results in Section 3.2), is in agreement with the circular code theory concerning the well-known frequency asymmetry of the circular codes X1 and X2 in prokaryotic and eukaryotic genes ([14], Figure 2 and Section 2.2; [15], Section 1.2.2; [16], Section 3.1.2).
The longest X motifs in the genes of S. cerevisiae are of length greater than 100 nucleotides. Surprisingly, these X motifs exhibit two fundamentally different structures (Table 1). The first class is exemplified by X motifs containing a sequence of a repeated trinucleotide (N1N2N3)^n, while the second class is represented by X motifs with no repeated trinucleotides (n = 1). An intermediary class is composed of X motifs between these two extremes, i.e., composed of a series of different short trinucleotide repeats. Half of the S. cerevisiae genes with very long X motifs have paralogues that arose from the whole genome duplication (WGD) event that occurred in an ancestor of S. cerevisiae ~100 million years ago [17], even though ~80% of the duplicated genes have since been lost [17]. Furthermore, the functional annotations found in the SGD database indicate that many of the genes with very long X motifs encode important physiological polypeptides that are involved in, for example, transport from the Golgi or chromatin modelling, or that are located in the mitochondria.
We have shown that the presence of X motifs in a potential open reading frame can be used to predict whether the gene is likely to encode a functional protein. Indeed, X motifs are found in 98% of verified genes, while only 60% of dubious genes contain X motifs (see results in Section 3.3). Additional parameters related to the genes themselves or the structure, the length and positions of X motifs may improve the prediction accuracy in the future.
The question remains whether the X motifs are simply the evolutionary relics of a primordial code that might have existed in the early stages of cellular life, or whether they represent functional elements of the complex genome decoding system in extant organisms.
There seems to be a consensus that the standard genetic code conserves vestiges of earlier, simpler codes, that may have been used to code fewer amino acids than the modern set of 20. Many examples of such ancient genetic codes have been proposed, including the codes RRY of size 8 [18] and RNY of size 16 [19,20] (R = {A, G}, Y = {C, T}, N = {A, C, G, T}), the codes GNC of size 4 and SNS of size 16 [21], and GHN of size 12 [22] (S = {C, G}, H = {A, C, T}), etc. All these codes are circular, with the exception of the SNS code (as, for example, CCC ∈ SNS). The codes RRY, RNY, GNC and GHN also belong to the more restrictive class of comma-free codes (longest path length l = 2 in their associated graphs G(RRY), G(RNY), G(GNC) and G(GHN), details in [23]). The code RRY is in addition strong comma-free (longest path length l = 1 in its associated graph G(RRY), details in [23]). The comma-free codes RRY and GHN are not self-complementary (as C(RRY) = RYY and C(GHN) = NDC with D = {A, G, T}), while the codes RNY and GNC are self-complementary (as C(RNY) = RNY and C(GNC) = GNC). The comma-free code RNY can be decomposed into two subcodes of size 8 each which are both strong comma-free and complementary to each other (Proposition 3.28 in [23]) and almost included in the circular code X (Table 3a in [3]). Today, the genetic code has become too complex to use strong comma-free codes and comma-free codes (in the sense of having strong error-detecting properties, i.e., recognizing a frameshift immediately), and therefore, we suggest that nature moved on to the weaker circular codes.
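The properties quoted above for these ancient codes can be checked with a short Python sketch. It assumes the graph construction used in the circular code literature (details in [23]), which is not restated in this article: each trinucleotide N1N2N3 contributes the two edges N1 → N2N3 and N1N2 → N3, a code is circular when this graph has no cycle, comma-free when the longest path has length at most 2, and strong comma-free when it has length at most 1. All function names are hypothetical.

```python
from itertools import product

# IUPAC-style letter classes used in the text.
SETS = {"R": "AG", "Y": "CT", "N": "ACGT", "S": "CG", "H": "ACT",
        "A": "A", "C": "C", "G": "G", "T": "T"}
COMP = str.maketrans("ACGT", "TGCA")

def expand(pattern):
    """All trinucleotides matching a 3-letter pattern such as 'RNY'."""
    return {"".join(t) for t in product(*(SETS[p] for p in pattern))}

def longest_path(code):
    """Longest path length in the graph of the code, or None if it has a cycle.
    Assumed graph: edges N1 -> N2N3 and N1N2 -> N3 for each trinucleotide."""
    edges = {}
    for t in code:
        edges.setdefault(t[0], set()).add(t[1:])
        edges.setdefault(t[:2], set()).add(t[2])
    depth, on_stack = {}, set()

    def visit(v):
        if v in depth:
            return depth[v]
        if v in on_stack:
            raise ValueError("cycle")
        on_stack.add(v)
        d = max((1 + visit(w) for w in edges.get(v, ())), default=0)
        on_stack.discard(v)
        depth[v] = d
        return d

    try:
        return max(visit(v) for v in list(edges))
    except ValueError:
        return None

def is_self_complementary(code):
    """True if the reverse complement of every trinucleotide is in the code."""
    return all(t.translate(COMP)[::-1] in code for t in code)

results = {}
for name in ("RRY", "RNY", "GNC", "SNS", "GHN"):
    code = expand(name)
    results[name] = (len(code), longest_path(code), is_self_complementary(code))
    print(name, results[name])
```

In this sketch a result of `None` for the longest path marks a non-circular code, as for SNS, whose trinucleotide CCC creates the cycle C → CC → C.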
Numerous hypotheses have been formulated concerning the evolution of the ancient genetic codes into the modern standard genetic code (reviewed in [24]). For example, several lines of evidence have been used to classify the standard 20 amino acids into 'early' and 'late' ones. Ten early amino acids (EAA) have been consistently identified in prebiotic chemistry experiments as well as in meteorites, in the following order of abundance: <Gly, Ala, Asp, Glu, Val, Ser, Ile, Leu, Pro, Thr> (reviewed in [24]). The ten late amino acids are entirely biogenic and were probably recruited into the code after the evolution of the respective biosynthetic pathways, possibly in complementary pairs. The circular code X encodes 12 amino acids, of which 8 are early amino acids (all of the ten except Ser and Pro). Furthermore, an (ordered) subcode X' of 10 trinucleotides among the 20 trinucleotides of X, X' = <{GGC, GGT}, GCC, {GAC, GAT}, GAG, GTC, ATC, CTC, ACC>, codes 8 (ordered) early amino acids of the ten: EAA = <Gly, Ala, Asp, Glu, Val, Ile, Leu, Thr>.
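The amino acid counts given above can be reproduced by translating the trinucleotides with the standard genetic code. The following Python sketch assumes the usual compact encoding of the standard code (bases ordered T, C, A, G) and the 20 trinucleotides of X as published in the circular code literature; the 10-trinucleotide subcode listed above is written `X_SUB`, and all variable names are hypothetical.

```python
# Standard genetic code in the compact TCAG ordering ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# The 20 trinucleotides of X and the ordered subcode of 10 trinucleotides.
X = ("AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC "
     "GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC").split()
X_SUB = "GGC GGT GCC GAC GAT GAG GTC ATC CTC ACC".split()

# Ten early amino acids in order of abundance:
# Gly, Ala, Asp, Glu, Val, Ser, Ile, Leu, Pro, Thr (one-letter codes).
EARLY = "GADEVSILPT"

amino_x = {CODON[t] for t in X}
early_in_x = "".join(a for a in EARLY if a in amino_x)
early_missing = "".join(a for a in EARLY if a not in amino_x)
# dict.fromkeys keeps the first occurrence of each amino acid, in order.
amino_sub = "".join(dict.fromkeys(CODON[t] for t in X_SUB))

print(len(amino_x), early_in_x, early_missing, amino_sub)
```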
The circular subcode of 10 trinucleotides given above, denoted X', is C3 self-complementary. This ancient code X' is not comma-free as the longest path length l = 4 > 2 in its associated graph G(X'). This result may suggest that the ancestral circular codes of X are also C3 self-complementary.
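These properties of the 10-trinucleotide subcode (written `X_SUB` below) can be checked computationally. The Python sketch assumes the graph construction of the circular code literature (details in [23]), which is not restated in this article: each trinucleotide N1N2N3 contributes the edges N1 → N2N3 and N1N2 → N3, and a code is circular when the graph is acyclic. It also assumes that the C3 property means that the code and its two circular permutations are all circular. Function names are hypothetical.

```python
X = ("AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC "
     "GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC").split()
X_SUB = "GGC GGT GCC GAC GAT GAG GTC ATC CTC ACC".split()
COMP = str.maketrans("ACGT", "TGCA")

def longest_path(code):
    """Longest path length in the graph of the code, or None if it has a cycle."""
    edges = {}
    for t in code:
        edges.setdefault(t[0], set()).add(t[1:])   # N1 -> N2N3
        edges.setdefault(t[:2], set()).add(t[2])   # N1N2 -> N3
    depth, on_stack = {}, set()

    def visit(v):
        if v in depth:
            return depth[v]
        if v in on_stack:
            raise ValueError("cycle")
        on_stack.add(v)
        d = max((1 + visit(w) for w in edges.get(v, ())), default=0)
        on_stack.discard(v)
        depth[v] = d
        return d

    try:
        return max(visit(v) for v in list(edges))
    except ValueError:
        return None

def shift(code, k):
    """Circular permutation of every trinucleotide by k positions."""
    return [t[k:] + t[:k] for t in code]

def is_c3(code):
    """Assumed C3 test: the code and its two circular permutations are circular."""
    return all(longest_path(shift(code, k)) is not None for k in range(3))

def is_self_complementary(code):
    return all(t.translate(COMP)[::-1] in code for t in code)

for name, code in (("X_SUB", X_SUB), ("X", X)):
    print(name, len(code), longest_path(code), is_c3(code), is_self_complementary(code))
```

The longest path lengths returned for the subcode and for X are the values of l that place them in the classes with l = 4 and l = 8, respectively.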
A model of the evolution of C3 self-complementary circular codes can be proposed (Figure 10). To classify these circular codes, we use the following abbreviation: a C3SCl code stands for a C3 Self-complementary Circular code of longest path length l ∈ {1, 2, 3, 4, 6, 8}, the values l = 5 and l = 7 being excluded (see Theorem 4.2 given for self-complementary circular codes in [25]). According to this model, the evolution of C3SCl codes is based on an increase in combinatorial flexibility (number of codes, cardinality of codes, nucleotide window length of reading frame retrieval), starting with the strong comma-free codes (C3SC1 codes) with the strongest error-detecting properties, then the comma-free codes (C3SC2 codes) with strong error-detecting properties, then the C3SC3, C3SC4 and C3SC6 codes with low error-detecting properties, up to the C3SC8 codes with the lowest error-detecting properties, such as the circular code X found in extant genes. Note that the 216 C3 self-complementary circular codes are the sum of the 56 C3SC4 codes plus the 56 C3SC6 codes plus the 104 C3SC8 codes. This combinatorial circular code evolution may also be associated with time evolution where strong comma-free codes and comma-free codes are more ancestral than circular codes. So, the circular subcode X' of 10 trinucleotides (a C3SC4 code) may be an intermediate between the ancient strong comma-free and comma-free codes (C3SC1 and C3SC2 codes), and the circular code X (C3SC8 code of cardinality 20 trinucleotides) in extant organisms.

The X motifs observed in the genes of S. cerevisiae may have retained a functional role in translation. Indeed, it has been observed previously that short X motifs have also been conserved in many transfer RNAs (tRNAs) and ribosomal RNAs (rRNAs) [26][27][28][29]. In particular, the universally conserved nucleotides A1492, A1493 and G530 in the ribosome decoding center are located in short X motifs. Understanding the pairing between the X motifs in genes and the short X motifs of the ribosome decoding center could shed light on the biological function of the circular code X in the genome decoding system of extant organisms. Furthermore, if X motifs do play a functional role, then mutations in these regions that lead to the loss of the X motif properties could have deleterious effects and may even be the cause of genetic diseases. In particular, long X motifs with repeats of certain trinucleotides could generate secondary structures that may be problematic in translation [30]. The effect of mutations in X motifs will be investigated in future work.
Author Contributions: All authors contributed equally to this work.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Random Codes
Thirty random codes R are generated according to the properties of the maximal C3 self-complementary trinucleotide circular code X (1), except its circularity property: (i) R has a cardinality equal to 20 trinucleotides; (ii) the total number of each nucleotide A, C, G and T in R is equal to 15; (iii) R has no stop trinucleotides {TAA, TAG, TGA} and no periodic trinucleotides {AAA, CCC, GGG, TTT}; (iv) R is not a circular code, i.e., its associated graph G(R) is cyclic (G(R) not shown).
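One possible way to draw codes satisfying properties (i) to (iv) is rejection sampling, sketched below in Python. This is an assumption made for illustration: the procedure actually used to generate the 30 codes R is not described here. The circularity test assumes the graph construction of the circular code literature (edges N1 → N2N3 and N1N2 → N3 for each trinucleotide N1N2N3, with a code being circular when the graph is acyclic). The function names and the seed are hypothetical.

```python
import random
from itertools import product

STOPS = {"TAA", "TAG", "TGA"}
PERIODIC = {"AAA", "CCC", "GGG", "TTT"}
# Property (iii): the 57 trinucleotides that are neither stop nor periodic.
ALLOWED = sorted({"".join(t) for t in product("ACGT", repeat=3)} - STOPS - PERIODIC)

def is_circular(code):
    """True if the assumed graph of the code (N1 -> N2N3, N1N2 -> N3) is acyclic."""
    edges = {}
    for t in code:
        edges.setdefault(t[0], set()).add(t[1:])
        edges.setdefault(t[:2], set()).add(t[2])
    state = {}  # 1 = currently being visited, 2 = fully explored

    def has_cycle(v):
        if state.get(v) == 1:
            return True
        if state.get(v) == 2:
            return False
        state[v] = 1
        found = any(has_cycle(w) for w in edges.get(v, ()))
        state[v] = 2
        return found

    return not any(has_cycle(v) for v in list(edges))

def random_code(rng):
    """Draw 20 distinct allowed trinucleotides until (ii) and (iv) hold."""
    while True:
        code = rng.sample(ALLOWED, 20)                      # property (i)
        letters = "".join(code)
        balanced = all(letters.count(n) == 15 for n in "ACGT")  # property (ii)
        if balanced and not is_circular(code):              # property (iv)
            return sorted(code)

rng = random.Random(0)  # fixed seed so that the sketch is reproducible
codes = [random_code(rng) for _ in range(30)]
print(codes[0])
```

Rejection sampling is simple but wasteful, since only a small fraction of the draws has exactly 15 occurrences of each nucleotide; for 30 codes this remains fast.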