Codon Distribution in Error-Detecting Circular Codes

In 1957, Francis Crick et al. suggested an ingenious explanation for the process of frame maintenance. The idea was based on the notion of comma-free codes. Although Crick’s hypothesis proved to be wrong, in 1996, Arquès and Michel discovered the existence of a weaker version of such codes in eukaryote and prokaryote genomes, namely the so-called circular codes. Since then, circular code theory has invariably evoked great interest and made significant progress. In this article, the codon distributions in maximal comma-free, maximal self-complementary C3 and maximal self-complementary circular codes are discussed, i.e., we investigate in how many of such codes a given codon participates. As the main (and surprising) result, it is shown that the codons can be separated into very few classes (three, or five, or six) with respect to their frequency. Moreover, the distribution classes can be hierarchically ordered as refinements from maximal comma-free codes via maximal self-complementary C3 codes to maximal self-complementary circular codes.


Introduction
The genetic code as it is today is a product of a long evolutionary process. It can be seen as a kind of dictionary that translates information from the world of nucleic acids into the world of proteins. As such, it is involved in the transmission of information, the translation process, and thus, plays an essential role in the process that defines the central dogma of molecular biology. During this process, degeneracy is one of the most conserved features of the genetic code. It can be postulated that this is for a good reason, since degeneracy is the fundamental ingredient in any error-detecting and error-correcting system [1,2]. Therefore, for example, the self-referential model for the formation of the code is based on an original regionalization of characters through the concerted superposition of the two components of the encodings. With this approach, the degeneracy of the genetic code and clusters of similar amino acids corresponding to similar triplets should be explained [3].
The preservation of the genetic information is impossible without an error-correcting system (this can even be proven by the methods of information theory) and cannot be guaranteed just by DNA replication. There are different hypotheses for how such error correction may happen. In [4], the so-called ambush hypothesis is examined. According to this hypothesis, off-frame stops terminate frameshifted translation, potentially decreasing energy and resource waste on nonfunctional proteins. Moreover, codons with more potential to form hidden stops (off-frame stops) have greater usage frequency and bias in their favor among synonymous codons. In [5], a model of an amino acid composed of a constant part and of a variable part is considered, and it was concluded that the kinetic energetic disturbance caused by the substitution of the variable part of an amino acid is minimized. In [6], error prevention and mitigation as forces in the evolution of genes and genomes are postulated.
In this article, we determine for each codon the number of maximal comma-free codes, maximal self-complementary $C^3$ codes and maximal self-complementary circular codes in which it appears; this is called the frequency class number of the codon with respect to a class of codes. After preparing definitions (see Section 2), the main results of the article are presented in Section 3. Surprisingly, it turns out that for each of the above classes of codes, there are very few frequency class numbers of codons with respect to it (see also [16,17] for the data). It is even more surprising that the frequency classes of codons for self-complementary $C^3$ codes are refinements of the classes for comma-free codes, and those for self-complementary circular codes are refinements of the corresponding classes for self-complementary $C^3$ codes. Thus, the number of different frequency classes of codons increases in parallel with the decrease of the error-detecting properties of the codes.
The fact that the frequency classes of the codons for maximal self-complementary $C^3$ codes are refinements of the classes for maximal comma-free codes is a strong hint that the maximal self-complementary $C^3$ codes used in the current genetic code could have evolved from the maximal comma-free codes, which were very likely used in earlier times. Since these two classes of codes are disjoint, there is no obvious mathematical reason for this refinement property.
Our results strengthen the supposition that the modern codes originated from ancient (self-complementary) comma-free codes (see [23,24]) and, as a consequence, support a weaker version of the theory of Crick et al.

Definitions and Notations
The genetic code is written with words of three letters, called codons, built over an alphabet $B := \{A, C, G, T\}$ of four letters, the nucleotide bases uracil (thymine), cytosine, adenine and guanine, in short U(T), C, A and G. In recent studies, e.g., [11,12,14,18,25], the structure of certain sub-codes of the genetic code that are assumed to play a role in nature was investigated. The first class of codes was suggested by Crick et al. in [7].

Definition 1.
A trinucleotide code $X \subseteq B^3$ is called comma-free if, for any two codons $x_1, x_2 \in X$, no sub-codon of the concatenation $x_1 x_2$ other than $x_1$ and $x_2$ themselves belongs to $X$. We will call a trinucleotide comma-free code $X$ maximal if it contains exactly 20 codons.
Being comma-free means that a frameshift of one or two bases is detected after reading three nucleotide bases (see Figure 1). We would like to mention at this point that our view of the frameshift problem is an information-theoretical one. In living cells, a frameshift is of course also "detected" because of the mistranslated protein product that is produced and its potential phenotypic consequences. Clearly, a comma-free code cannot contain the periodic codons AAA, CCC, GGG or TTT since, for example, a frameshift in a sequence of adenines could not be detected. Moreover, for any codon $B_1B_2B_3$ from a comma-free code, the shifted codons $B_2B_3B_1$ and $B_3B_1B_2$ cannot be in the same code. For instance, if ACG is in the code, then CGA and GAC must not be in the same code, because they appear in frameshifts 1 and 2 of the concatenation ACGACG. The two shift operations are commonly denoted by $\alpha_1$ and $\alpha_2$, i.e., $\alpha_1(B_1B_2B_3) = B_2B_3B_1$ and $\alpha_2(B_1B_2B_3) = B_3B_1B_2$ (see [12]). Thus, any comma-free code can contain at most one codon out of the three ACG, CGA and GAC, and similarly for any other codon. Hence, the maximal number of codons in a comma-free code is $20 = \frac{64-4}{3}$, as required in Definition 1 above.
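The shift operations and the comma-free condition of Definition 1 can be sketched in a few lines of Python. The function names `alpha`, `is_comma_free` and `conjugacy_classes` are illustrative choices and do not come from the article; the check simply tests frames 1 and 2 of every concatenation of two codons.

```python
from itertools import product

BASES = "ACGT"


def alpha(codon, k):
    """Circular shift by k positions: alpha(x, 1) = B2B3B1, alpha(x, 2) = B3B1B2."""
    return codon[k:] + codon[:k]


def is_comma_free(code):
    """Definition 1: no codon of the code occurs in frame 1 or 2 of any x1 x2."""
    code = set(code)
    for x1, x2 in product(code, repeat=2):
        word = x1 + x2
        if word[1:4] in code or word[2:5] in code:
            return False
    return True


def conjugacy_classes():
    """Group the 60 non-periodic codons into classes {x, alpha_1(x), alpha_2(x)}."""
    classes = set()
    for codon in map("".join, product(BASES, repeat=3)):
        if len(set(codon)) > 1:  # skip AAA, CCC, GGG, TTT
            classes.add(frozenset(alpha(codon, k) for k in range(3)))
    return classes


print(alpha("ACG", 1), alpha("ACG", 2))   # CGA GAC
print(is_comma_free({"ACG"}))             # True
print(is_comma_free({"ACG", "CGA"}))      # False
print(len(conjugacy_classes()))           # 20
```

The last line recovers the bound (64 − 4)/3 = 20: a comma-free code can take at most one codon from each of the 20 classes.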
Maximal comma-free codes have been completely classified by computer calculations, and it turned out that there are exactly 408 such codes (see [15]).
At this point, we would like to discuss some examples of ancient codes, i.e., genetic code tables that were suggested as predecessors of the current standard genetic code in some theory about the evolution of the genetic code. These codes coded for only a few amino acids and very often used only some of the codons, not all of them. Surprisingly, most of these codes contained a large comma-free sub-code that codes for almost all of the amino acids or were even themselves comma-free. Note that any comma-free code can code for at most 13 amino acids ([16]; see also Table 4 in [26]), while a circular code can code for at most 18 amino acids [27].

Example 1.
• In the generalized co-evolution theory by Di Giulio [20,21], the SNS code (the letter S stands here for the strong nucleotide bases C or G, in contrast to the weak nucleotide bases A or T) was suggested; it codes for the seven amino acids valine, glutamine, alanine, asparagine, glycine and serine.
• In the generalized co-evolution theory by Di Giulio [20,21], the NNS code was suggested; it codes for the seven amino acids valine, glutamine, alanine, asparagine, glycine and serine plus the stop signal.
• The RNY code [19] (R stands for the purine bases A or G, Y for the pyrimidine bases C or T, and N for any base) codes for the eight amino acids glycine, threonine, asparagine, serine, valine, aspartate, isoleucine and alanine and consists of the 16 codons: AAC, AAT, ACC, ACT, AGC, AGT, ATC, ATT, GAC, GAT, GCC, GCT, GGC, GGT, GTC, GTT. This code is comma-free.
• The theory of Jolivet and Rothen [22] yields another primeval code, which codes for the amino acids tyrosine, alanine, phenylalanine, arginine, valine, asparagine, aspartate, leucine, glutamate, glycine, glutamine and isoleucine.
As mentioned in the Introduction, this list of examples of predecessor codes is far from complete, and most of these codes are based on biological concepts. The most important code is the RNY code [19], which has also been statistically observed in genes on the two-letter alphabet {R, Y}. It is comma-free, and it was already shown in [10] that the RNY code can also be constructed by looking at the preferential frame of the RNY codons in genes. In fact, in [10], the authors assigned to each codon a preferential frame (which then led to the discovery of the first maximal self-complementary $C^3$ code), and taking the average frame of the RNY codons, the authors pointed out that this is in fact Frame 0 (see [10], Table 3a-c and the corresponding discussion).
As we can see, the ancient codes discussed in the above Example 1 always contain a large sub-code that is comma-free and encodes almost all of the amino acids used in the code. This is impossible in the current standard genetic code, as we have seen above.
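As a small illustration of the comma-free property of the RNY code, the following sketch builds the code from its pattern and checks Definition 1 directly. It assumes the usual reading of the letters (R = purine A or G, N = any base, Y = pyrimidine C or T); the helper names are hypothetical.

```python
from itertools import product

PURINES = "AG"        # R
ANY_BASE = "ACGT"     # N
PYRIMIDINES = "CT"    # Y


def rny_code():
    """All codons matching the pattern R-N-Y."""
    return {r + n + y for r, n, y in product(PURINES, ANY_BASE, PYRIMIDINES)}


def is_comma_free(code):
    """No codon of the code may appear in frame 1 or 2 of a concatenation x1 x2."""
    code = set(code)
    return not any(
        (x1 + x2)[1:4] in code or (x1 + x2)[2:5] in code
        for x1, x2 in product(code, repeat=2)
    )


RNY = rny_code()
print(len(RNY))            # 16
print(is_comma_free(RNY))  # True
```

The result is not surprising: a frame 1 sub-codon of two RNY codons ends in a purine, and a frame 2 sub-codon starts with a pyrimidine, so neither can match the pattern RNY.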
Since Crick's hypothesis on the usage of comma-free codes consequently had to fail in nature, the next class of codes that was investigated is the class of maximal self-complementary $C^3$ codes.
Definition 2.
We will call a set of codons $X \subseteq B^3$ a trinucleotide circular code if any word over the alphabet $B$ written on a circle has at most one decomposition into words from $X$. By a word written on a circle, we mean that after the last letter, the word starts again (from its first letter). We will call a trinucleotide circular code $X$ maximal if it contains exactly 20 codons.
Circular codes do not allow the detection of a frameshift immediately, as comma-free codes do, but only eventually, after a few codons (see Figure 2). It is obvious that any comma-free code is also circular. The first maximal circular code that was found in nature by Arquès and Michel [10] is: $X_0$ = {AAC, AAT, ACC, ATC, ATT, CAG, CTC, CTG, GAA, GAC, GAG, GAT, GCC, GGC, GGT, GTA, GTC, GTT, TAC, TTC}. The code $X_0$ has even stronger properties. The first one says that the code is not just error-detecting in the normal reading frame, but also in shifts 1 and 2.
Definition 3.
Let $X \subseteq B^3$. We will say that $X$ is a $C^3$ code if $X$, as well as $X_1$ and $X_2$, are circular, where $X_1 := \alpha_1(X)$ and $X_2 := \alpha_2(X)$.
Moreover, the code found by Arquès and Michel was also invariant under forming the anticodons of its members.
Definition 4.
Let $X \subseteq B^3$. We will call $X$ self-complementary if, with each codon $x \in X$, its anticodon is also in $X$.
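Definitions 2 to 4 can be checked on the code $X_0$ with a short script. The sketch below does not test circularity through decompositions on a circle; it relies instead on a graph criterion from the circular code literature, which is not part of this article and is taken here as an assumption: a trinucleotide code is circular if and only if the directed graph with edges $B_1 \to B_2B_3$ and $B_1B_2 \to B_3$ for each codon $B_1B_2B_3$ has no cycle. The codon AAC is included in $X_0$, as self-complementarity requires it as the anticodon of GTT. All function names are illustrative.

```python
X0 = {
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def anticodon(codon):
    """Complement every base and reverse the codon."""
    return codon.translate(COMPLEMENT)[::-1]


def shifted(code, k):
    """Apply the circular shift alpha_k to every codon of the code."""
    return {x[k:] + x[:k] for x in code}


def is_self_complementary(code):
    return all(anticodon(x) in code for x in code)


def is_circular(code):
    """Graph criterion (assumed, see lead-in): circular iff the graph is acyclic."""
    edges = {}
    for b1, b2, b3 in code:
        edges.setdefault(b1, set()).add(b2 + b3)
        edges.setdefault(b1 + b2, set()).add(b3)
    state = {}  # 1 = on the current DFS path, 2 = fully explored

    def has_cycle(v):
        state[v] = 1
        for w in edges.get(v, ()):
            if state.get(w, 0) == 1 or (state.get(w, 0) == 0 and has_cycle(w)):
                return True
        state[v] = 2
        return False

    return not any(state.get(v, 0) == 0 and has_cycle(v) for v in list(edges))


def is_c3(code):
    """Definition 3: the code and both of its shifted codes are circular."""
    return all(is_circular(shifted(code, k)) for k in range(3))


print(len(X0), is_self_complementary(X0), is_circular(X0), is_c3(X0))
```

Under the stated criterion, the script reports 20 codons and `True` for all three properties, in line with the description of $X_0$ as a maximal self-complementary $C^3$ code.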
In recent investigations of comma-free and circular codes, the group of permutations (bijective transformations) of the bases has played a significant role (see, for example, [12,29]). Recall that a permutation of the bases from $B$ is just a bijective shuffling of the bases. Formally, the symmetric group on the set $B$ is defined as $S_B := \{\pi : B \to B \mid \pi \text{ is bijective}\}$ with the group operation of function composition. Bijective transformations $\pi : B \to B$ can be applied componentwise to $x \in B^3$ and, thus, induce a bijective map $B^3 \to B^3$, which we will also denote by $\pi$. Hence, $\pi$ systematically exchanges bases in a codon or sequence of codons, and there are exactly 24 such transformations. A very essential role is played by the complementing map $c : B \to B$ with $c(A) = T$, $c(T) = A$, $c(C) = G$ and $c(G) = C$, which assigns to each base its complementary base. An important property of permutations is the following (see [12,29]): any permutation preserves comma-freeness and circularity, hence also the $C^3$ property.
However, self-complementarity is not always preserved; it is preserved by only eight of the 24 permutations. These permutations, among which we find the complementing map $c$, were characterized in [12] and form a subgroup of the symmetric group $S_B$. Finally, the so-called reversing permutation, which reverses a codon, i.e., $\overleftarrow{B_1B_2B_3} := B_3B_2B_1$, also preserves self-complementarity (see [12]). Note that the anticodon of a codon $x$ can be expressed as $\overleftarrow{c(x)}$. Thus, we have:
If $X$ is a comma-free code, then its reversed code $\overleftarrow{X}$ and its code of anticodons $\overleftarrow{c(X)}$ are both comma-free, as well;
If $X$ is a circular self-complementary code, then its reversed code $\overleftarrow{X}$ and its code of anticodons $\overleftarrow{c(X)}$ are both circular self-complementary, as well.
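The count of eight permutations can be reproduced with a short enumeration. The sketch assumes that the permutations preserving self-complementarity are exactly those that commute with the complementing map $c$; the article refers to [12] for the precise characterization, so this criterion and the helper names are assumptions made for illustration.

```python
from itertools import permutations, product

BASES = "ACGT"
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def all_base_permutations():
    """The 24 bijective maps B -> B, each stored as a dict."""
    return [dict(zip(BASES, image)) for image in permutations(BASES)]


def commutes_with_complement(pi):
    """True if pi(c(b)) == c(pi(b)) for every base b."""
    return all(pi[COMPLEMENT[b]] == COMPLEMENT[pi[b]] for b in BASES)


def apply(pi, codon):
    """Componentwise action of a base permutation on a codon."""
    return "".join(pi[b] for b in codon)


def anticodon(codon):
    return "".join(COMPLEMENT[b] for b in reversed(codon))


perms = all_base_permutations()
commuting = [pi for pi in perms if commutes_with_complement(pi)]
print(len(perms), len(commuting))  # 24 8

# For a commuting permutation, taking anticodons and applying pi can be swapped,
# so the image of a self-complementary code is again self-complementary.
codons = ["".join(t) for t in product(BASES, repeat=3)]
print(all(anticodon(apply(pi, x)) == apply(pi, anticodon(x))
          for pi in commuting for x in codons))  # True
```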
We would like to draw the reader's attention to very interesting works by Seligmann [30-33] that are related to the bijective transformations in $S_B$. In fact, it was shown by Seligmann that parts of actual DNA and RNA sequences are replicated by systematic exchanges of nucleotides, i.e., by applying one of the 24 bijective transformations to them. These sequences are called swinger sequences, and convergence between swinger sequences detected are based on classical PCR sequencing methods.

Distribution of Codons in Maximal Error-Detecting Codes
In this section, we will mainly consider three classes of error-detecting and error-correcting codes that have appeared in the development and theory of the genetic code: the class COM of all maximal comma-free codes, the class CIRC of all maximal circular self-complementary codes and the class C3 of all maximal self-complementary $C^3$ codes. We are interested in the frequency of codons appearing in such codes, i.e., we will determine for each codon the number of codes from the above three classes in which it appears. We start with the following:
Definition 5.
Let $x \in B^3$ be a codon, and let $\mathcal{K}$ be either the class COM of all maximal comma-free codes, or the class CIRC of all maximal circular self-complementary codes, or the class C3 of all maximal self-complementary $C^3$ codes. Then $u_{\mathcal{K}}(x) := |\{K \in \mathcal{K} : x \in K\}|$ denotes the number of codes $K$ from the class $\mathcal{K}$ such that $x$ belongs to $K$. The number $u_{\mathcal{K}}(x)$ is called the frequency class number of $x$ with respect to $\mathcal{K}$.
A first easy observation is that for any codon $x$ and any of the above classes $\mathcal{K}$, the frequency class number of $x$ with respect to $\mathcal{K}$ is the same as that of the anticodon $\overleftarrow{c(x)}$, as well as that of the reversed codon $\overleftarrow{x}$ (for comma-free codes this follows from Equations (2) and (3) above):
$u_{\mathcal{K}}(x) = u_{\mathcal{K}}(\overleftarrow{c(x)}) = u_{\mathcal{K}}(\overleftarrow{x})$. (4)
Moreover, by the maximality of the codes in all of the above classes, we also have:
$u_{\mathcal{K}}(x) + u_{\mathcal{K}}(\alpha_1(x)) + u_{\mathcal{K}}(\alpha_2(x)) = |\mathcal{K}|$ (5)
for all codons $x$ and classes $\mathcal{K}$ = COM, $\mathcal{K}$ = C3, $\mathcal{K}$ = CIRC. Recall that $\alpha_1(x)$ and $\alpha_2(x)$ are the circular permutations of $x$.
We now show all frequency class numbers of codons with respect to the three classes of codes CIRC, C3 and COM. Recall first that there are exactly 408 maximal comma-free codes and that clearly any codon contained in a comma-free code either consists of three different bases or has exactly two identical bases. Thus, the cases in the following theorem cover all possible codons.
Theorem 6.
Let $x \in B^3$ be a codon and $\mathcal{K}$ = COM the class of all maximal comma-free codes. Then, the following statements are true:
(1) If $x = B_1B_2B_1$ with $B_1 \neq B_2$, then $u_{\mathrm{COM}}(x) = 184$;
(2) If $x = B_1B_1B_2$ or $x = B_1B_2B_2$ with $B_1 \neq B_2$, then $u_{\mathrm{COM}}(x) = 112$;
(3) If $x = B_1B_2B_3$ with $B_1, B_2, B_3$ all different, then $u_{\mathrm{COM}}(x) = 136$.
In particular, there are only three different frequency class numbers 112, 136 and 184 for codons with respect to the class COM of all maximal comma-free codes.
The following Table 1 illustrates the result of Theorem 6 above, showing the frequency class numbers of codons with respect to COM. Recall that the trivial codons AAA, CCC, GGG, TTT can never be part of any error-detecting system; hence, there are only 60 codons shown in the table. Codons colored in the same color have the same frequency class number. Theorem 6 was obtained by computer calculations, but proofs of some parts of Theorem 6 can be found in Appendix A. However, it is easy to see why there are only three different frequency class numbers. The reason for this is of a group-theoretic nature. Since any permutation $\pi \in S_B$ carries a maximal comma-free code into a maximal comma-free code, it follows that for any codon $x \in B^3$, we have $u_{\mathrm{COM}}(x) = u_{\mathrm{COM}}(\pi(x))$. Thus, all codons consisting of three different bases must have the same frequency class number with respect to COM. Moreover, those with two identical bases in positions 1 and 3 must have the same frequency class number, and finally, the codons with two identical bases in positions 1 and 2, as well as in positions 2 and 3, must have the same frequency class number. The latter follows from Equation (4).
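The computer calculation behind Theorem 6 can be sketched as a backtracking search. This is not the authors' program, only a minimal reconstruction under two assumptions: a maximal comma-free code takes exactly one codon from each of the 20 classes $\{x, \alpha_1(x), \alpha_2(x)\}$, and a partial selection that already violates Definition 1 can be discarded. All names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def conjugacy_classes():
    """The 20 classes {x, alpha_1(x), alpha_2(x)} of non-periodic codons."""
    seen, classes = set(), []
    for x in map("".join, product(BASES, repeat=3)):
        if x in seen or len(set(x)) == 1:
            continue
        cls = [x, x[1:] + x[:1], x[2:] + x[:2]]
        seen.update(cls)
        classes.append(cls)
    return classes


def is_comma_free(code):
    """z violates Definition 1 if z = x2 x3 y1 or z = x3 y1 y2 for some x, y in the code."""
    first = {x[0] for x in code}
    last = {x[2] for x in code}
    prefixes = {x[:2] for x in code}
    suffixes = {x[1:] for x in code}
    return not any(
        (z[:2] in suffixes and z[2] in first) or (z[0] in last and z[1:] in prefixes)
        for z in code
    )


def maximal_comma_free_codes():
    """Choose one codon per class; prune as soon as the partial code fails."""
    classes, found = conjugacy_classes(), []

    def extend(i, chosen):
        if i == len(classes):
            found.append(frozenset(chosen))
            return
        for x in classes[i]:
            chosen.append(x)
            if is_comma_free(chosen):
                extend(i + 1, chosen)
            chosen.pop()

    extend(0, [])
    return found


codes = maximal_comma_free_codes()
u_com = Counter(x for code in codes for x in code)  # frequency class numbers
print(len(codes))
print(sorted(set(u_com.values())))
print(u_com["ACA"], u_com["AAC"], u_com["ACG"])
```

If the sketch is faithful to Definition 1, the first line of output is the number of maximal comma-free codes, and the remaining lines can be compared with the three frequency class numbers of Theorem 6 for the codon forms $B_1B_2B_1$, $B_1B_1B_2$ and $B_1B_2B_3$.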
The following Table 2 illustrates the result from Theorem 6 above showing the distribution of codons with respect to COM in the standard genetic code table.
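The next theorem gives the same characterization of frequency class numbers with respect to the class CIRC of maximal circular self-complementary codes. Note that the number of such codes is exactly 528. Moreover, note that any codon either has two identical bases, and then the third base is either the complementary one (Cases (1) and (2) in the next theorem) or not the complementary one (Cases (3) and (4) below), or the codon has three different bases, and in this case, two of them must be complementary to each other (Cases (5) and (6)). Thus, the case distinction in the following theorem covers all possible codons.
Theorem 7.
Let $x \in B^3$ be a codon and $\mathcal{K}$ = CIRC the class of all maximal circular self-complementary codes. Then, the following holds:
(1) If $x = B_1c(B_1)B_1$, then $u_{\mathrm{CIRC}}(x) = 0$;
(2) If $x = B_1B_1c(B_1)$ or $x = c(B_1)B_1B_1$, then $u_{\mathrm{CIRC}}(x) = 264$;
(3) If $x = B_1B_1B_2$ or $x = B_2B_1B_1$ with $B_2 \notin \{B_1, c(B_1)\}$, then $u_{\mathrm{CIRC}}(x) = 187$;
(4) If $x = B_1B_2B_1$ with $B_2 \notin \{B_1, c(B_1)\}$, then $u_{\mathrm{CIRC}}(x) = 154$;
(5) If $x = B_1B_2c(B_1)$ with $B_2 \notin \{B_1, c(B_1)\}$, then $u_{\mathrm{CIRC}}(x) = 234$;
(6) If $x = B_1c(B_1)B_2$ or $x = B_2B_1c(B_1)$ with $B_2 \notin \{B_1, c(B_1)\}$, then $u_{\mathrm{CIRC}}(x) = 147$.
In particular, there are only the six different frequency class numbers, 0, 147, 154, 187, 234 and 264, for codons with respect to the class CIRC of all maximal circular self-complementary codes.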
The following Table 3 illustrates the result of Theorem 7 above, showing the frequency class numbers of codons with respect to CIRC. Again, recall that the trivial codons AAA, CCC, GGG, TTT can never be part of any error-detecting system; hence, there are only 60 codons shown in the table. As for Theorem 6, the results of Theorem 7 were found by computer calculations, but a mathematical proof of some parts of the theorem can be found in Appendix B. Again, group theory shows why there are only a few different frequency class numbers. However, this time, there are only eight transformations $\pi \in S_B$ that carry self-complementary circular codes into self-complementary circular codes. These eight transformations were classified as the dihedral group $L$ in [12]. Thus, for these eight transformations and any codon $x \in B^3$, we have $u_{\mathrm{CIRC}}(x) = u_{\mathrm{CIRC}}(\pi(x))$. Hence, all codons consisting of three different bases must have the same frequency class number with respect to CIRC if they can be mapped onto each other by a permutation from $L$. The same holds for those codons with two identical bases in two out of the three positions.
The following Table 4 illustrates the result from Theorem 7 above showing the distribution of codons with respect to CIRC in the standard genetic code table.
A surprising fact is that the partition of the codons given by their frequency class numbers with respect to the class CIRC of all maximal circular self-complementary codes is a refinement of the partition given by their frequency class numbers with respect to the class COM of all maximal comma-free codes. This is not at all clear, since the two classes COM and CIRC are disjoint: no maximal comma-free code is self-complementary. We will come back to this point after the next theorem and its illustration.
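We finally show the frequency class numbers of codons with respect to the class C3 of all maximal self-complementary $C^3$ codes. Note that the number of such codes is exactly 216. Moreover, note that as above, the case distinction in the following theorem covers all possible codons.
Theorem 8.
Let $x \in B^3$ be a codon and $\mathcal{K}$ = C3 the class of all maximal self-complementary $C^3$ codes. Then, the following holds:
(1) If $x = B_1B_1c(B_1)$ or $x = c(B_1)B_1B_1$, then $u_{\mathrm{C3}}(x) = 108$;
(2) If $x = B_1c(B_1)B_1$, then $u_{\mathrm{C3}}(x) = 0$;
(3) If $x = B_1B_1B_2$, $x = B_1B_2B_1$ or $x = B_2B_1B_1$ with $B_2 \notin \{B_1, c(B_1)\}$, then $u_{\mathrm{C3}}(x) = 72$;
(4) If $x = B_1B_2c(B_1)$ with $B_2 \notin \{B_1, c(B_1)\}$, then $u_{\mathrm{C3}}(x) = 98$;
(5) If $x = B_1c(B_1)B_2$ or $x = B_2B_1c(B_1)$ with $B_2 \notin \{B_1, c(B_1)\}$, then $u_{\mathrm{C3}}(x) = 59$.
In particular, there are only the five different frequency class numbers, 0, 59, 72, 98 and 108, for codons with respect to the class C3 of all maximal self-complementary $C^3$ codes.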
As for Theorems 6 and 7, the results in Theorem 8 were discovered by computer calculation; however, some parts of the above theorem are proven in Appendix C. As for Theorem 7, the action of the dihedral group L on the set C3 explains why there are only a few different frequency classes.
The following Tables 5 and 6 illustrate the result of Theorem 8 above, showing the frequency class numbers of codons with respect to C3 and their distribution in the standard genetic code table. Recall once more that the trivial codons AAA, CCC, GGG, TTT can never be part of any error-detecting system; hence, there are only 60 codons shown in the tables. Since the maximal self-complementary $C^3$ codes are a subset of the set of maximal circular self-complementary codes, it is clear that the splitting of codons with respect to their frequency class numbers cannot be significantly different. Nevertheless, as we can see, it is not completely the same. The Classes (3) and (4) from Theorem 7 merge into one class in Theorem 8 (Class (3)), due to the additional $C^3$ property. Thus, the frequency class numbers of codons with respect to the class CIRC are a refinement of the ones with respect to the class C3.
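The counts behind Theorems 7 and 8 can be reproduced along the same lines. The following sketch is not the authors' program. It assumes the graph criterion for circularity from the circular code literature (a code is circular if and only if the graph with edges $B_1 \to B_2B_3$ and $B_1B_2 \to B_3$ for each codon is acyclic), and it uses the fact that the anticodon map pairs up the 20 shift classes, so that a maximal self-complementary code is fixed by ten choices. All names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")
CODONS = [x for x in map("".join, product(BASES, repeat=3)) if len(set(x)) > 1]


def anticodon(x):
    return x.translate(COMPLEMENT)[::-1]


def is_circular(code):
    """Assumed graph criterion: circular iff the associated graph has no cycle."""
    edges = {}
    for b1, b2, b3 in code:
        edges.setdefault(b1, set()).add(b2 + b3)
        edges.setdefault(b1 + b2, set()).add(b3)
    state = {}  # 1 = on the current DFS path, 2 = fully explored

    def has_cycle(v):
        state[v] = 1
        for w in edges.get(v, ()):
            if state.get(w, 0) == 1 or (state.get(w, 0) == 0 and has_cycle(w)):
                return True
        state[v] = 2
        return False

    return not any(state.get(v, 0) == 0 and has_cycle(v) for v in list(edges))


def is_c3(code):
    return all(is_circular({x[k:] + x[:k] for x in code}) for k in range(3))


def class_pairs():
    """One shift class out of each of the ten pairs (class, class of anticodons)."""
    seen, pairs = set(), []
    for x in CODONS:
        if x in seen:
            continue
        cls = [x[k:] + x[:k] for k in range(3)]
        seen.update(cls)
        seen.update(anticodon(y) for y in cls)
        pairs.append(cls)
    return pairs


def self_complementary_codes(accept):
    """Backtracking over the ten pairs; accept must be inherited by subsets."""
    pairs, found = class_pairs(), []

    def extend(i, code):
        if i == len(pairs):
            found.append(code)
            return
        for x in pairs[i]:
            candidate = code | {x, anticodon(x)}
            if accept(candidate):
                extend(i + 1, candidate)

    extend(0, frozenset())
    return found


def frequency_values(codes):
    """Set of frequency class numbers over all 60 non-periodic codons."""
    u = Counter(x for code in codes for x in code)
    return {u[x] for x in CODONS}


circ_codes = self_complementary_codes(is_circular)
c3_codes = self_complementary_codes(is_c3)
print(len(circ_codes), sorted(frequency_values(circ_codes)))
print(len(c3_codes), sorted(frequency_values(c3_codes)))
```

Under these assumptions, the two printed lines can be compared with the number of codes and the frequency class numbers stated in Theorems 7 and 8.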

Results, Discussion and Conclusions
In this work, we have investigated the frequency class numbers of codons with respect to the three important classes of error-detecting codes that play a role in the theory of the genetic code: the class COM of all maximal comma-free codes, the class C3 of all maximal self-complementary $C^3$ codes and finally, the class CIRC of all maximal self-complementary circular codes. The results show two surprising facts. Firstly, for each of the classes COM, C3 and CIRC, there are only very few frequency class numbers of codons. Secondly, the frequency class numbers yield partitions of the set of codons that become finer when passing from the class COM via the class C3 to the class CIRC (see Table 7 for a visualization of this refinement). The existence of only a few frequency class numbers for each of the classes COM, C3 and CIRC is explained by group theory. Moreover, proofs of parts of the main Theorems 6-8 are given in the Appendix. The main result, however, is the refinement property shown in Table 7. Since the class C3 of maximal self-complementary $C^3$ codes is a subclass of the class CIRC of all maximal self-complementary circular codes, the refinement property of the corresponding frequency class numbers is a consequence. However, the first refinement from the class COM of all maximal comma-free codes to the class C3 of all maximal self-complementary $C^3$ codes is a surprise. No maximal comma-free code is self-complementary; hence, the two classes COM and C3 (even COM and CIRC) are disjoint.
That the frequency class numbers with respect to C3 are still a refinement of the frequency class numbers with respect to COM is a clear hint at a relation between the two classes of codes. It supports the theory that the genetic code in its present form evolved from earlier ancient codes in such a way that stronger error-detecting and error-correcting properties were weakened to codes that still allow error detection and error correction, but in a less effective form. Ancient codes used fewer codons and coded for fewer amino acids; hence, comma-free codes, which detect a frameshift immediately in a reading window of only three bases, could be incorporated. As soon as the genetic codes became more complex, involving all codons and coding for a larger number of amino acids, the weaker circular codes, which detect frameshifts eventually and in a larger reading window (of 13 nucleotide bases), had to take over the error-detection and error-correction function.

Appendix A. Proof of Theorem 6
Proof. Clearly, any codon $x \in B^3 \setminus \{AAA, CCC, GGG, TTT\}$ is either of the form $B_1B_2B_1$, or $B_1B_1B_2$, or $B_1B_2B_2$, or $B_1B_2B_3$ with $B_i \in B$ all different, i.e., either the codon consists of three different nucleic bases or two bases are identical. Thus, Cases (1)-(3) of Theorem 6 form a complete case distinction. Moreover, since any bijective transformation $\pi \in S_B$, as well as the reversing transformation, preserves comma-freeness, it is obvious that any two codons $x, x' \in B^3$ must have the same frequency class numbers $u_{\mathrm{COM}}(x) = u_{\mathrm{COM}}(x')$, provided they can be mapped onto each other by such transformations. We now first prove that this is indeed the case in all three cases of Theorem 6.
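(1) Let $x, x' \in B^3$ be two codons of the form $x = B_1B_2B_1$ and $x' = B_1'B_2'B_1'$ such that $B_1 \neq B_2$ and $B_1' \neq B_2'$. Obviously, any bijective transformation $\pi \in S_B$ with $\pi(B_1) = B_1'$ and $\pi(B_2) = B_2'$ maps $x$ onto $x'$.
(2) Let $x, x' \in B^3$ be two codons of the form $x = B_1B_1B_2$ and $x' = B_1'B_1'B_2'$ such that $B_1 \neq B_2$ and $B_1' \neq B_2'$. As above, any bijective transformation $\pi \in S_B$ with $\pi(B_1) = B_1'$ and $\pi(B_2) = B_2'$ maps $x$ onto $x'$. A codon of the form $B_1B_2B_2$ is mapped onto the codon $B_2B_2B_1$, which is of the form just treated, by the reversing transformation.
(3) Let $x = B_1B_2B_3$ and $x' = B_1'B_2'B_3'$ be two codons with all $B_i$ different and all $B_i'$ different. Similar to the other two cases, we define the bijective transformation $\delta \in S_B$ the following way: $\delta(B_i) = B_i'$ for $i = 1, 2, 3$, and $\delta$ maps the remaining base to the remaining base. Then, $\delta$ maps $x$ onto $x'$.
We have thus shown that there are only three possible frequency class numbers for codons with respect to the class COM of maximal comma-free codes. The exact values of these frequency class numbers have been found by computer calculations. However, for Case (3), we even have a proof. Given a codon $x = B_1B_2B_3 \in B^3$ with all $B_i$ different, it is readily seen that the shifted codons $\alpha_1(x)$ and $\alpha_2(x)$ are of the same form. Thus, we have $u_{\mathrm{COM}}(x) = u_{\mathrm{COM}}(\alpha_1(x)) = u_{\mathrm{COM}}(\alpha_2(x))$. Equation (5) implies that $u_{\mathrm{COM}}(x) = 408 / 3 = 136$. Moreover, there is a relation between the frequency class numbers $u_1, u_2$ of codons from Case (1) and Case (2). Given a codon $x = B_1B_2B_1 \in B^3$ from Case (1), the shifted codons $\alpha_1(x) = B_2B_1B_1$ and $\alpha_2(x) = B_1B_1B_2$ are codons of the form described in Case (2). Thus, it follows from Equation (5) that $u_1 + 2 \cdot u_2 = 408$. In particular, if $u_2 = 112$, then $u_1 = 408 - 2 \cdot 112 = 184$.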

Appendix B. Proof of Theorem 7
Proof. It is easy to see that the cases described in Theorem 7 give a complete case distinction for the set of codons $B^3 \setminus \{AAA, CCC, GGG, TTT\}$. In fact, any such codon has to be of one of the forms described in (1)-(6). As in the proof of Theorem 6, it suffices to show that codons of the same form can be mapped onto each other by some bijective transformation or the reversing transformation in order to show that the corresponding frequency class numbers are the same. However, since we are dealing with the class CIRC of all maximal self-complementary circular codes, we need to find bijective transformations that preserve self-complementarity. These permutations have been classified in [12] as a subgroup $L$ of the permutation group $S_B$. In fact, the group $L$ consists of the eight transformations that commute with the complementing map $c$; in cycle notation, these are the identity id, $c = (A\,T)(C\,G)$, $(A\,C)(G\,T)$, $(A\,G)(C\,T)$, $(A\,T)$, $(C\,G)$, $(A\,C\,T\,G)$ and $(A\,G\,T\,C)$, and the permutations $p$ and $r$ used below belong to this list. We now show that indeed in any of the cases of Theorem 7, (1)-(6), we can find such a permutation $\pi$ from $L$.
(1) If $x = B_1c(B_1)B_1 \in B^3$ and $x' = B_1'c(B_1')B_1' \in B^3$, then either $\pi$ = id (in case $B_1 = B_1'$) or one of the three permutations $c$, $p$, $r$ (in case $B_1 \neq B_1'$) does the job.
(2) As in Case (1), either the identity id or one of the three permutations $c$, $p$, $r$ does the job in combination with the reversing transformation.
As in Theorem 6, the exact values of the frequency class numbers have been determined by computer calculations. However, in Case (1), it is also easy to see that the frequency class number is zero. To see this, let $x = B_1c(B_1)B_1$, $B_1 \in B$, be a codon in a maximal self-complementary circular code. It follows that its anticodon $x' = c(B_1)B_1c(B_1)$ is also in the code due to self-complementarity. However, then the word $xx' = B_1c(B_1)B_1c(B_1)B_1c(B_1)$ has two decompositions when read on the circle. This contradicts the circularity of the code (compare also [12]). Thus, no codon of the form $B_1c(B_1)B_1$, $B_1 \in B$, can be contained in a code from CIRC, and therefore, its frequency class number is zero. Moreover, as in the proof of Theorem 6, one can show that the frequency class numbers $u_1$-$u_6$ from the Cases (1)-(6) of Theorem 7 satisfy the following equations: $u_1 + 2 \cdot u_2 = 528$ and $2 \cdot u_3 + u_4 = 528$, as well as $2 \cdot u_6 + u_5 = 528$. In particular, it follows that $u_2 = \frac{528 - 0}{2} = 264$.

Appendix C. Proof of Theorem 8
Proof. As in the proof of Theorem 7, we can show that the cases in Theorem 8 form a complete case distinction and that the frequency class numbers are the same for codons of the same forms described in (1)-(5) of Theorem 8. The exact values are once more determined by computer calculations. In fact, we have for the frequency class numbers $u_1$ to $u_5$ from the cases of Theorem 8 that $2 \cdot u_1 + u_2 = 216$ and $3 \cdot u_3 = 216$, as well as $u_4 + 2 \cdot u_5 = 216$. In particular, we have $u_1 = \frac{216 - 0}{2} = 108$ and $u_3 = \frac{216}{3} = 72$.
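The linear relations used in Appendices A to C can be checked against the frequency class numbers stated in Theorems 6 to 8 with a few lines of arithmetic. In the sketch below, each relation is written as a list of (multiplicity, frequency class number) pairs; the pairing of values within a relation is the one that makes the relation hold, and the variable names are illustrative.

```python
CLASS_SIZES = {"COM": 408, "CIRC": 528, "C3": 216}

# Each entry: (class of codes, [(multiplicity, frequency class number), ...]).
# The weighted sum over one shift class {x, alpha_1(x), alpha_2(x)} must equal
# the number of codes in the class of codes.
RELATIONS = [
    ("COM", [(1, 184), (2, 112)]),
    ("COM", [(3, 136)]),
    ("CIRC", [(1, 0), (2, 264)]),
    ("CIRC", [(2, 187), (1, 154)]),
    ("CIRC", [(2, 147), (1, 234)]),
    ("C3", [(2, 108), (1, 0)]),
    ("C3", [(3, 72)]),
    ("C3", [(1, 98), (2, 59)]),
]


def relation_holds(name, terms):
    """True if the weighted sum of frequency class numbers equals the class size."""
    return sum(mult * value for mult, value in terms) == CLASS_SIZES[name]


all_hold = all(relation_holds(name, terms) for name, terms in RELATIONS)
print(all_hold)  # True
```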