Single-Frame, Multiple-Frame and Framing Motifs in Genes

We study the distribution of new classes of motifs in genes, a research field that has not been investigated to date. A single-frame motif SF has no trinucleotide in reading frame (frame 0) that occurs in a shifted frame (frame 1 or 2), e.g., the dicodon AAACAA is SF as the trinucleotides AAA and CAA do not occur in a shifted frame. A motif which is not single-frame SF is multiple-frame MF. Several classes of MF motifs are defined and analysed. The distributions of single-frame SF motifs (associated with an unambiguous trinucleotide decoding in the two 5′–3′ and 3′–5′ directions) and 5′ unambiguous motifs 5′U (associated with an unambiguous trinucleotide decoding in the 5′–3′ direction only) are analysed without and with constraints. The constraints studied are: initiation and stop codons, periodic codons {AAA,CCC,GGG,TTT}, antiparallel complementarity and parallel complementarity. Taken together, these results suggest that the complementarity property involved in the antiparallel (DNA double helix, RNA stem) and parallel sequences could also be fundamental for coding genes with an unambiguous trinucleotide decoding in the two 5′–3′ and 3′–5′ directions or the 5′–3′ direction only. Furthermore, the single-frame motifs SF with a property of trinucleotide decoding and the framing motifs F (also called circular code motifs; first introduced by Michel (2012)) with a property of reading frame decoding may have been involved in the early life genes to build the modern genetic code and the extant genes. They could have been involved in the stage without anticodon-amino acid interactions or in the Implicated Site Nucleotides (ISN) of RNA interacting with the amino acids. Finally, the SF and MF dipeptides associated with the SF and MF dicodons, respectively, are studied and their importance for biology and the origin of life discussed.


Introduction
The reading frame coding with trinucleotide sets is a fascinating problem, both theoretical and experimental. Before the discovery of the genetic code, a first code was proposed by Gamow [1] by considering the "key-and-lock" relation between various amino acids and the rhomb-shaped "holes" formed by various nucleotides in the DNA. The proposed model later proved to be false. A few years later, a class of trinucleotide codes, called comma-free codes, was proposed by Crick et al. [2] to explain how the reading of a sequence of trinucleotides could code amino acids and, in particular, how the correct reading frame can be retrieved and maintained. The four nucleotides {A,C,G,T} as well as the 16 dinucleotides {AA, . . . ,TT} are simple codes which are not appropriate for coding 20 amino acids. However, trinucleotides induce a redundancy in their coding. Thus, Crick et al. [2] conjectured that only 20 trinucleotides among the 64 possible trinucleotides {AAA, . . . ,TTT} code for the 20 amino acids. Such a bijective code implies that the coding trinucleotides are found only in one frame, which is the comma-freeness property. The determination of a set of 20 trinucleotides forming a comma-free code has several necessary conditions:
(i) A periodic trinucleotide from the set {AAA,CCC,GGG,TTT} must be excluded from such a code. Indeed, the concatenation of AAA with itself, for instance, does not allow the reading frame to be retrieved as the same trinucleotide is read in every decomposition: . . . ,AAA,AAA,AAA, . . . (original frame), . . . A,AAA,AAA,AA . . . and . . . AA,AAA,AAA,A . . . (shifted frames).
(ii) Two non-periodic permuted trinucleotides, i.e., two trinucleotides related by a circular permutation, e.g., ACG and CGA, must also be excluded from such a code. Indeed, the concatenation of ACG with itself, for instance, does not allow the reading frame to be retrieved as there are two possible decompositions: . . . ,ACG,ACG,ACG, . . . (original frame) and . . . A,CGA,CGA,CG . . . (shifted frame).
Therefore, by excluding the four periodic trinucleotides and by gathering the 60 remaining trinucleotides in 20 classes of three trinucleotides such that, in each class, the three trinucleotides are deduced from each other by a circular permutation, e.g., ACG, CGA and GAC, we see that a comma-free code can contain only one trinucleotide from each class and thus has at most 20 trinucleotides. This trinucleotide number is identical to the amino acid number, thus leading to a code assigning one trinucleotide per amino acid without ambiguity.
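The counting argument above can be checked with a short sketch. The Python code below is only an illustration (the names `circular_permutations` and `classes` are hypothetical, not taken from the paper): it separates the four periodic trinucleotides from the 60 others and groups the latter into classes of circular permutations.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def circular_permutations(t):
    """Return the set of the three circular permutations of a trinucleotide."""
    return {t, t[1:] + t[0], t[2] + t[:2]}


all_trinucleotides = ["".join(p) for p in product(NUCLEOTIDES, repeat=3)]

# Periodic trinucleotides are made of a single repeated nucleotide.
periodic = [t for t in all_trinucleotides if len(set(t)) == 1]
non_periodic = [t for t in all_trinucleotides if len(set(t)) > 1]

# Each class gathers trinucleotides deduced from each other by circular permutation.
classes = {frozenset(circular_permutations(t)) for t in non_periodic}

print(len(periodic), len(non_periodic), len(classes))
```

Since a comma-free code can take at most one trinucleotide per class, the number of classes is the upper bound on its size.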
In the early 1960s, the discovery that the trinucleotide TTT, an excluded trinucleotide in a comma-free code, codes phenylalanine [3] led to the abandonment of the concepts of both a comma-free code [2] and a bijective code, as the genetic code is degenerate [4][5][6] with a gene translation in one direction [7].
In 1996, a statistical analysis of occurrence frequencies of the 64 trinucleotides in the three frames of genes of both prokaryotes and eukaryotes showed that the trinucleotides are not uniformly distributed in these three frames [8]. By excluding the four periodic trinucleotides and by assigning each trinucleotide to a preferential frame (frame of its highest occurrence frequency), three subsets $X = X_0$, $X_1$ and $X_2$ of 20 trinucleotides each are found in the frames 0 (reading frame), 1 (frame 0 shifted by one nucleotide in the 5′–3′ direction, i.e., to the right) and 2 (frame 0 shifted by two nucleotides in the 5′–3′ direction) in genes of both prokaryotes and eukaryotes. The same set $X$ of trinucleotides was identified on average in genes (reading frame) of bacteria, archaea, eukaryotes, plasmids and viruses [9,10]. It contains the 20 following trinucleotides:

X = {AAC, AAT, ACC, ATC, ATT, CAG, CTC, CTG, GAA, GAC, GAG, GAT, GCC, GGC, GGT, GTA, GTC, GTT, TAC, TTC} (1)

and codes the 12 following amino acids (three and one letter notation):

{Ala, Asn, Asp, Gln, Glu, Gly, Ile, Leu, Phe, Thr, Tyr, Val} = {A, N, D, Q, E, G, I, L, F, T, Y, V}.

Life 2019, 9, 18

In eukaryotes, the slippery motif fits a consensus heptanucleotide X,XXY,YYZ, where XXX is any three identical nucleotides, YYY represents AAA or TTT, Z represents A, C or T, and the commas separate the codons in reading frame [15,16]. The slippery motifs $MF_1$ = A,AAA,AAZ and $MF_2$ = T,TTT,TTZ are multiple-frame MF. Indeed, the codon AAA in reading frame also occurs in the shifted frames 1 and 2 in $MF_1$, and similarly with the codon TTT in $MF_2$. Alternative gene decoding is also possible with +1 programmed ribosomal frameshifting (+1 PRF), which has been particularly observed in Euplotes [17]. The identified slippery motif TTT,TAR where R = {A, G} is multiple-frame MF. The slippery motifs AAA, CCC, GGG and TTT may cause frameshifting during transcription, producing RNAs missing specific nucleotides when compared to template DNA [18,19].
The slippery motifs are not always multiple-frame, and it should be stressed that the spacer and the downstream stimulatory motifs have been very poorly characterized [20] and could also be involved in such a multiple-frame definition. From a theoretical point of view, it is important to extend this concept by increasing the length of such multiple-frame slippery motifs and also by considering their different classes. While the multiple-frame motifs may be involved in ribosomal frameshifting, the single-frame motifs SF and the framing motifs F (also called circular code motifs; first introduced in Michel [21,22]) from the circular codes [8][9][10] (reviews in Michel [23]; Fimmel and Strüngmann [24]) may have been important in early life genes for constructing the modern genetic code and the extant genes (see Discussion). Several classes of MF motifs are defined:
(i) a unidirectional multiple-frame motif 3′UMF has no trinucleotide in reading frame that occurs in a shifted frame after its reading (i.e., its position in the reading frame) but has at least one trinucleotide in reading frame that occurs in a shifted frame before its reading, e.g., the dicodon AACACA is 3′UMF as the trinucleotides AAC and (trivially) ACA do not occur in a shifted frame after their reading and as the trinucleotide ACA occurs in a shifted frame (precisely frame 1) before its reading;
(ii) a unidirectional multiple-frame motif 5′UMF, the opposite, has no trinucleotide in reading frame that occurs in a shifted frame before its reading but has at least one trinucleotide in reading frame that occurs in a shifted frame after its reading, e.g., the dicodon ACACAA, mirror of AACACA, is 5′UMF as the trinucleotides (trivially) ACA and CAA do not occur in a shifted frame before their reading and as the trinucleotide ACA occurs in a shifted frame (precisely frame 2) after its reading; and
(iii) a bidirectional multiple-frame motif BMF has at least one trinucleotide in reading frame that occurs in a shifted frame before its reading and has at least one trinucleotide in reading frame that occurs in a shifted frame after its reading (both 3′UMF and 5′UMF), e.g., the dicodons AAAAAA and ACACAC are BMF.
A 5′ unambiguous motif 5′U is either a SF motif or a 3′UMF motif, e.g., the dicodons AAACAA (SF motif) and AACACA (3′UMF motif) belong to the class 5′U.
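The four classes described above can be sketched in code. The following Python function is a minimal illustration under one assumption: a shifted occurrence counts as "before" the reading of the trinucleotide at position i when its position j in the shifted frame satisfies i > j, and as "after" otherwise, which is the convention suggested by the dicodon examples. The names `frames` and `classify` and the ASCII labels are hypothetical.

```python
def frames(motif):
    """Split a motif into its reading-frame and shifted-frame trinucleotides."""
    n = len(motif) // 3
    reading = [motif[3 * i:3 * i + 3] for i in range(n)]
    # Frame f (1 or 2) starts f nucleotides to the right and holds n - 1 trinucleotides.
    shifted = {f: [motif[3 * j + f:3 * j + f + 3] for j in range(n - 1)]
               for f in (1, 2)}
    return reading, shifted


def classify(motif):
    """Return 'SF', "3'UMF", "5'UMF" or 'BMF' for a motif of n trinucleotides."""
    reading, shifted = frames(motif)
    before = after = False
    for i, t in enumerate(reading, start=1):
        for f in (1, 2):
            for j, s in enumerate(shifted[f], start=1):
                if t == s:
                    if i > j:
                        before = True  # shifted copy lies before the reading of t
                    else:
                        after = True   # shifted copy lies after the reading of t
    if before and after:
        return "BMF"
    if before:
        return "3'UMF"
    if after:
        return "5'UMF"
    return "SF"


for dicodon in ("AAACAA", "AACACA", "ACACAA", "AAAAAA", "ACACAC"):
    print(dicodon, classify(dicodon))
```

With this sketch, a motif belongs to the class 5′U when `classify` returns `'SF'` or `"3'UMF"`.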
We will only investigate here the distribution of the single-frame motifs SF associated with an unambiguous trinucleotide decoding in the two 5′–3′ and 3′–5′ directions, and the 5′ unambiguous motifs 5′U associated with an unambiguous trinucleotide decoding in the 5′–3′ direction only, i.e., a less restrictive class of motifs. The distributions of SF and 5′U motifs will be analysed without and with constraints. The constraints studied are: (i) with initiation and stop codons; (ii) without periodic codons {AAA, CCC, GGG, TTT}; (iii) with antiparallel complementarity; and (iv) with parallel complementarity.
We will also investigate the particular case of motifs made up of two codons, i.e., the dicodons. The definitions of SF and MF dicodons will thus identify two new classes of dipeptides, the SF and MF dipeptides. The SF dipeptides are coded by dicodons with an unambiguous trinucleotide decoding, in contrast to the MF dipeptides which are coded by dicodons with an ambiguous trinucleotide decoding. The concept of SF and MF dipeptides might be of predictive value to studies of prebiotic metabolites [25]. Peptide evolution on the primitive earth is an active and exciting field of research with cyclic dipeptides [26] and selective formation of SerHis dipeptide via phosphorus activation [27,28].

Example 2. Let a framing motif $F_1$ = ...AGGTAATTACCAG... be constructed with the circular code $X$ (1) identified in genes of bacteria, archaea, eukaryotes, plasmids and viruses [8][9][10].
(i) Such a framing motif $F_1$ can be obtained as follows. A sequence $s$ of trinucleotides of $X$ is generated and a substring is extracted at any position in this sequence $s$, i.e., the series of nucleotides on the right and the left of the substring are not considered. Let this substring be $F_1$.
(ii) This framing motif $F_1$ allows the reading frame to be retrieved (Figure 1). We try the three possible decompositions $w_0$, $w_1$ (shifted by one letter to the right) and $w_2$ (shifted by two letters to the right) of $F_1$. With $w_0$, AG is not a prefix of any trinucleotide of $X$, thus the frame associated with $w_0$ is impossible. With $w_2$, AG is a suffix of CAG and GAG belonging to $X$, then GTA, ATT and ACC belong to $X$, followed by A which is a prefix of five trinucleotides of $X$. Thus at this position, the frame associated with $w_2$ is still possible and 2 + 3 × 3 + 1 = 12 nucleotides are read. The next letter G leads to AG which is not a prefix of any trinucleotide of $X$. Thus, a window of 12 + 1 = 13 nucleotides demonstrates that the frame associated with $w_2$ is impossible. With $w_1$, A is a suffix of GAA and GTA belonging to $X$, then GGT, AAT, TAC, CAG, etc., belong to $X$. Thus, the reading frame of $F_1$ is associated with $w_1$, i.e., the first letter A of $w_1$ is the 3rd letter of a trinucleotide of $X$: the reading frame of the sequence $s$ is retrieved: ...A,GGT,AAT,TAC,CAG,... (the comma showing the reading frame).
(iii) It can be proved mathematically that a window of 13 nucleotides always retrieves the reading frame with the circular code $X$. Four framing motifs F need a window of 13 nucleotides with the circular code $X$ as they are the four longest ambiguous words of length $l$ = 12 nucleotides: $F_1$ = AGGTAATTACCA, $F_2$ = AGGTAATTACCT (with $w_2$, the first two letters AG are suffix of CAG and GAG belonging to $X$, and the last letter T is prefix of TAC and TTC belonging to $X$), $F_3$ = TGGTAATTACCA (with $w_2$, the first two letters TG are suffix of CTG belonging to $X$, and the last letter A is prefix of five trinucleotides of $X$) and $F_4$ = TGGTAATTACCT (with $w_2$, the first two letters TG are suffix of CTG belonging to $X$, and the last letter T is prefix of TAC and TTC belonging to $X$). These four framing motifs F contain the two longest ambiguous words of length $l$ = 11 nucleotides starting with a trinucleotide of $X$, i.e., when the suffixes of $X$ are not considered: GGTAATTACCA and GGTAATTACCT (see last row in Table 1 in [21]).
(iv) For all the other framing motifs F of the circular code $X$, i.e., different from $F_1$, $F_2$, $F_3$ and $F_4$, the window for retrieving the reading frame is less than 13 nucleotides (see the growth function of the window as a function of the number of nucleotides in Figure 4 in [21]). Recall also that any motif of the circular code $X$ is framing, i.e., it has the property of reading frame retrieval.
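The frame retrieval of Example 2 can be sketched as follows. This Python code is only an illustration, not the method of [21]: the set `X` is the circular code of 20 trinucleotides reported in [8] (listed here as an assumption of the sketch, although it agrees with every prefix and suffix statement of Example 2), and the names `frame_is_possible` and `possible_frames` are hypothetical. A decomposition shifted by 0, 1 or 2 letters is kept when its leading letters are a suffix of a trinucleotide of X, its complete trinucleotides belong to X, and its trailing letters are a prefix of a trinucleotide of X.

```python
# Circular code X of 20 trinucleotides (assumed from [8]).
X = {"AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
     "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC"}


def frame_is_possible(word, shift):
    """Check whether the decomposition of word shifted by `shift` letters fits X."""
    head, body = word[:shift], word[shift:]
    # Leading letters must end some trinucleotide of X.
    if head and not any(t.endswith(head) for t in X):
        return False
    full = len(body) - len(body) % 3
    # Every complete trinucleotide must belong to X.
    if any(body[k:k + 3] not in X for k in range(0, full, 3)):
        return False
    # Trailing letters must start some trinucleotide of X.
    tail = body[full:]
    return not tail or any(t.startswith(tail) for t in X)


def possible_frames(word):
    """Return the shifts (0, 1, 2) whose decomposition is compatible with X."""
    return [shift for shift in (0, 1, 2) if frame_is_possible(word, shift)]


print(possible_frames("AGGTAATTACCA"))   # window of 12 nucleotides
print(possible_frames("AGGTAATTACCAG"))  # window of 13 nucleotides
```

In this sketch the word of 12 nucleotides remains compatible with two decompositions, while adding the 13th nucleotide leaves only the decomposition shifted by one letter.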
A single-frame motif SF has no trinucleotide in reading frame that occurs in a shifted frame, i.e., the trinucleotide decoding is unambiguous in the two 5′–3′ and 3′–5′ directions. Formally, with $T = \{t_1, \ldots, t_n\}$ the set of trinucleotides of an n-motif in reading frame and $T_f = \{t^f_1, \ldots, t^f_{n-1}\}$ its set of trinucleotides in the shifted frame $f \in \{1,2\}$:

Definition 9.
A single-frame n-motif SF (unambiguous trinucleotide decoding in the two 5′–3′ and 3′–5′ directions) is an n-motif such that $T \cap T_f = \emptyset$ for all $f \in \{1,2\}$.

Figure: reading frame and shifted frames of the dicodon AAACAA, which is single-frame SF.

A multiple-frame motif MF, in contrast to a SF motif, has at least one trinucleotide $t$ in reading frame that occurs in a shifted frame $f$. Formally:

Definition 10.
A multiple-frame n-motif MF (ambiguous trinucleotide decoding in at least one direction) is an n-motif such that $T \cap T_f \neq \emptyset$ for at least one $f \in \{1,2\}$, i.e., $\exists i \in \{1, \ldots, n\} \wedge \exists j \in \{1, \ldots, n-1\} \wedge \exists f \in \{1,2\}: t_i = t^f_j$.

The unidirectional multiple-frame motifs UMF belong to a class of MF motifs where all the trinucleotides $t^f$ in a shifted frame $f$ occur only before (3′UMF: 3′–5′ direction) or only after (5′UMF: 5′–3′ direction) the trinucleotides $t$ in reading frame. Formally:

Definition 11.
A unidirectional multiple-frame n-motif 3′UMF (ambiguous trinucleotide decoding in the 3′–5′ direction only) is an MF n-motif such that $i > j$ for every equality $t_i = t^f_j$ with $i \in \{1, \ldots, n\}$, $j \in \{1, \ldots, n-1\}$ and $f \in \{1,2\}$.

Example 4. Let the dicodon be AACACA. The trinucleotides in reading frame are $t_1$ = AAC and $t_2$ = ACA, leading to $T$ = {AAC, ACA}. The single trinucleotide in the shifted frame 1 is $t^1_1$ = ACA, leading to $T_1$ = {ACA}. The single trinucleotide in the shifted frame 2 is $t^2_1$ = CAC, leading to $T_2$ = {CAC}. As $T \cap T_1$ = {ACA} $\neq \emptyset$, AACACA is a multiple-frame dicodon MF. Furthermore, as $t_2 = t^1_1$ = ACA yields the inequality 2 > 1, as $t_1$ = AAC $\neq t^1_1$ = ACA and as $t_1$ = AAC $\neq t^2_1$ = CAC, AACACA is a unidirectional multiple-frame dicodon 3′UMF (Figure 3).

Definition 12.
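A unidirectional multiple-frame n-motif 5′UMF (ambiguous trinucleotide decoding in the 5′–3′ direction only) is an MF n-motif such that $i \leq j$ for every equality $t_i = t^f_j$ with $i \in \{1, \ldots, n\}$, $j \in \{1, \ldots, n-1\}$ and $f \in \{1,2\}$.

Example 5. Let the dicodon be AAAAAC. The trinucleotides in reading frame are $t_1$ = AAA and $t_2$ = AAC, leading to $T$ = {AAA, AAC}. The trinucleotides in the shifted frames 1 and 2 are $t^1_1 = t^2_1$ = AAA, leading to the trinucleotide sets $T_1 = T_2$ = {AAA}. As $T \cap T_1$ = {AAA} $\neq \emptyset$ and $T \cap T_2$ = {AAA} $\neq \emptyset$, AAAAAC is a multiple-frame dicodon MF. Furthermore, as $t_1 = t^1_1 = t^2_1$ = AAA yields the two inequalities 1 ≤ 1 and as $t_2$ = AAC $\notin T_1 \cup T_2$, AAAAAC is a unidirectional multiple-frame dicodon 5′UMF (Figure 4).

Figure 4 (associated with Example 5). The dicodon AAAAAC is unidirectional multiple-frame 5′UMF.

Example 6. Let the dicodon be ACACAA. The trinucleotides in reading frame are $t_1$ = ACA and $t_2$ = CAA, leading to $T$ = {ACA, CAA}. The single trinucleotide in the shifted frame 1 is $t^1_1$ = CAC, leading to $T_1$ = {CAC}. The single trinucleotide in the shifted frame 2 is $t^2_1$ = ACA, leading to $T_2$ = {ACA}. As $T \cap T_2$ = {ACA} $\neq \emptyset$, ACACAA is a multiple-frame dicodon MF. Furthermore, as $t_1 = t^2_1$ = ACA yields the inequality 1 ≤ 1, as $t_2$ = CAA $\neq t^1_1$ = CAC and as $t_2$ = CAA $\neq t^2_1$ = ACA, ACACAA is a unidirectional multiple-frame dicodon 5′UMF (Figure 5). The reasoning could be immediate by noting that the dicodon ACACAA is the mirror of AACACA (compare with Example 4).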

Definition 13.
A bidirectional multiple-frame n-motif BMF (ambiguous trinucleotide decoding in the two 5′–3′ and 3′–5′ directions) is both a 5′UMF and 3′UMF n-motif.

Figure 6 (associated with Example 7). The dicodon AAAAAA is bidirectional multiple-frame BMF.
Example 8. Let the dicodon be ACACAC. The trinucleotides in reading frame are $t_1$ = ACA and $t_2$ = CAC, leading to $T$ = {ACA, CAC}. The single trinucleotide in the shifted frame 1 is $t^1_1$ = CAC, leading to $T_1$ = {CAC}. The single trinucleotide in the shifted frame 2 is $t^2_1$ = ACA, leading to $T_2$ = {ACA}. As $T \cap T_1$ = {CAC} $\neq \emptyset$ and $T \cap T_2$ = {ACA} $\neq \emptyset$, ACACAC is a multiple-frame dicodon MF. Furthermore, as $t_1 = t^2_1$ = ACA yields the inequality 1 ≤ 1 and as $t_2 = t^1_1$ = CAC yields the inequality 2 > 1, ACACAC is a bidirectional multiple-frame dicodon BMF (Figure 7).

In this paper, by varying $n \in \mathbb{N}^*$, we will investigate two distributions: the single-frame n-motifs SF with an unambiguous trinucleotide decoding in the two 5′–3′ and 3′–5′ directions (see Definition 9), and the 5′ unambiguous n-motifs 5′U with an unambiguous trinucleotide decoding in the 5′–3′ direction only, which are defined formally as follows:

Definition 14.
A 5′ unambiguous n-motif 5′U (unambiguous trinucleotide decoding in the 5′–3′ direction only) is either a SF n-motif or a 3′UMF n-motif, i.e., neither a 5′UMF n-motif nor a BMF n-motif.

For $n \in \mathbb{N}^*$, the definitions give the obvious relations: every n-motif is either SF or MF, and every SF n-motif is a 5′U n-motif. The occurrence probability PbSFM(n) of single-frame n-motifs SF will be computed as the number of SF n-motifs divided by the total number $64^n$ of n-motifs (Equation (3)). Similarly, the occurrence probability Pb5′UM(n) of 5′ unambiguous n-motifs 5′U will be computed as the number of 5′U n-motifs divided by $64^n$ (Equation (4)).

Remark 1. Obviously, Pb5′UM(n) ≥ PbSFM(n) whatever n, with equality in the trivial case n = 1. However, it will be interesting to compare these two probability distributions by varying n.

Single-Frame 1-Motifs
This is a trivial case. Each of the 64 codons (1-motifs, n = 1) is obviously a single-frame motif SF, by definition (non-existence of a shifted frame). Thus, the probabilities of SF and 5′U 1-motifs are equal to PbSFM(1) = Pb5′UM(1) = 1.

Remark 2.
For n ≥ 3, the 3′UMF and 5′UMF n-motifs can have two different shifted trinucleotides in the two frames 1 and 2, in contrast to the 2-motifs (see Tables 2 and 3). For example, with the tricodon AACAAAACC, the trinucleotides in reading frame are $t_1$ = AAC, $t_2$ = AAA and $t_3$ = ACC, leading to $T$ = {AAA, AAC, ACC}. The trinucleotides in the shifted frame 1 are $t^1_1$ = ACA and $t^1_2$ = AAA, leading to $T_1$ = {AAA, ACA}. The trinucleotides in the shifted frame 2 are $t^2_1$ = CAA and $t^2_2$ = AAC, leading to $T_2$ = {AAC, CAA}. As $T \cap T_1$ = {AAA} $\neq \emptyset$ and $T \cap T_2$ = {AAC} $\neq \emptyset$, AACAAAACC is a multiple-frame tricodon MF. Furthermore, as $t_1 = t^2_2$ = AAC yields the inequality 1 ≤ 2, as $t_2 = t^1_2$ = AAA yields the inequality 2 ≤ 2 and as $t_3$ = ACC $\notin T_1 \cup T_2$, AACAAAACC is a unidirectional multiple-frame tricodon 5′UMF with two different trinucleotides in the two frames 1 and 2, i.e., AAA in frame 1 and AAC in frame 2.

Single-Frame n-Motifs
The determination of the probability PbSFM(n) of single-frame n-motifs SF for n ≥ 3 (tricodons, tetracodons, etc.) cannot be done by hand. For n ∈ {3, . . . , 6} (tricodons up to hexacodons), exact values of the probability PbSFM(n) can be obtained by computer calculation (see Table 4). For n = 6, the computation of SF motifs among the $64^6$ = 68,719,476,736 hexacodons with a parallel program with 8 threads takes about 7 days on a standard PC. For n ≥ 7 (heptacodons, octocodons, etc.), the probability PbSFM(n) is obtained by computer simulation. Simulated values of PbSFM(n) are obtained by generating 1,000,000 random n-motifs for each n. In order to evaluate this approach by computer simulation, simulated values of PbSFM(n) for n ∈ {2, . . . , 6} are also given in Table 4. Exact and simulated values of PbSFM(n) agree to within $10^{-3}$, demonstrating the reliability of the simulation approach. The probability Pb5′UM(n) of 5′ unambiguous n-motifs 5′U for n ≥ 3 is computed similarly.
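Both approaches described above, exhaustive enumeration and random sampling, can be sketched in a few lines. The Python code below is a minimal illustration and not the parallel program used for Table 4: the function names, the sample size of 20,000 and the fixed seed are assumptions of this sketch, and random n-motifs are assumed to be drawn uniformly among the $64^n$ possibilities.

```python
import random
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def is_single_frame(motif):
    """True if no reading-frame trinucleotide occurs in shifted frame 1 or 2."""
    n = len(motif) // 3
    reading = {motif[3 * i:3 * i + 3] for i in range(n)}
    shifted = {motif[3 * j + f:3 * j + f + 3]
               for j in range(n - 1) for f in (1, 2)}
    return reading.isdisjoint(shifted)


def exact_pb_sfm(n):
    """Exact proportion of SF n-motifs by enumerating all 64**n motifs."""
    hits = sum(is_single_frame("".join(c)) for c in product(CODONS, repeat=n))
    return hits / 64 ** n


def simulated_pb_sfm(n, samples=20000, seed=0):
    """Estimated proportion of SF n-motifs from uniformly random n-motifs."""
    rng = random.Random(seed)
    hits = sum(is_single_frame("".join(rng.choices(CODONS, k=n)))
               for _ in range(samples))
    return hits / samples


print(exact_pb_sfm(2))      # 3888 SF dicodons among 4096, i.e. about 94.9%
print(simulated_pb_sfm(2))  # estimate close to the exact value
```

The enumeration cost grows as $64^n$, which is why this kind of exhaustive loop becomes impractical beyond a few trinucleotides and sampling takes over.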
While the proportion of multiple-frame 2-motifs MF (Definition 10) is minimal (5.1% = 100% − 94.9% for dicodons, Section 2.6), Figure 8 shows that their propagation drastically reduces the proportion of SF n-motifs when the trinucleotide length n increases. There are almost no more SF motifs with a length of 14 trinucleotides (PbSFM(14) < 1%) and the number of MF motifs is already higher than the number of SF motifs at a length of six trinucleotides (Figure 8).
Thus, only short genes, i.e., with up to five trinucleotides, have a higher proportion of single-frame motifs compared to the multiple-frame motifs. Thus, primitive translation, without the extant complex ribosome, could only generate short peptides without frameshift errors.

5′ Unambiguous Motifs
I then compared the probability PbSFM(n) (Equation (3)) of single-frame n-motifs SF (Definition 9) and the probability Pb5′UM(n) (Equation (4)) of 5′ unambiguous n-motifs 5′U (Definition 14). Figure 9 shows the decreasing probability Pb5′UM(n) of 5′U n-motifs when the trinucleotide length n increases. As expected (see Remark 1), its decrease is slower than that of SF n-motifs. There are almost no more 5′U motifs with a length of 20 trinucleotides (Pb5′UM(20) < 1%). Thus with the 5′U motifs, there is a length increase of 20 − 14 = 6 trinucleotides in the trinucleotide decoding. The maximum probability difference Pb5′UM(n) − PbSFM(n) is 22.0% at length n = 8 trinucleotides.
The 5′ unambiguous n-motifs, a less restrictive class of motifs with an unambiguous trinucleotide decoding in the 5′–3′ direction only, can generate slightly longer peptides without frameshift errors compared to the single-frame motifs.
I now evaluate the single-frame motifs SF and the 5′ unambiguous motifs 5′U with constraints.

Single-Frame and 5 Unambiguous Motifs with Initiation and Stop Codons
The single-frame n-motifs SF and the 5′ unambiguous n-motifs 5′U are investigated with an initiation codon ATG and a stop codon from {TAA, TAG, TGA}. The case n = 1 does not exist. For n = 2, there are only three dicodons: ATGTAA, ATGTAG and ATGTGA, which are all obviously SF. Thus, the probabilities of SF and 5′U 2-motifs are obviously PbSFM(2) = Pb5′UM(2) = 1. Figure 10 shows that the proportions of SF and 5′U motifs with initiation and stop codons are lower than their respective non-constrained motifs.
Genes with initiation and stop codons do not increase translation fidelity compared to non-constrained genes (according to this approach).
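The constrained enumeration can be sketched as follows. This Python code is only an illustration under a stated assumption: the n − 2 internal codons are taken among all 64 codons (the source does not say whether internal stop codons are excluded). The names `is_single_frame` and `pb_sfm_start_stop` are hypothetical.

```python
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
START = "ATG"
STOPS = ("TAA", "TAG", "TGA")


def is_single_frame(motif):
    """True if no reading-frame trinucleotide occurs in shifted frame 1 or 2."""
    n = len(motif) // 3
    reading = {motif[3 * i:3 * i + 3] for i in range(n)}
    shifted = {motif[3 * j + f:3 * j + f + 3]
               for j in range(n - 1) for f in (1, 2)}
    return reading.isdisjoint(shifted)


def pb_sfm_start_stop(n):
    """Proportion of SF n-motifs (n >= 2) starting with ATG and ending with a stop codon.

    Assumption of this sketch: internal codons range over all 64 codons.
    """
    hits = total = 0
    for inner in product(CODONS, repeat=n - 2):
        for stop in STOPS:
            total += 1
            hits += is_single_frame(START + "".join(inner) + stop)
    return hits / total


print(pb_sfm_start_stop(2))  # the three dicodons ATGTAA, ATGTAG and ATGTGA
print(pb_sfm_start_stop(3))
```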
Figure 10. Decreasing probability PbSFM(n) (Equation (3)) of single-frame n-motifs SF (blue curve from Figure 8) and decreasing probability Pb5′UM(n) (Equation (4)) of 5′ unambiguous n-motifs 5′U (cyan curve from Figure 9) by varying the length n between 1 and 10 trinucleotides. With initiation and stop codons, decreasing probability PbSFM(n) of n-motifs SF (magenta curve) and decreasing probability Pb5′UM(n) of n-motifs 5′U (orange curve) by varying the length n between 2 and 10 trinucleotides.

Single-Frame and 5′ Unambiguous Motifs without Periodic Codons
The single-frame motifs SF and the 5′ unambiguous motifs 5′U are now studied without periodic codons {AAA, CCC, GGG, TTT}. As expected, Figure 11 shows that the proportions of SF and 5′U motifs without periodic codons are higher than their respective non-constrained motifs.
Genes without periodic codons slightly increase frame translation fidelity compared to non-constrained genes (according to this approach).
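For dicodons, the effect of removing the periodic codons can be checked exactly. The Python sketch below is only an illustration (the names `is_single_frame` and `pb_sfm` are hypothetical): it enumerates the dicodons built from all 64 codons and from the 60 non-periodic codons, and compares the two proportions of SF dicodons.

```python
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
# Non-periodic codons: all codons except AAA, CCC, GGG and TTT.
NON_PERIODIC = [c for c in CODONS if len(set(c)) > 1]


def is_single_frame(motif):
    """True if no reading-frame trinucleotide occurs in shifted frame 1 or 2."""
    n = len(motif) // 3
    reading = {motif[3 * i:3 * i + 3] for i in range(n)}
    shifted = {motif[3 * j + f:3 * j + f + 3]
               for j in range(n - 1) for f in (1, 2)}
    return reading.isdisjoint(shifted)


def pb_sfm(codons, n):
    """Exact proportion of SF n-motifs built from the given codon list."""
    motifs = ["".join(c) for c in product(codons, repeat=n)]
    return sum(is_single_frame(m) for m in motifs) / len(motifs)


print(pb_sfm(CODONS, 2))        # 3888 / 4096, about 94.9%
print(pb_sfm(NON_PERIODIC, 2))  # 3516 / 3600, about 97.7%
```

The gain comes from the removal of dicodons such as AAAAxy or xyCCCC, in which a periodic codon is necessarily repeated in a shifted frame.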

Single-Frame and 5′ Unambiguous Motifs with Antiparallel Complementarity
The single-frame 2n-motifs SF and the 5′ unambiguous 2n-motifs 5′U are now investigated with the following antiparallel complementary sequence: $t_1 t_2 \cdots t_n \mathcal{C}(t_n) \cdots \mathcal{C}(t_2) \mathcal{C}(t_1)$ where the trinucleotide antiparallel complementarity map $\mathcal{C}$ applied to a trinucleotide $t$ is recalled in Definition 1. As an example, if $t_1 t_2 t_3$ = ACGTGCAAT then the antiparallel complementary sequence studied is ACGTGCAATATTGCACGT. Note that the trinucleotide length of such motifs is even. Classical antiparallel complementary structures are the DNA double helix and the RNA stem. Interesting results are observed. As expected, the two probability curves PbSFM(n) of SF motifs and Pb5′UM(n) of 5′U motifs with antiparallel complementarity are identical (Figure 12). The proof is based on the following property: if $t_i = t^f_j$ with $i > j$ (3′UMF motif) then $\mathcal{C}(t_i) = t_{i'} = \mathcal{C}(t^f_j) = t^{f'}_{j'}$ with $i' \leq j'$ (5′UMF motif) and $f' \neq f$. Furthermore, antiparallel complementarity increases the proportion of SF motifs but decreases the proportion of 5′U motifs, compared to their respective non-constrained motifs.

Figure 11. Decreasing probability PbSFM(n) (Equation (3)) of single-frame n-motifs SF (blue curve from Figure 8) and decreasing probability Pb5′UM(n) (Equation (4)) of 5′ unambiguous n-motifs 5′U (cyan curve from Figure 9) by varying the length n between 1 and 10 trinucleotides. Without periodic codons {AAA, CCC, GGG, TTT}, decreasing probability PbSFM(n) of n-motifs SF (magenta curve) and decreasing probability Pb5′UM(n) of n-motifs 5′U (orange curve) by varying the length n between 1 and 10 trinucleotides.

Figure 12. Decreasing probability PbSFM(n) (Equation (3)) of single-frame n-motifs SF (blue curve from Figure 8) and decreasing probability Pb5′UM(n) (Equation (4)) of 5′ unambiguous n-motifs 5′U (cyan curve from Figure 9) by varying the length n between 1 and 14 trinucleotides. With antiparallel complementarity, decreasing probabilities PbSFM(n) and Pb5′UM(n) of 2n-motifs SF and 5′U (two identical curves in magenta) by varying the length n between 1 and 7 trinucleotides.
The "antiparallel complementary" genes have a higher proportion of single-frame motifs compared to the non-complementary genes. Thus, primitive translation associated with a DNA property could generate a greater number of peptides without frameshift errors.

Single-Frame and 5′ Unambiguous Motifs with Parallel Complementarity
The single-frame 2n-motifs SF and the 5′ unambiguous 2n-motifs 5′U are now analysed with the following parallel complementary sequence: $t_1 t_2 \cdots t_n D(t_1) D(t_2) \cdots D(t_n)$ where the trinucleotide parallel complementarity map D applied to a trinucleotide t is recalled in Definition 1. As an example, if $t_1 t_2 t_3$ = ACGTGCAAT then the parallel complementary sequence studied is ACGTGCAATTGCACGTTA. Note that the trinucleotide length of such motifs is also even. The two probability curves PbSFM(n) of SF motifs with parallel complementarity and Pb5′UM(n) of 5′U motifs without constraints are superposable (Figure 13).
Parallel complementarity increases the proportions of both SF motifs and 5′U motifs compared to their respective non-constrained motifs.
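The antiparallel and parallel constructions can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: Definition 1 is not reproduced in this section, so the sketch assumes that C is the reverse complement of a trinucleotide and D its complement without reversal, which is consistent with the two worked examples on ACGTGCAAT. The helper names `split_trinucleotides`, `antiparallel_sequence` and `parallel_sequence` are hypothetical.

```python
# Watson-Crick pairing on the DNA alphabet {A, C, G, T}.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def split_trinucleotides(seq):
    """Cut a sequence into its reading-frame trinucleotides t_1, ..., t_n."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def C(t):
    """Antiparallel complementarity map (assumed): complement, then reverse."""
    return t.translate(COMPLEMENT)[::-1]


def D(t):
    """Parallel complementarity map (assumed): complement only."""
    return t.translate(COMPLEMENT)


def antiparallel_sequence(seq):
    """t_1 ... t_n C(t_n) ... C(t_1)"""
    ts = split_trinucleotides(seq)
    return "".join(ts) + "".join(C(t) for t in reversed(ts))


def parallel_sequence(seq):
    """t_1 ... t_n D(t_1) ... D(t_n)"""
    ts = split_trinucleotides(seq)
    return "".join(ts) + "".join(D(t) for t in ts)


print(antiparallel_sequence("ACGTGCAAT"))  # ACGTGCAATATTGCACGT
print(parallel_sequence("ACGTGCAAT"))      # ACGTGCAATTGCACGTTA
```

Both outputs have an even trinucleotide length (here 6), as noted in the text for 2n-motifs.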
Figure 13. Decreasing probability PbSFM(n) (Equation (3)) of single-frame n-motifs SF (blue curve from Figure 8) and decreasing probability Pb5′UM(n) (Equation (4)) of 5′ unambiguous n-motifs 5′U (cyan curve from Figure 9) by varying the length n between 1 and 14 trinucleotides. With parallel complementarity, decreasing probability PbSFM(n) of 2n-motifs SF (magenta curve) and decreasing probability Pb5′UM(n) of 2n-motifs 5′U (orange curve) by varying the length n between 1 and 7 trinucleotides.

"Parallel complementary" genes have a slightly higher proportion of single-frame motifs compared to the "antiparallel complementary" genes (compare the magenta curves in Figures 12 and 13). The biological meaning is not yet explained.

Framing Motifs
There are framing motifs F which are single-frame SF or multiple-frame MF.

Proposition 1. A framing motif F can be single-frame SF.
Proof. Take the following motif m = GAACTCCCGATATGGCTC. The motif m can be generated by the code X = {ATA, CCG, CTC, GAA, TGG}. By Theorem 1, it is easy to verify that the graph G(X) is acyclic, and thus X is circular. Furthermore, the set of trinucleotides in reading frame is T = X, the set of trinucleotides in the shifted frame 1 is T1 = {AAC, CGA, GGC, TAT, TCC} and the set of trinucleotides in the shifted frame 2 is T2 = {ACT, ATG, CCC, GAT, GCT}. We have T ∩ T1 = ∅ and T ∩ T2 = ∅. Thus, the motif m is both framing F and single-frame SF.

Proposition 2. A framing motif F can be multiple-frame MF.
Proof. Take the following motif m = ATTGAGCGAGCCTGTCAG. The motif m can be generated by the code X = {ATT, CAG, CGA, GAG, GCC, TGT}. By Theorem 1, it is easy to verify that the graph G(X) is acyclic, and thus X is circular. Furthermore, we have the trinucleotide sets T = X, T1 = {AGC, CCT, GAG, GTC, TTG} and T2 = {AGC, CTG, GCG, TCA, TGA} leading to T ∩ T1 = {GAG} and T ∩ T2 = ∅. Thus, the motif m is both framing F and multiple-frame MF, more precisely unidirectional multiple-frame 5′UMF.
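The two proofs above can be checked mechanically. The following Python sketch recomputes the frame sets and tests circularity through the acyclicity of G(X). It rests on one assumption not stated in this section: that G(X) has, for each trinucleotide b1b2b3 of X, the two edges b1 → b2b3 and b1b2 → b3. The function names are hypothetical and chosen for this illustration only.

```python
def frame_trinucleotides(motif, frame):
    """Set of complete trinucleotides read from offset `frame` (0, 1 or 2)."""
    return {motif[i:i + 3] for i in range(frame, len(motif) - 2, 3)}


def is_single_frame(motif):
    """True if no reading-frame trinucleotide occurs in frame 1 or frame 2."""
    reading = frame_trinucleotides(motif, 0)
    shifted = frame_trinucleotides(motif, 1) | frame_trinucleotides(motif, 2)
    return not (reading & shifted)


def code_graph(code):
    """Assumed definition of G(X): edges b1 -> b2b3 and b1b2 -> b3."""
    edges = {}
    for t in code:
        edges.setdefault(t[0], set()).add(t[1:])
        edges.setdefault(t[:2], set()).add(t[2])
    return edges


def is_circular(code):
    """Circularity test via acyclicity of G(X), using a depth-first search."""
    edges = code_graph(code)
    state = {}  # vertex -> "open" (on the current DFS path) or "done"

    def has_cycle(v):
        state[v] = "open"
        for w in edges.get(v, ()):
            if state.get(w) == "open":
                return True
            if w not in state and has_cycle(w):
                return True
        state[v] = "done"
        return False

    return not any(v not in state and has_cycle(v) for v in edges)


m1 = "GAACTCCCGATATGGCTC"  # motif of Proposition 1
m2 = "ATTGAGCGAGCCTGTCAG"  # motif of Proposition 2

print(is_circular(frame_trinucleotides(m1, 0)), is_single_frame(m1))  # True True
print(is_circular(frame_trinucleotides(m2, 0)), is_single_frame(m2))  # True False
print(frame_trinucleotides(m2, 0) & frame_trinucleotides(m2, 1))      # {'GAG'}
```

As a sanity check on the graph criterion, the pair {ACG, CGA} of permuted trinucleotides yields the cycle A → CG → A and is therefore rejected, as is the periodic trinucleotide AAA.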
There are single-frame motifs SF or multiple-frame motifs MF which are not framing F.

Proposition 3. A single-frame motif SF can be non-framing (not F).
A simple parameter measuring the expansion intensity $I_e(m)$ of reading frame retrieval of a circular code motif m can be defined as follows:
$$I_e(m) = \frac{l(m)}{l_{max}(X)} \qquad (5)$$
where $l(m)$, $l(m) \geq 1$, is the trinucleotide length of the motif m and $l_{max}(X)$, $1 \leq l_{max}(X) \leq 8$, is the length of a longest path in the associated graph G(X) of a trinucleotide circular code $X \subseteq B^3$. Note that $\frac{1}{8} \leq I_e(m) \leq l(m)$ and if $l(m) \geq l_{max}(X)$ then $1 \leq I_e(m) \leq l(m)$.
A second parameter measuring both the expansion and variety intensity $I_{ev}(m)$ of a circular code motif m can also be defined as follows:
$$I_{ev}(m) = \mathrm{card}(T(m)) \times I_e(m) \qquad (6)$$
where $I_e(m)$ is defined in Equation (5) and $\mathrm{card}(T(m))$, $1 \leq \mathrm{card}(T(m)) \leq 20$, is the cardinality of the set $T(m)$ (Notation 2) of trinucleotides (in reading frame f = 0) of m. Note that $\frac{1}{8} \leq I_{ev}(m) \leq 20\,l(m)$ and if $l(m) \geq l_{max}(X)$ then $1 \leq I_{ev}(m) \leq 20\,l(m)$. Thus, for the circular code motifs m of a given trinucleotide length $l(m)$, the intensity $I_{ev}(m)$ of reading frame retrieval increases according to their cardinality $\mathrm{card}(T(m))$.
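For a sequence s containing several circular code motifs m, the formulas (5) and (6) can be expressed as follows:
$$I_e(s) = \sum_{m \in s} I_e(m), \qquad I_{ev}(s) = \sum_{m \in s} I_{ev}(m)$$
with the hypothesis that $l_{max}(X)$ is identical for the motifs m, a realistic case when the motifs m are obtained from a same studied trinucleotide circular code X, and thus:
$$I_e(s) = \frac{1}{l_{max}(X)} \sum_{m \in s} l(m), \qquad I_{ev}(s) = \frac{1}{l_{max}(X)} \sum_{m \in s} \mathrm{card}(T(m))\, l(m)$$
Note also that the formulas $I_e(s)$ and $I_{ev}(s)$ can be normalized in order to weight the different lengths of sequences s.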
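As an illustration of the two motif-level intensities, the Python sketch below evaluates them on the two motifs used in Propositions 1 and 2. It is a sketch under stated assumptions: G(X) is taken to have the edges b1 → b2b3 and b1b2 → b3 for each trinucleotide b1b2b3 of X, the path length is counted in edges (which matches the stated bound of 8), and the expansion and variety intensity is taken as the product card(T(m)) × I_e(m), which agrees with the bounds given in the text. All function names are hypothetical.

```python
def code_graph(code):
    """Assumed definition of G(X): edges b1 -> b2b3 and b1b2 -> b3."""
    edges = {}
    for t in code:
        edges.setdefault(t[0], set()).add(t[1:])
        edges.setdefault(t[:2], set()).add(t[2])
    return edges


def longest_path_length(code):
    """Number of edges on a longest path of G(X).

    Only valid for a circular code: the graph must be acyclic,
    otherwise the recursion below does not terminate.
    """
    edges = code_graph(code)
    memo = {}

    def depth(v):
        if v not in memo:
            memo[v] = max((1 + depth(w) for w in edges.get(v, ())), default=0)
        return memo[v]

    return max(depth(v) for v in edges)


def expansion_intensity(motif, code):
    """I_e(m) = l(m) / l_max(X), with l(m) the trinucleotide length of m."""
    return (len(motif) // 3) / longest_path_length(code)


def expansion_variety_intensity(motif, code):
    """I_ev(m) = card(T(m)) * I_e(m) (assumed form of Equation (6))."""
    reading = {motif[i:i + 3] for i in range(0, len(motif), 3)}
    return len(reading) * expansion_intensity(motif, code)


X1 = {"ATA", "CCG", "CTC", "GAA", "TGG"}
X2 = {"ATT", "CAG", "CGA", "GAG", "GCC", "TGT"}
m1 = "GAACTCCCGATATGGCTC"
m2 = "ATTGAGCGAGCCTGTCAG"

print(longest_path_length(X1), expansion_intensity(m1, X1), expansion_variety_intensity(m1, X1))  # 2 3.0 15.0
print(longest_path_length(X2), expansion_intensity(m2, X2), expansion_variety_intensity(m2, X2))  # 4 1.5 9.0
```

Both motifs have the same trinucleotide length (6), yet the first has the higher expansion intensity because its code has a shorter longest path (for example AT → A → TA, against GC → C → GA → G → AG in the second code).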
Five dipeptides GlyAla, GlyVal, PheSer, ProLeu and ProArg are the most strongly coded, each by five MF dicodons (Table 12), e.g., GlyAla is coded by one 3′UMF dicodon GGCGCG (Table 7), and four 5′UMF dicodons GGGGCA, GGGGCC, GGGGCG and GGGGCT (Table 9). The SF and MF dipeptides could have particular spatial structures and biological functions in extant and primitive proteins which remain to be identified.

Table 11. Multi-frame dipeptide boolean matrix. The 114 = 121 − 4 − 3 MF dipeptides, the four pairs (stop codon, amino acid) and the three pairs (amino acid, stop codon) coded by the 208 = 16 + 2 × 96 multiple-frame dicodons BMF (Definition 13, Table 1), 3′UMF (Definition 11, Table 2) and 5′UMF (Definition 12, Table 3). The rows and columns are associated with the first and second amino acid, respectively, in the dipeptide. The value of 1 means a MF dipeptide coded by at least one multiple-frame dicodon MF (MF true). The value of 0 stands for a SF dipeptide coded by a single-frame dicodon SF (MF false). For example, the value of AlaCys is 0 (absent in Tables 5, 7 and 9) and the value of CysAla is 1 (7th row in Table 7).

[Table body not recovered. Column order (2nd site): Ala, Cys, Asp, Glu, Phe, Gly, His, Ile, Lys, Leu, Met, Asn, Pro, Gln, Arg, Ser, Thr, Val, Trp, Tyr, Ter, Sum.]

Table 12. Multi-frame dipeptide occurrence matrix. The 114 = 121 − 4 − 3 MF dipeptides, the four pairs (stop codon, amino acid) and the three pairs (amino acid, stop codon) coded by the 208 = 16 + 2 × 96 multiple-frame dicodons BMF (Definition 13, Table 1), 3′UMF (Definition 11, Table 2) and 5′UMF (Definition 12, Table 3). The rows and columns are associated with the first and second amino acid, respectively, in the dipeptide. The values between 1 and 5 give the number of times a MF dipeptide is coded by multiple-frame dicodons MF. The value of 0 stands for a SF dipeptide coded by a single-frame dicodon SF. For example, the value of AlaCys is 0 (absent in Tables 5, 7 and 9), the value of CysAla is 1 (7th row in Table 7) and the value of AlaArg is 4 (one occurrence: 1st row in Table 5 and three occurrences: 1st row in Table 9).
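The count of 208 multiple-frame dicodons and the GlyAla example can be reproduced by brute-force enumeration. The Python sketch below uses only the definition of a single-frame motif given in the abstract (no reading-frame trinucleotide occurs in a shifted frame). It does not separate the BMF, 3′UMF and 5′UMF classes, whose definitions are not reproduced in this section. The function name is hypothetical. The codon assignments Gly = GGN and Ala = GCN are those of the standard genetic code.

```python
from itertools import product


def is_multiple_frame_dicodon(dicodon):
    """True if a reading-frame trinucleotide also occurs in frame 1 or 2."""
    reading = {dicodon[0:3], dicodon[3:6]}
    # A dicodon contains exactly one complete trinucleotide per shifted frame.
    return dicodon[1:4] in reading or dicodon[2:5] in reading


all_dicodons = ["".join(p) for p in product("ACGT", repeat=6)]
mf_dicodons = [d for d in all_dicodons if is_multiple_frame_dicodon(d)]
print(len(all_dicodons), len(mf_dicodons))  # 4096 208

# Standard genetic code: glycine is GGN, alanine is GCN.
GLY = ["GGA", "GGC", "GGG", "GGT"]
ALA = ["GCA", "GCC", "GCG", "GCT"]
gly_ala_mf = sorted(g + a for g in GLY for a in ALA
                    if is_multiple_frame_dicodon(g + a))
print(gly_ala_mf)  # ['GGCGCG', 'GGGGCA', 'GGGGCC', 'GGGGCG', 'GGGGCT']
```

Of the 16 dicodons coding GlyAla, exactly the five listed in the text are multiple-frame, which matches the value 5 reported for this dipeptide.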

Discussion
For the first time to our knowledge, new definitions of motifs in genes are presented. The single-frame motifs SF (unambiguous trinucleotide decoding in the two 5′–3′ and 3′–5′ directions) and the multiple-frame motifs MF (ambiguous trinucleotide decoding in at least one direction) form a partition of genes. Several classes of MF motifs are defined and analysed: (i) unidirectional multiple-frame motifs 3′UMF (ambiguous trinucleotide decoding in the 3′–5′ direction only); (ii) unidirectional multiple-frame motifs 5′UMF (ambiguous trinucleotide decoding in the 5′–3′ direction only); and (iii) bidirectional multiple-frame motifs BMF (ambiguous trinucleotide decoding in the two 5′–3′ and 3′–5′ directions). The distributions of the single-frame motifs SF and the 5′ unambiguous motifs 5′U (unambiguous trinucleotide decoding in the 5′–3′ direction only) are studied without and with constraints.
The proportion of SF motifs decreases drastically with their trinucleotide length. The SF motifs become nearly absent (<1%) at lengths ≥14 trinucleotides, and the number of MF motifs already exceeds the number of SF motifs at lengths ≥6 trinucleotides. As expected, the proportion of 5′U motifs decreases more slowly than that of SF motifs. The 5′U motifs become nearly absent (<1%) at lengths ≥20 trinucleotides. Thus with the 5′U motifs, there is a length increase of 20 − 14 = 6 trinucleotides in the trinucleotide decoding.
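The decay of the SF proportion with length can be explored numerically. The Python sketch below is a Monte Carlo estimate, not an evaluation of Equation (3), which is not reproduced in this section. It assumes that every motif of n trinucleotides is equally likely (independent, uniformly drawn nucleotides) and uses the definition of a single-frame motif from the abstract. The function names, the number of trials and the seed are arbitrary choices made for this illustration.

```python
import random


def is_single_frame(motif):
    """True if no reading-frame trinucleotide occurs in frame 1 or frame 2."""
    reading = {motif[i:i + 3] for i in range(0, len(motif) - 2, 3)}
    shifted = {motif[i:i + 3] for f in (1, 2)
               for i in range(f, len(motif) - 2, 3)}
    return not (reading & shifted)


def sf_fraction(n, trials=20000, seed=0):
    """Estimated proportion of SF motifs among random motifs of n trinucleotides."""
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    hits = 0
    for _ in range(trials):
        motif = "".join(rng.choices("ACGT", k=3 * n))
        hits += is_single_frame(motif)
    return hits / trials


for n in (1, 2, 6, 14):
    print(n, sf_fraction(n))
```

For n = 1 the estimate is exactly 1, since a single trinucleotide has no complete shifted-frame trinucleotide. For n = 2 it should be close to 1 − 208/4096, the exact proportion of single-frame dicodons. Estimates for larger n depend on the uniform model assumed here and on the sample size.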
The proportions of SF and 5′U motifs with initiation and stop codons are lower than their respective non-constrained motifs. In contrast, their proportions in motifs without periodic codons {AAA, CCC, GGG, TTT} are higher than their respective non-constrained motifs. The proportions of SF and 5′U motifs with antiparallel complementarity are identical. Antiparallel complementarity increases the proportion of SF motifs but decreases the proportion of 5′U motifs, compared to their respective non-constrained motifs. The proportions of SF motifs with parallel complementarity and 5′U motifs without constraints follow a similar distribution. Finally, parallel complementarity increases the proportions of both SF motifs and 5′U motifs compared to their respective non-constrained motifs. Taken together, these results suggest that the complementarity property involved in the antiparallel (DNA double helix, RNA stem) and parallel sequences could also be fundamental for coding genes with unambiguous trinucleotide decoding, strictly in the two 5′–3′ and 3′–5′ directions (SF motifs) or conserved in the 5′–3′ direction but relaxed or lost in the 3′–5′ direction (5′U motifs).
The single-frame motifs SF with a property of trinucleotide decoding and the framing motifs F with a property of reading frame decoding could have operated in the primitive soup for constructing the modern genetic code and the extant genes [31]. They could have been involved in the stage without anticodon-amino acid interactions to form peptides from prebiotic amino acids [32]. They could also have been involved in the Implicated Site Nucleotides (ISN) of RNA interacting with the amino acids at the primitive step of life (review in [33]). According to a great number of biological experiments, the ISN structure contains nucleotides in fixed and variable positions, as well as an important trinucleotide for interacting with the amino acid (see, e.g., the recent review in [34]). However, the general structure of the aptamers binding amino acids, in particular its nucleotide length, its amino acid binding loop and its nucleotide position, is still an open problem. Similar arguments could hold for the ribonucleopeptides which could be implicated in a primitive T box riboswitch functioning as an aminoacyl-tRNA synthetase and a peptidyl-transferase ribozyme [35]. The single-frame motifs SF and the framing motifs F, with their properties of decoding the trinucleotides and the reading frame, could have been necessary for the evolutionary construction of the modern genetic code.