Article
Single-Frame, Multiple-Frame and Framing Motifs in Genes
Theoretical Bioinformatics, ICube, CNRS, University of Strasbourg, 300 Boulevard Sébastien Brant, 67400 Illkirch, France
Received: 17 November 2018 / Accepted: 31 January 2019 / Published: 10 February 2019

Abstract

We study the distribution of new classes of motifs in genes, a research field that has not been investigated to date. A single-frame motif SF has no trinucleotide in reading frame (frame 0) that occurs in a shifted frame (frame 1 or 2), e.g., the dicodon AAACAA is SF as the trinucleotides AAA and CAA do not occur in a shifted frame. A motif which is not single-frame SF is multiple-frame MF. Several classes of MF motifs are defined and analysed. The distributions of single-frame motifs SF (associated with an unambiguous trinucleotide decoding in the two 5′→3′ and 3′→5′ directions) and 5′ unambiguous motifs 5′U (associated with an unambiguous trinucleotide decoding in the 5′→3′ direction only) are analysed without and with constraints. The constraints studied are: initiation and stop codons, periodic codons {AAA, CCC, GGG, TTT}, antiparallel complementarity and parallel complementarity. Taken together, these results suggest that the complementarity property involved in the antiparallel (DNA double helix, RNA stem) and parallel sequences could also be fundamental for coding genes with an unambiguous trinucleotide decoding in the two 5′→3′ and 3′→5′ directions or the 5′→3′ direction only. Furthermore, the single-frame motifs SF with a property of trinucleotide decoding and the framing motifs F (also called circular code motifs; first introduced by Michel (2012)) with a property of reading frame decoding may have been involved in the early life genes to build the modern genetic code and the extant genes. They could have been involved in the stage without anticodon-amino acid interactions or in the Implicated Site Nucleotides (ISN) of RNA interacting with the amino acids. Finally, the SF and MF dipeptides associated with the SF and MF dicodons, respectively, are studied and their importance for biology and the origin of life discussed.
Keywords:
single-frame motifs; multiple-frame motifs; framing motifs; gene coding; antiparallel and parallel sequences; early life genes

1. Introduction

The reading frame coding with trinucleotide sets is a fascinating problem, both theoretical and experimental. Before the discovery of the genetic code, a first code was proposed by Gamow [1] by considering the “key-and-lock” relation between various amino acids and the rhomb-shaped “holes” formed by various nucleotides in the DNA. The proposed model later proved to be false. A few years later, a class of trinucleotide codes, called comma-free codes, was proposed by Crick et al. [2] to explain how the reading of a sequence of trinucleotides could code amino acids and, in particular, how the correct reading frame can be retrieved and maintained. The four nucleotides {A,C,G,T} as well as the 16 dinucleotides {AA,…,TT} are simple codes which are not appropriate for coding 20 amino acids. However, trinucleotides induce a redundancy in their coding. Thus, Crick et al. [2] conjectured that only 20 trinucleotides among the 64 possible trinucleotides {AAA,…,TTT} code for the 20 amino acids. Such a bijective code implies that the coding trinucleotides are found only in one frame, i.e., the comma-freeness property. The determination of a set of 20 trinucleotides forming a comma-free code has several necessary conditions:
(i) A periodic trinucleotide from the set {AAA,CCC,GGG,TTT} must be excluded from such a code. Indeed, the concatenation of AAA with itself, for instance, does not allow the (original) reading frame to be retrieved as there are three possible decompositions: …,AAA,AAA,AAA,… (original frame), …A,AAA,AAA,AA… and …AA,AAA,AAA,A…, the commas showing the adopted decomposition.
(ii) Two non-periodic permuted trinucleotides, i.e., two trinucleotides related by a circular permutation, e.g., ACG and CGA, must also be excluded from such a code. Indeed, the concatenation of ACG with itself, for instance, does not allow the reading frame to be retrieved as there are two possible decompositions: …,ACG,ACG,ACG,… (original frame) and …A,CGA,CGA,CG….
Therefore, by excluding the four periodic trinucleotides and by gathering the 60 remaining trinucleotides in 20 classes of three trinucleotides such that, in each class, the three trinucleotides are deduced from each other by a circular permutation, e.g., ACG, CGA and GAC, we see that a comma-free code can contain only one trinucleotide from each class and thus has at most 20 trinucleotides. This trinucleotide number is identical to the amino acid number, thus leading to a code assigning one trinucleotide per amino acid without ambiguity.
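The counting argument above can be checked with a short enumeration. The following Python sketch is only an illustration (the helper name `circular_class` is hypothetical and not part of the original work): it removes the four periodic trinucleotides and groups the remaining 60 into classes of circular permutations.

```python
from itertools import product

BASES = "ACGT"


def circular_class(trinucleotide):
    """Return the set of circular permutations of a trinucleotide."""
    t = trinucleotide
    return frozenset({t, t[1:] + t[0], t[2] + t[:2]})


trinucleotides = ["".join(p) for p in product(BASES, repeat=3)]
# Periodic trinucleotides are made of a single repeated letter.
periodic = [t for t in trinucleotides if len(set(t)) == 1]
# Each non-periodic trinucleotide belongs to a class of three permuted words.
classes = {circular_class(t) for t in trinucleotides if t not in periodic}

print(len(periodic), len(classes))    # 4 20
print(sorted(circular_class("ACG")))  # ['ACG', 'CGA', 'GAC']
```

Since a comma-free code can take at most one trinucleotide per class, the 20 classes give the upper bound of 20 trinucleotides.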
In the early 1960s, the discovery that the trinucleotide TTT, an excluded trinucleotide in a comma-free code, codes phenylalanine [3] led to the abandonment of the concepts of both a comma-free code [2] and a bijective code, as the genetic code is degenerate [4,5,6] with a gene translation in one direction [7].
In 1996, a statistical analysis of occurrence frequencies of the 64 trinucleotides in the three frames of genes of both prokaryotes and eukaryotes showed that the trinucleotides are not uniformly distributed in these three frames [8]. By excluding the four periodic trinucleotides and by assigning each trinucleotide to a preferential frame (frame of its highest occurrence frequency), three subsets $X = X_0$, $X_1$ and $X_2$ of 20 trinucleotides each are found in the frames 0 (reading frame), 1 (frame 0 shifted by one nucleotide in the 5′→3′ direction, i.e., to the right) and 2 (frame 0 shifted by two nucleotides in the 5′→3′ direction) in genes of both prokaryotes and eukaryotes. The same set X of trinucleotides was identified on average in genes (reading frame) of bacteria, archaea, eukaryotes, plasmids and viruses [9,10]. It contains the 20 following trinucleotides:
$$X = \{AAC, AAT, ACC, ATC, ATT, CAG, CTC, CTG, GAA, GAC, GAG, GAT, GCC, GGC, GGT, GTA, GTC, GTT, TAC, TTC\} \tag{1}$$
and codes the 12 following amino acids (three and one letter notation):
$$X = \{Ala, Asn, Asp, Gln, Glu, Gly, Ile, Leu, Phe, Thr, Tyr, Val\} = \{A, N, D, Q, E, G, I, L, F, T, Y, V\}.$$
This set X has a strong mathematical property. Indeed, X is a maximal $C^3$ self-complementary trinucleotide circular code [8].
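Two elementary properties of the set X can be verified directly from its list of 20 trinucleotides. The Python sketch below is a minimal illustration with hypothetical helper names. It checks that X is self-complementary and that X contains exactly one trinucleotide from each class of circular permutations. Both are necessary conditions only; they do not prove circularity or the $C^3$ property.

```python
X = {
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def antiparallel_complement(t):
    """Complement each nucleotide and reverse the reading direction."""
    return t.translate(COMPLEMENT)[::-1]


def rotations(t):
    """Return the three circular permutations of a trinucleotide."""
    return {t, t[1:] + t[0], t[2] + t[:2]}


# Self-complementarity: the antiparallel complement of each word stays in X.
is_self_complementary = all(antiparallel_complement(t) in X for t in X)
# Maximality condition: exactly one word per class of circular permutations.
one_per_class = all(len(rotations(t) & X) == 1 for t in X)

print(len(X), is_self_complementary, one_per_class)  # 20 True True
```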
The reading frame coding with trinucleotide codes (sets of words) in general terms, i.e., not particularly the genetic code, is a concept which has been studied in Michel [11,12]. We extend it to the motifs (words of codes), a theoretical domain which, to our knowledge, has been ignored. Genes (protein coding regions) can be partitioned into two disjoint classes of motifs: the single-frame motifs SF with an unambiguous trinucleotide decoding in the two 5′→3′ and 3′→5′ directions, and the multiple-frame motifs MF with an ambiguous trinucleotide decoding in at least one direction. A single-frame motif SF has no trinucleotide in reading frame (frame 0) that occurs in a shifted frame (frame 1 or 2). In contrast, a multiple-frame motif MF has at least one trinucleotide in reading frame that occurs in a shifted frame. Some well-known MF motifs are involved in ribosomal frameshifting. The expression of some viral and cellular genes utilizes a −1 programmed ribosomal frameshifting (−1 PRF) [13,14]. This −1 PRF sequence is based on three elements: (i) a slippery motif composed of seven nucleotides at which the change in reading frame occurs; (ii) a spacer motif, usually less than 12 nucleotides; and (iii) a downstream (3′) stimulatory motif, usually a pseudoknot or a stem-loop. In eukaryotes, the slippery motif fits a consensus heptanucleotide X,XXY,YYZ, where XXX is any three identical nucleotides, YYY represents AAA or TTT, Z represents A, C or T, the commas separating the codons in reading frame [15,16]. The slippery motifs $MF_1$ = A,AAA,AAZ and $MF_2$ = T,TTT,TTZ are multiple-frame MF. Indeed, the codon AAA in reading frame also occurs in the shifted frames 1 and 2 in $MF_1$, and similarly with the codon TTT in $MF_2$. Alternative gene decoding is also possible with +1 programmed ribosomal frameshifting (+1 PRF) which has been particularly observed in Euplotes [17]. The identified slippery motif TTT,TAR where R = {A, G} is multiple-frame MF. The slippery motifs AAA, CCC, GGG and TTT may cause frameshifting during transcription, producing RNAs missing specific nucleotides when compared to template DNA [18,19]. Slippery motifs are not always multiple-frame; it should be stressed, however, that the spacer and the downstream stimulatory motifs have been very poorly characterized [20] and could also be involved in such a multiple-frame definition. From a theoretical point of view, it is important to extend this concept by increasing the length of such multiple-frame slippery motifs and also by considering their different classes. While the multiple-frame motifs may be involved in ribosomal frameshifting, the single-frame motifs SF and the framing motifs F (also called circular code motifs; first introduced in Michel [21,22]) from the circular codes [8,9,10] (reviews in Michel [23]; Fimmel and Strüngmann [24]) may have been important in early life genes for constructing the modern genetic code and the extant genes (see Discussion).
Several classes of MF motifs are defined: (i) a unidirectional multiple-frame motif 3′UMF has no trinucleotide in reading frame that occurs in a shifted frame after its reading (i.e., its position in the reading frame) but has at least one trinucleotide in reading frame that occurs in a shifted frame before its reading, e.g., the dicodon AACACA is 3′UMF as the trinucleotides AAC and (trivially) ACA do not occur in a shifted frame after their reading and as the trinucleotide ACA occurs in a shifted frame (precisely frame 1) before its reading; (ii) a unidirectional multiple-frame motif 5′UMF, the opposite, has no trinucleotide in reading frame that occurs in a shifted frame before its reading but has at least one trinucleotide in reading frame that occurs in a shifted frame after its reading, e.g., the dicodon ACACAA, mirror of AACACA, is 5′UMF as the trinucleotides (trivially) ACA and CAA do not occur in a shifted frame before their reading and as the trinucleotide ACA occurs in a shifted frame (precisely frame 2) after its reading; and (iii) a bidirectional multiple-frame motif BMF has at least one trinucleotide in reading frame that occurs in a shifted frame before its reading and has at least one trinucleotide in reading frame that occurs in a shifted frame after its reading (both 3′UMF and 5′UMF), e.g., the dicodons AAAAAA and ACACAC are BMF. A 5′ unambiguous motif 5′U is either an SF motif or a 3′UMF motif, e.g., the dicodons AAACAA (SF motif) and AACACA (3′UMF motif) belong to the class 5′U.
We will only investigate here the distribution of the single-frame motifs SF associated with an unambiguous trinucleotide decoding in the two 5′→3′ and 3′→5′ directions, and the 5′ unambiguous motifs 5′U associated with an unambiguous trinucleotide decoding in the 5′→3′ direction only, i.e., a less restrictive class of motifs. The distributions of SF and 5′U motifs will be analysed without and with constraints. The constraints studied are: (i) with initiation and stop codons; (ii) without periodic codons {AAA, CCC, GGG, TTT}; (iii) with antiparallel complementarity; and (iv) with parallel complementarity.
We will also investigate the particular case of motifs made up of two codons, i.e., the dicodons. The definitions of SF and MF dicodons will thus identify two new classes of dipeptides, the SF and MF dipeptides. The SF dipeptides are coded by dicodons with an unambiguous trinucleotide decoding, in contrast to the MF dipeptides which are coded by dicodons with an ambiguous trinucleotide decoding. The concept of SF and MF dipeptides might be of predictive value to studies of prebiotic metabolites [25]. Peptide evolution on the primitive earth is an active and exciting field of research with cyclic dipeptides [26] and selective formation of SerHis dipeptide via phosphorus activation [27,28].

2. Method

2.1. Recall of Biological Definitions

Notation 1.
Let us denote the nucleotide 4-letter alphabet by $B = \{A, C, G, T\}$, where A stands for adenine, C for cytosine, G for guanine and T for thymine. The trinucleotide set over B is denoted by $B^3 = \{AAA, \ldots, TTT\}$. The set of non-empty words (words, respectively) over B is denoted by $B^+$ ($B^*$, respectively).
Definition 1.
According to the complementary property of the DNA double helix, the nucleotide complementarity map $C : B \to B$ is defined by $C(A) = T$, $C(C) = G$, $C(G) = C$, $C(T) = A$. According to the complementary and antiparallel properties of the DNA double helix, the trinucleotide antiparallel complementarity map $C : B^3 \to B^3$ is defined by $C(l_0 l_1 l_2) = C(l_2) C(l_1) C(l_0)$ for all $l_0, l_1, l_2 \in B$. The trinucleotide parallel complementarity map $D : B^3 \to B^3$ is defined by $D(l_0 l_1 l_2) = C(l_0) C(l_1) C(l_2)$ for all $l_0, l_1, l_2 \in B$.
Example 1.
$C(ACG) = CGT$ and $D(ACG) = TGC$.
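The two maps of Definition 1 translate directly into code. The following Python sketch is an illustration only; the function names are hypothetical, and words are represented as plain strings over the alphabet {A, C, G, T}.

```python
# Nucleotide complementarity map C : B -> B.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def antiparallel_complement(trinucleotide):
    """Map C on trinucleotides: complement the letters in reverse order."""
    return "".join(COMPLEMENT[l] for l in reversed(trinucleotide))


def parallel_complement(trinucleotide):
    """Map D on trinucleotides: complement the letters in the same order."""
    return "".join(COMPLEMENT[l] for l in trinucleotide)


print(antiparallel_complement("ACG"))  # CGT
print(parallel_complement("ACG"))      # TGC
```

Both maps are involutions: applying either of them twice returns the original trinucleotide.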

2.2. Recall of Circular Code Definitions

Definition 2.
A set $S \subseteq B^+$ is a code if, for each $x_1, \ldots, x_n, y_1, \ldots, y_m \in S$, $n, m \geq 1$, the condition $x_1 \cdots x_n = y_1 \cdots y_m$ implies $n = m$ and $x_i = y_i$ for $i = 1, \ldots, n$.
Definition 3.
Any non-empty subset of the code $B^3$ is a code, called a trinucleotide code.
Definition 4.
A trinucleotide code $X \subseteq B^3$ is circular if, for each $x_1, \ldots, x_n, y_1, \ldots, y_m \in X$, $n, m \geq 1$, $r \in B^*$, $s \in B^+$, the conditions $s x_2 \cdots x_n r = y_1 \cdots y_m$ and $x_1 = rs$ imply $n = m$, $r = \varepsilon$ (empty word) and $x_i = y_i$ for $i = 1, \ldots, n$.
We briefly recall the proof used here to determine whether a code is circular or not, with the most recent and powerful approach which relates an oriented (directed) graph to a trinucleotide code.
Definition 5.
[29]. Let $X \subseteq B^3$ be a trinucleotide code. The directed graph $G(X) = (V(X), E(X))$ associated with X has a finite set of vertices $V(X)$ and a finite set of oriented edges $E(X)$ (ordered pairs $[v, w]$ where $v, w \in V(X)$) defined as follows:
$$\begin{cases} V(X) = \{N_1, N_3, N_1N_2, N_2N_3 : N_1N_2N_3 \in X\} \\ E(X) = \{[N_1, N_2N_3], [N_1N_2, N_3] : N_1N_2N_3 \in X\} \end{cases}$$
The theorem below gives a relation between a trinucleotide code which is circular and its associated graph.
Theorem 1.
[29]. Let $X \subseteq B^3$ be a trinucleotide code. The following statements are equivalent:
(i) The code X is circular.
(ii) The graph $G(X)$ is acyclic.
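Theorem 1 turns the circularity test into a cycle search in a small directed graph. The following Python sketch is a minimal illustration, not the implementation used for this paper: the function names are hypothetical and the cycle detection is a standard depth-first search. It builds $G(X)$ as in Definition 5 and reports whether it is acyclic.

```python
X = {
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
}


def code_graph(code):
    """Edges [N1, N2N3] and [N1N2, N3] for each trinucleotide N1N2N3."""
    edges = {}
    for t in code:
        edges.setdefault(t[0], set()).add(t[1:])
        edges.setdefault(t[:2], set()).add(t[2])
    return edges


def is_circular(code):
    """Return True if G(code) is acyclic (Theorem 1)."""
    edges = code_graph(code)
    in_progress, done = set(), set()

    def has_cycle(v):
        in_progress.add(v)
        for w in edges.get(v, ()):
            if w in in_progress:
                return True          # back edge: a cycle exists
            if w not in done and has_cycle(w):
                return True
        in_progress.discard(v)
        done.add(v)
        return False

    return not any(v not in done and has_cycle(v) for v in list(edges))


print(is_circular(X))                # True
print(is_circular({"ACG", "CGA"}))   # False (permuted trinucleotides)
print(is_circular({"AAA"}))          # False (periodic trinucleotide)
```

The two negative cases correspond to the exclusions (i) and (ii) recalled in the Introduction.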
Definition 6.
Circular code motifs (first introduced by Michel [21,22]), also called here framing motifs F , are motifs from the circular codes. They have the capacity to retrieve, maintain and synchronize the reading frame in genes.
Example 2.
Let a framing motif $F_1$ = …AGGTAATTACCAG… be constructed with the circular code X (1) identified in genes of bacteria, archaea, eukaryotes, plasmids and viruses [8,9,10].
(i) Such a framing motif $F_1$ can be obtained as follows. A sequence s of trinucleotides of X is generated and a substring is extracted at any position in this sequence s, i.e., the series of nucleotides on the right and the left of the substring are not considered. Let this substring be $F_1$. (ii) This framing motif $F_1$ allows the reading frame to be retrieved (Figure 1). We try the three possible decompositions $w_0$, $w_1$ (shifted by one letter to the right) and $w_2$ (shifted by two letters to the right) of $F_1$. With $w_0$, AG is not a prefix of any trinucleotide of X, thus the frame associated with $w_0$ is impossible. With $w_2$, AG is a suffix of CAG and GAG belonging to X, then GTA, ATT and ACC belong to X, followed by A which is a prefix of five trinucleotides of X. Thus at this position, the frame associated with $w_2$ is still possible and $2 + 3 \times 3 + 1 = 12$ nucleotides are read. The next letter G leads to AG which is not a prefix of any trinucleotide of X. Thus, a window of $12 + 1 = 13$ nucleotides demonstrates that the frame associated with $w_2$ is impossible. With $w_1$, A is a suffix of GAA and GTA belonging to X, then GGT, AAT, TAC, CAG, etc., belong to X. Thus, the reading frame of $F_1$ is associated with $w_1$, i.e., the first letter A of the motif is the 3rd letter of a trinucleotide of X: the reading frame of the sequence s is retrieved: …A,GGT,AAT,TAC,CAG,… (the commas showing the reading frame). (iii) We can prove mathematically that a window of 13 nucleotides always retrieves the reading frame with the circular code X.
Four framing motifs F need a window of 13 nucleotides with the circular code X as they are the four longest ambiguous words of length $l = 12$ nucleotides: $F_1$ = AGGTAATTACCA, $F_2$ = AGGTAATTACCT (with $w_2$, the first two letters AG are a suffix of CAG and GAG belonging to X, and the last letter T is a prefix of TAC and TTC belonging to X), $F_3$ = TGGTAATTACCA (with $w_2$, the first two letters TG are a suffix of CTG belonging to X, and the last letter A is a prefix of five trinucleotides of X) and $F_4$ = TGGTAATTACCT (with $w_2$, the first two letters TG are a suffix of CTG belonging to X, and the last letter T is a prefix of TAC and TTC belonging to X). These four framing motifs F contain the two longest ambiguous words of length $l = 11$ nucleotides starting with a trinucleotide of X, i.e., when the suffixes of X are not considered: GGTAATTACCA and GGTAATTACCT (see last row in Table 1 in [21]). (iv) It is very important to stress that for all the other framing motifs F of the circular code X, i.e., different from $F_1$, $F_2$, $F_3$ and $F_4$, the window for retrieving the reading frame is less than 13 nucleotides (see the growth function of the window as a function of the number of nucleotides in Figure 4 in [21]). It is also very important to recall that any motif of the circular code X is framing, i.e., it has the property of reading frame retrieval.
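The reasoning of Example 2 can be replayed mechanically. The following Python sketch is an illustration under stated assumptions (the function name `compatible_frames` is hypothetical): for each decomposition $w_f$, the leading f letters must be a suffix of a trinucleotide of X, every complete trinucleotide must belong to X, and the trailing letters must be a prefix of a trinucleotide of X.

```python
X = {
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
}


def compatible_frames(motif, code=X):
    """Return the shifts f in {0, 1, 2} whose decomposition fits the code."""
    frames = []
    for f in range(3):
        head = motif[:f]                              # incomplete first word
        body_end = f + 3 * ((len(motif) - f) // 3)
        body = [motif[i:i + 3] for i in range(f, body_end, 3)]
        tail = motif[body_end:]                       # incomplete last word
        if (any(t.endswith(head) for t in code)
                and all(t in code for t in body)
                and any(t.startswith(tail) for t in code)):
            frames.append(f)
    return frames


print(compatible_frames("AGGTAATTACCA"))   # [1, 2]: still ambiguous
print(compatible_frames("AGGTAATTACCAG"))  # [1]: reading frame retrieved
```

With 12 nucleotides the decompositions $w_1$ and $w_2$ are both possible; the 13th nucleotide eliminates $w_2$, in agreement with the window of 13 nucleotides stated above.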

2.3. Definitions of Single-Frame and Multiple-Frame Motifs

Definition 7.
An n-motif, also called n-codon, is a series of trinucleotides $t_i$ in $B^3$ of trinucleotide length n, $i \in \{1, \ldots, n\}$, which defines the reading frame $f = 0$, i.e., $t_1 t_2 \cdots t_n$.
Definition 8.
The shifted frames $f = 1$ and $f = 2$ of an n-motif are series of trinucleotides $t_i^f$ in $B^3$ of trinucleotide length $n - 1$, $i \in \{1, \ldots, n - 1\}$, starting at the 2nd and 3rd nucleotide of $t_1 = l_0 l_1 l_2$ of the n-motif, i.e., at $l_1$ ($f = 1$) and $l_2$ ($f = 2$).
Notation 2.
Let T be the set of trinucleotides in reading frame $f = 0$ of an n-motif. Let $T^f$ be the set of trinucleotides in a shifted frame $f \in \{1, 2\}$ of an n-motif.
A single-frame motif SF has no trinucleotide t in reading frame that occurs in a shifted frame, i.e., the trinucleotide decoding is unambiguous in the two 5′→3′ and 3′→5′ directions. Formally:
Definition 9.
A single-frame n-motif SF (unambiguous trinucleotide decoding in the two 5′→3′ and 3′→5′ directions) is an n-motif such that $T \cap T^f = \emptyset$ for $f \in \{1, 2\}$, i.e., $t_i \neq t_j^f$ for $i \in \{1, \ldots, n\}$, for $j \in \{1, \ldots, n - 1\}$ and for $f \in \{1, 2\}$.
Example 3.
Let the dicodon be AAACAA (2-motif). The trinucleotides in reading frame are $t_1 = AAA$ and $t_2 = CAA$, leading to the trinucleotide set $T = \{AAA, CAA\}$. The single trinucleotide in the shifted frame 1 is $t_1^1 = AAC$, leading to the trinucleotide set $T^1 = \{AAC\}$. The single trinucleotide in the shifted frame 2 is $t_1^2 = ACA$, leading to the trinucleotide set $T^2 = \{ACA\}$. As $T \cap T^1 = \emptyset$ and $T \cap T^2 = \emptyset$, AAACAA is a single-frame dicodon SF (Figure 2).
A multiple-frame motif MF, in contrast to an SF motif, has at least one trinucleotide t in reading frame that occurs in a shifted frame f. Formally:
Definition 10.
A multiple-frame n-motif MF (ambiguous trinucleotide decoding in at least one direction) is an n-motif such that $T \cap T^f \neq \emptyset$ for some $f \in \{1, 2\}$, i.e., $\exists i \in \{1, \ldots, n\}$, $\exists j \in \{1, \ldots, n - 1\}$, $\exists f \in \{1, 2\}$: $t_i = t_j^f$.
The unidirectional multiple-frame motifs UMF belong to a class of MF motifs where all the trinucleotides $t^f$ in a shifted frame f occur only before (3′UMF: 3′→5′ direction) or only after (5′UMF: 5′→3′ direction) the trinucleotides t in reading frame. Formally:
Definition 11.
A unidirectional multiple-frame n-motif 3′UMF (ambiguous trinucleotide decoding in the 3′→5′ direction only) is an MF n-motif ($T \cap T^f \neq \emptyset$ for some $f \in \{1, 2\}$) such that the condition $t_i = t_j^f$ implies $i > j$ for $i \in \{1, \ldots, n\}$, for $j \in \{1, \ldots, n - 1\}$ and for $f \in \{1, 2\}$.
Example 4.
Let the dicodon be AACACA. The trinucleotides in reading frame are $t_1 = AAC$ and $t_2 = ACA$, leading to $T = \{AAC, ACA\}$. The single trinucleotide in the shifted frame 1 is $t_1^1 = ACA$, leading to $T^1 = \{ACA\}$. The single trinucleotide in the shifted frame 2 is $t_1^2 = CAC$, leading to $T^2 = \{CAC\}$. As $T \cap T^1 \neq \emptyset$, AACACA is a multiple-frame dicodon MF. Furthermore, as $t_2 = t_1^1 = ACA$ yields the inequality $2 > 1$, as $t_1 = AAC \neq t_1^1 = ACA$ and as $t_1 = AAC \neq t_1^2 = CAC$, AACACA is a unidirectional multiple-frame dicodon 3′UMF (Figure 3).
Definition 12.
A unidirectional multiple-frame n-motif 5′UMF (ambiguous trinucleotide decoding in the 5′→3′ direction only) is an MF n-motif ($T \cap T^f \neq \emptyset$ for some $f \in \{1, 2\}$) such that the condition $t_i = t_j^f$ implies $i \leq j$ for $i \in \{1, \ldots, n\}$, for $j \in \{1, \ldots, n - 1\}$ and for $f \in \{1, 2\}$.
Example 5.
Let the dicodon be AAAAAC. The trinucleotides in reading frame are $t_1 = AAA$ and $t_2 = AAC$, leading to $T = \{AAA, AAC\}$. The trinucleotides in the shifted frames 1 and 2 are $t_1^1 = t_1^2 = AAA$, leading to the trinucleotide sets $T^1 = T^2 = \{AAA\}$. As $T \cap T^1 \neq \emptyset$ and $T \cap T^2 \neq \emptyset$, AAAAAC is a multiple-frame dicodon MF. Furthermore, as $t_1 = t_1^1 = t_1^2 = AAA$ yields the two inequalities $1 \leq 1$ and as $t_2 = AAC \neq t_1^1 = t_1^2 = AAA$, AAAAAC is a unidirectional multiple-frame dicodon 5′UMF (Figure 4).
Example 6.
Let the dicodon be ACACAA. The trinucleotides in reading frame are $t_1 = ACA$ and $t_2 = CAA$, leading to $T = \{ACA, CAA\}$. The single trinucleotide in the shifted frame 1 is $t_1^1 = CAC$, leading to $T^1 = \{CAC\}$. The single trinucleotide in the shifted frame 2 is $t_1^2 = ACA$, leading to $T^2 = \{ACA\}$. As $T \cap T^2 \neq \emptyset$, ACACAA is a multiple-frame dicodon MF. Furthermore, as $t_1 = t_1^2 = ACA$ yields the inequality $1 \leq 1$, as $t_2 = CAA \neq t_1^1 = CAC$ and as $t_2 = CAA \neq t_1^2 = ACA$, ACACAA is a unidirectional multiple-frame dicodon 5′UMF (Figure 5). The same conclusion follows immediately by noting that the dicodon ACACAA is the mirror of AACACA (compare with Example 4).
Definition 13.
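A bidirectional multiple-frame n-motif BMF (ambiguous trinucleotide decoding in the two 5′→3′ and 3′→5′ directions) is an MF n-motif with both kinds of ambiguity, i.e., it has at least one equality $t_i = t_j^f$ with $i \leq j$ (as in a 5′UMF n-motif) and at least one equality $t_i = t_j^f$ with $i > j$ (as in a 3′UMF n-motif).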
Example 7.
Let the trivial dicodon be AAAAAA. The trinucleotides in reading frame are $t_1 = t_2 = AAA$, leading to the trinucleotide set $T = \{AAA\}$. The trinucleotides in the shifted frames 1 and 2 are $t_1^1 = t_1^2 = AAA$, leading to the trinucleotide sets $T^1 = T^2 = \{AAA\}$. As $T \cap T^1 \neq \emptyset$ and $T \cap T^2 \neq \emptyset$, AAAAAA is a multiple-frame dicodon MF. Furthermore, as $t_1 = t_1^1 = t_1^2 = AAA$ yields the two inequalities $1 \leq 1$ and as $t_2 = t_1^1 = t_1^2 = AAA$ yields the two inequalities $2 > 1$, AAAAAA is a bidirectional multiple-frame dicodon BMF (Figure 6).
Example 8.
Let the dicodon be ACACAC. The trinucleotides in reading frame are $t_1 = ACA$ and $t_2 = CAC$, leading to $T = \{ACA, CAC\}$. The single trinucleotide in the shifted frame 1 is $t_1^1 = CAC$, leading to $T^1 = \{CAC\}$. The single trinucleotide in the shifted frame 2 is $t_1^2 = ACA$, leading to $T^2 = \{ACA\}$. As $T \cap T^1 \neq \emptyset$ and $T \cap T^2 \neq \emptyset$, ACACAC is a multiple-frame dicodon MF. Furthermore, as $t_1 = t_1^2 = ACA$ yields the inequality $1 \leq 1$ and as $t_2 = t_1^1 = CAC$ yields the inequality $2 > 1$, ACACAC is a bidirectional multiple-frame dicodon BMF (Figure 7).
In this paper, by varying $n \in \mathbb{N}^*$, we will investigate two distributions: the single-frame n-motifs SF with an unambiguous trinucleotide decoding in the two 5′→3′ and 3′→5′ directions (see Definition 9), and the 5′ unambiguous n-motifs 5′U with an unambiguous trinucleotide decoding in the 5′→3′ direction only, which are defined formally as follows:
Definition 14.
A 5′ unambiguous n-motif 5′U (unambiguous trinucleotide decoding in the 5′→3′ direction only) is either an SF n-motif or a 3′UMF n-motif, i.e., neither a 5′UMF n-motif nor a BMF n-motif.
Example 9.
The dicodons AAACAA (SF motif) and AACACA (3′UMF motif) belong to the class 5′U.
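Definitions 9 to 14 can be summarised by a single classification routine. The Python sketch below is an illustration, not the program used for this study; the function names and the ASCII class labels ("SF", "3'UMF", "5'UMF", "BMF") are choices made here. It compares each reading-frame trinucleotide $t_i$ with each shifted trinucleotide $t_j^f$ and records whether an equality occurs with $i > j$ or with $i \leq j$.

```python
def classify(motif):
    """Classify an n-motif as 'SF', "3'UMF", "5'UMF" or 'BMF'."""
    if len(motif) == 0 or len(motif) % 3 != 0:
        raise ValueError("a motif is a non-empty series of trinucleotides")
    n = len(motif) // 3
    reading = [motif[3 * i:3 * i + 3] for i in range(n)]   # t_1 .. t_n
    before = after = False
    for f in (1, 2):
        for j in range(n - 1):                 # t_1^f .. t_{n-1}^f
            shifted = motif[3 * j + f:3 * j + f + 3]
            for i, t in enumerate(reading):
                if t == shifted:
                    if i > j:
                        before = True          # ambiguity in 3'->5' direction
                    else:
                        after = True           # ambiguity in 5'->3' direction
    if before and after:
        return "BMF"
    if after:
        return "5'UMF"
    if before:
        return "3'UMF"
    return "SF"


def is_5prime_unambiguous(motif):
    """Definition 14: SF or 3'UMF."""
    return classify(motif) in ("SF", "3'UMF")


for dicodon in ("AAACAA", "AACACA", "AAAAAC", "ACACAA", "AAAAAA", "ACACAC"):
    print(dicodon, classify(dicodon))
```

The loop reproduces Examples 3 to 8: AAACAA is SF, AACACA is 3′UMF, AAAAAC and ACACAA are 5′UMF, and AAAAAA and ACACAC are BMF.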

2.4. Occurrence Probabilities of Single-Frame n-Motifs SF and 5′ Unambiguous n-Motifs 5′U

Definition 15.
Let $\mathrm{NbSFM}(n)$ and $\mathrm{NbMFM}(n)$ be the numbers of n-motifs ($n \in \mathbb{N}^*$) single-frame SF and multiple-frame MF, respectively. Let $\mathrm{Nb5'UMFM}(n)$, $\mathrm{Nb3'UMFM}(n)$ and $\mathrm{NbBMFM}(n)$ be the numbers of multiple-frame n-motifs ($n \in \mathbb{N}^*$) which are unidirectional 5′UMF, unidirectional 3′UMF and bidirectional BMF, respectively.
For $n \in \mathbb{N}^*$, we have the obvious relations:
$$\mathrm{NbSFM}(n) + \mathrm{NbMFM}(n) = 64^n, \qquad \mathrm{NbMFM}(n) = \mathrm{Nb5'UMFM}(n) + \mathrm{Nb3'UMFM}(n) + \mathrm{NbBMFM}(n). \tag{2}$$
For $n \in \mathbb{N}^*$, the occurrence probability $\mathrm{PbSFM}(n)$ of single-frame n-motifs SF will be computed according to
$$\mathrm{PbSFM}(n) = 1 - \frac{\mathrm{NbMFM}(n)}{64^n}. \tag{3}$$
Similarly, for $n \in \mathbb{N}^*$, the occurrence probability $\mathrm{Pb5'UM}(n)$ of 5′ unambiguous n-motifs 5′U will be computed as follows
$$\mathrm{Pb5'UM}(n) = \mathrm{PbSFM}(n) + \frac{\mathrm{Nb3'UMFM}(n)}{64^n}. \tag{4}$$
Remark 1.
Obviously, $\mathrm{Pb5'UM}(n) \geq \mathrm{PbSFM}(n)$ whatever n, with equality for $n = 1$ (Section 2.5). However, it will be interesting to compare these two probability distributions by varying n.

2.5. Single-Frame 1-Motifs

This is a trivial case. Each of the 64 codons (1-motifs, $n = 1$) is obviously a single-frame motif SF, by definition (non-existence of a shifted frame). Thus, the probabilities of SF and 5′U 1-motifs are equal to $\mathrm{PbSFM}(1) = \mathrm{Pb5'UM}(1) = 1$.

2.6. Single-Frame 2-Motifs

There are $64^2 = 4096$ dicodons (2-motifs, $n = 2$). The complete study of dicodons which are single-frame SF and multiple-frame MF can be done by hand without difficulty. For the convenience of the reader, we give the complete list of MF dicodons: BMF (Definition 13, Table 1), 3′UMF (Definition 11, Table 2) and 5′UMF (Definition 12, Table 3).
The probability of SF 2-motifs is equal to $\mathrm{PbSFM}(2) = 1 - (16 + 2 \times 96)/64^2 = 0.9492$. The probability of 5′U 2-motifs is equal to $\mathrm{Pb5'UM}(2) = \mathrm{PbSFM}(2) + 96/64^2 = 0.9727$.
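The dicodon counts can be confirmed by exhaustive enumeration. The following Python sketch is an illustration only (function names and class labels are choices made here, not the original program): it classifies the 4096 dicodons and evaluates the two probabilities from the counts.

```python
from collections import Counter
from itertools import product


def classify(motif):
    """Classify an n-motif as 'SF', "3'UMF", "5'UMF" or 'BMF'."""
    n = len(motif) // 3
    reading = [motif[3 * i:3 * i + 3] for i in range(n)]
    before = after = False
    for f in (1, 2):
        for j in range(n - 1):
            shifted = motif[3 * j + f:3 * j + f + 3]
            for i, t in enumerate(reading):
                if t == shifted:
                    before = before or i > j
                    after = after or i <= j
    if before and after:
        return "BMF"
    return "5'UMF" if after else "3'UMF" if before else "SF"


def class_counts(n):
    """Count the classes over all 64**n n-motifs (feasible for small n)."""
    return Counter(classify("".join(p)) for p in product("ACGT", repeat=3 * n))


counts = class_counts(2)
total = 64 ** 2
pb_sfm = counts["SF"] / total
pb_5um = (counts["SF"] + counts["3'UMF"]) / total

print(counts["BMF"], counts["3'UMF"], counts["5'UMF"])  # 16 96 96
print(round(pb_sfm, 4), round(pb_5um, 4))               # 0.9492 0.9727
```

The same function can be called with n = 3, but the running time grows by a factor of 64 with each additional trinucleotide.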
Remark 2.
For $n \geq 3$, the 3′UMF and 5′UMF n-motifs can have two different shifted trinucleotides in the two frames 1 and 2, in contrast to the 2-motifs (see Table 2 and Table 3). For example, with the tricodon AACAAAACC, the trinucleotides in reading frame are $t_1 = AAC$, $t_2 = AAA$ and $t_3 = ACC$, leading to $T = \{AAA, AAC, ACC\}$. The trinucleotides in the shifted frame 1 are $t_1^1 = ACA$ and $t_2^1 = AAA$, leading to $T^1 = \{AAA, ACA\}$. The trinucleotides in the shifted frame 2 are $t_1^2 = CAA$ and $t_2^2 = AAC$, leading to $T^2 = \{AAC, CAA\}$. As $T \cap T^1 \neq \emptyset$ and $T \cap T^2 \neq \emptyset$, AACAAAACC is a multiple-frame tricodon MF. Furthermore, as $t_1 = t_2^2 = AAC$ yields the inequality $1 \leq 2$, as $t_2 = t_2^1 = AAA$ yields the inequality $2 \leq 2$ and as $t_3 = ACC \notin T^1 \cup T^2$, AACAAAACC is a unidirectional multiple-frame tricodon 5′UMF with two different trinucleotides in the two frames 1 and 2, i.e., AAA in frame 1 and AAC in frame 2.

2.7. Single-Frame n-Motifs

The determination of the probability $\mathrm{PbSFM}(n)$ of single-frame n-motifs SF for $n \geq 3$ (tricodons, tetracodons, etc.) cannot be done by hand. For $n \in \{3, \ldots, 6\}$ (tricodons up to hexacodons), exact values of the probability $\mathrm{PbSFM}(n)$ can be obtained by computer calculation (see Table 4). For $n = 6$, the computation of SF motifs among the $64^6 = 68{,}719{,}476{,}736$ hexacodons with a parallel program with 8 threads takes about 7 days on a standard PC. For $n \geq 7$ (heptacodons, octocodons, etc.), the probability $\mathrm{PbSFM}(n)$ is obtained by computer simulation. Simulated values of $\mathrm{PbSFM}(n)$ are obtained by generating 1,000,000 random n-motifs for each n. In order to evaluate this approach by computer simulation, simulated values of $\mathrm{PbSFM}(n)$ for $n \in \{2, \ldots, 6\}$ are also given in Table 4. Exact and simulated values of $\mathrm{PbSFM}(n)$ agree to within $10^{-3}$, demonstrating the reliability of the simulation approach.
The probability $\mathrm{Pb5'UM}(n)$ of 5′ unambiguous n-motifs 5′U for $n \geq 3$ is computed similarly.
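The simulation approach can be sketched in a few lines. The Python code below is an illustration under explicit assumptions: random n-motifs are drawn with independent, equiprobable nucleotides, the function names are hypothetical, and the default sample size is smaller than the 1,000,000 motifs used in the paper so that the example runs quickly.

```python
import random


def is_single_frame(motif):
    """Definition 9: no reading-frame trinucleotide occurs in frame 1 or 2."""
    n = len(motif) // 3
    reading = {motif[3 * i:3 * i + 3] for i in range(n)}
    # Start positions of the shifted trinucleotides t_j^f, j = 1 .. n - 1.
    shifted_starts = (k for k in range(1, 3 * n - 3) if k % 3 != 0)
    return all(motif[k:k + 3] not in reading for k in shifted_starts)


def simulate_pb_sfm(n, samples=20_000, seed=0):
    """Estimate PbSFM(n) from uniformly random n-motifs (assumed model)."""
    rng = random.Random(seed)
    hits = sum(
        is_single_frame("".join(rng.choices("ACGT", k=3 * n)))
        for _ in range(samples)
    )
    return hits / samples


for n in (1, 2, 6, 14):
    print(n, simulate_pb_sfm(n))
```

For n = 2 the estimate can be compared with the exact value 0.9492 of Section 2.6; the statistical error decreases as the inverse square root of the number of samples.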

3. Results

3.1. Single-Frame Motifs

I first investigated the probability $\mathrm{PbSFM}(n)$ (Equation (3)) of single-frame n-motifs SF (Definition 9). The probability $\mathrm{PbSFM}(1)$ is equal to 1 (1-motifs, Section 2.5). The probability $\mathrm{PbSFM}(2)$ is equal to 94.9% (2-motifs, Section 2.6). The probability $\mathrm{PbSFM}(n)$ for $n \in \{3, \ldots, 6\}$ is given in Table 4. The probability $\mathrm{PbSFM}(n)$ for $n \geq 7$ is obtained by computer simulation (Section 2.7).
While the proportion of multiple-frame 2-motifs MF (Definition 10) is minimal (5.1% = 100% − 94.9% for dicodons, Section 2.6), Figure 8 shows that their propagation drastically reduces the proportion of SF n-motifs when the trinucleotide length n increases. There are almost no more SF motifs with a length of 14 trinucleotides ($\mathrm{PbSFM}(14) < 1\%$), and the number of MF motifs already exceeds the number of SF motifs at a length of six trinucleotides (Figure 8).
Thus, only short genes, i.e., with up to five trinucleotides, have a higher proportion of single-frame motifs compared to the multiple-frame motifs. Consequently, primitive translation, without the extant complex ribosome, could only generate short peptides without frameshift errors.

3.2. 5′ Unambiguous Motifs

I then compared the probability $\mathrm{PbSFM}(n)$ (Equation (3)) of single-frame n-motifs SF (Definition 9) and the probability $\mathrm{Pb5'UM}(n)$ (Equation (4)) of 5′ unambiguous n-motifs 5′U (Definition 14). Figure 9 shows the decreasing probability $\mathrm{Pb5'UM}(n)$ of 5′U n-motifs when the trinucleotide length n increases. As expected (see Remark 1), its decrease is slower than that of SF n-motifs. There are almost no more 5′U motifs with a length of 20 trinucleotides ($\mathrm{Pb5'UM}(20) < 1\%$). Thus with the 5′U motifs, there is a length increase of 20 − 14 = 6 trinucleotides in the trinucleotide decoding. The maximum probability difference $\mathrm{Pb5'UM}(n) - \mathrm{PbSFM}(n)$ is 22.0% at length $n = 8$ trinucleotides.
The 5′ unambiguous n-motifs, a less restrictive class of motifs with an unambiguous trinucleotide decoding in the 5′→3′ direction only, can generate slightly longer peptides without frameshift error compared to the single-frame motifs.
I now evaluate the single-frame motifs S F and the 5′ unambiguous motifs 5 U with constraints.

3.3. Single-Frame and 5′ Unambiguous Motifs with Initiation and Stop Codons

The single-frame n-motifs SF and the 5′ unambiguous n-motifs 5′U are investigated with an initiation codon ATG and a stop codon from {TAA, TAG, TGA}. The case $n = 1$ does not exist. For $n = 2$, there are only three dicodons: ATGTAA, ATGTAG and ATGTGA, which are all obviously SF. Thus, the probabilities of SF and 5′U 2-motifs are obviously $\mathrm{PbSFM}(2) = \mathrm{Pb5'UM}(2) = 1$. Figure 10 shows that the proportions of SF and 5′U motifs with initiation and stop codons are lower than their respective non-constrained motifs.
Genes with initiation and stop codons do not increase translation fidelity compared to non-constrained genes (according to this approach).
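The n = 2 case can be verified directly, and the same enumeration extends to slightly longer constrained motifs. In this sketch the inner codons between ATG and the stop codon are left unrestricted, which is an assumption; the function names are hypothetical.

```python
from itertools import product

STOP_CODONS = ("TAA", "TAG", "TGA")
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

def is_single_frame(motif):
    """True if no reading-frame trinucleotide occurs in shifted frame 1 or 2."""
    n = len(motif) // 3
    reading = {motif[3 * i:3 * i + 3] for i in range(n)}
    shifted = {motif[3 * j + f:3 * j + f + 3] for j in range(n - 1) for f in (1, 2)}
    return reading.isdisjoint(shifted)

def constrained_sf_proportion(n):
    """Proportion of SF motifs among n-motifs of the form ATG ... stop, n >= 2
    (assumption: the n - 2 inner codons may be any of the 64 trinucleotides)."""
    count = total = 0
    for inner in product(CODONS, repeat=n - 2):
        for stop in STOP_CODONS:
            total += 1
            count += is_single_frame("ATG" + "".join(inner) + stop)
    return count / total

if __name__ == "__main__":
    print([d for d in ("ATGTAA", "ATGTAG", "ATGTGA") if is_single_frame(d)])
    print(constrained_sf_proportion(2), constrained_sf_proportion(3))
```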

3.4. Single-Frame and 5′ Unambiguous Motifs without Periodic Codons

The single-frame motifs SF and the 5′ unambiguous motifs 5′U are now studied without the periodic codons {AAA, CCC, GGG, TTT}. As expected, Figure 11 shows that the proportions of SF and 5′U motifs without periodic codons are higher than those of their respective non-constrained motifs.
Genes without periodic codons slightly increase frame translation fidelity compared to non-constrained genes (according to this approach).

3.5. Single-Frame and 5′ Unambiguous Motifs with Antiparallel Complementarity

The single-frame 2n-motifs SF and the 5′ unambiguous 2n-motifs 5′U are now investigated with the following antiparallel complementary sequence: $t_1 t_2 \cdots t_n C(t_n) \cdots C(t_2) C(t_1)$, where the trinucleotide antiparallel complementarity map C applied to a trinucleotide t is recalled in Definition 1. As an example, if $t_1 t_2 t_3$ = ACGTGCAAT then the antiparallel complementary sequence studied is ACGTGCAATATTGCACGT. Note that the trinucleotide length of such motifs is even. Classical antiparallel complementary structures are the DNA double helix and the RNA stem. As expected, the two probability curves PbSFM(n) of SF motifs and Pb5′UM(n) of 5′U motifs with antiparallel complementarity are identical (Figure 12). The proof is based on the following property: if $t_i = t_j^f$ with $i > j$ (3′UMF motif) then $C(t_i) = t_{i'} = C(t_j^f) = t_{j'}^{f'}$ with $i' \le j'$ (5′UMF motif) and $f' \ne f$. Furthermore, antiparallel complementarity increases the proportion of SF motifs but decreases the proportion of 5′U motifs, compared to their respective non-constrained motifs.
The “antiparallel complementary” genes have a higher proportion of single-frame motifs compared to the non-complementary genes. Thus, primitive translation associated with a DNA property could generate a greater number of peptides without frameshift errors.
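The construction of the antiparallel complementary sequence, and the reason the SF and 5′U curves coincide, can be illustrated as follows. This is a sketch: `antiparallel_motif` and `ambiguities` are hypothetical names, and the 5′/3′ ambiguity rule (i ≤ j versus i > j) is the one assumed from the property stated above.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def antiparallel_motif(seq):
    """Append the reverse complement: t1 ... tn C(tn) ... C(t1)."""
    return seq + seq.translate(COMPLEMENT)[::-1]

def ambiguities(motif):
    """Return (ambiguous_5, ambiguous_3) for a motif, where a coincidence between
    the reading-frame trinucleotide i and the shifted trinucleotide j is counted
    as 5' if i <= j and as 3' if i > j (assumed rule)."""
    n = len(motif) // 3
    reading = [motif[3 * i:3 * i + 3] for i in range(n)]
    amb5 = amb3 = False
    for j in range(n - 1):
        for f in (1, 2):
            shifted = motif[3 * j + f:3 * j + f + 3]
            for i, t in enumerate(reading):
                if t == shifted:
                    if i <= j:
                        amb5 = True
                    else:
                        amb3 = True
    return amb5, amb3

if __name__ == "__main__":
    print(antiparallel_motif("ACGTGCAAT"))  # ACGTGCAATATTGCACGT
    # In an antiparallel complementary motif, a 3' ambiguity always comes with a
    # 5' ambiguity and vice versa, so "SF" and "5'U" select the same motifs.
    pairs = ("".join(p) for p in product("ACGT", repeat=6))
    print(all(a5 == a3 for a5, a3 in (ambiguities(antiparallel_motif(s)) for s in pairs)))
```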

3.6. Single-Frame and 5′ Unambiguous Motifs with Parallel Complementarity

The single-frame 2n-motifs SF and the 5′ unambiguous 2n-motifs 5′U are now analysed with the following parallel complementary sequence: $t_1 t_2 \cdots t_n D(t_1) D(t_2) \cdots D(t_n)$, where the trinucleotide parallel complementarity map D applied to a trinucleotide t is recalled in Definition 1. As an example, if $t_1 t_2 t_3$ = ACGTGCAAT then the parallel complementary sequence studied is ACGTGCAATTGCACGTTA. Note that the trinucleotide length of such motifs is also even. The two probability curves PbSFM(n) of SF motifs with parallel complementarity and Pb5′UM(n) of 5′U motifs without constraints are superimposable (Figure 13). Parallel complementarity increases the proportions of both SF motifs and 5′U motifs compared to their respective non-constrained motifs.
“Parallel complementary” genes have a slightly higher proportion of single-frame motifs compared to the “antiparallel complementary” genes (compare the magenta curves in Figure 12 and Figure 13). The biological meaning is not yet explained.
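For comparison with the antiparallel case, the parallel complementary sequence only complements each nucleotide and keeps the order. A minimal sketch (the name `parallel_motif` is hypothetical):

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def parallel_motif(seq):
    """Append the complement without reversal: t1 ... tn D(t1) ... D(tn)."""
    return seq + seq.translate(COMPLEMENT)

if __name__ == "__main__":
    print(parallel_motif("ACGTGCAAT"))  # ACGTGCAATTGCACGTTA
```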

3.7. Framing Motifs

There are framing motifs F which are single-frame SF or multiple-frame MF.
Proposition 1.
A framing motif F can be single-frame SF.
Proof. Take the following motif m = GAACTCCCGATATGGCTC. The motif m can be generated by the code X = {ATA, CCG, CTC, GAA, TGG}. By Theorem 1, it is easy to verify that the graph $G(X)$ is acyclic, and thus X is circular. Furthermore, the set of trinucleotides in reading frame is $T = X$, the set of trinucleotides in the shifted frame 1 is $T_1$ = {AAC, CGA, GGC, TAT, TCC} and the set of trinucleotides in the shifted frame 2 is $T_2$ = {ACT, ATG, CCC, GAT, GCT}. We have $T \cap T_1 = \emptyset$ and $T \cap T_2 = \emptyset$. Thus, the motif m is both framing F and single-frame SF.
Proposition 2.
A framing motif F can be multiple-frame MF.
Proof. Take the following motif m = ATTGAGCGAGCCTGTCAG. The motif m can be generated by the code X = {ATT, CAG, CGA, GAG, GCC, TGT}. By Theorem 1, it is easy to verify that the graph $G(X)$ is acyclic, and thus X is circular. Furthermore, we have the trinucleotide sets $T = X$, $T_1$ = {AGC, CCT, GAG, GTC, TTG} and $T_2$ = {AGC, CTG, GCG, TCA, TGA}, leading to $T \cap T_1$ = {GAG} and $T \cap T_2 = \emptyset$. Thus, the motif m is both framing F and multiple-frame MF, precisely unidirectional multiple-frame 5′UMF.
There are single-frame motifs SF or multiple-frame motifs MF which are not framing F.
Proposition 3.
A single-frame motif SF can be non-framing (not F).
Proof. Take the following motif m = GACAAATAAGTGGTATGA. The motif m can be generated by the code X = {AAA, GAC, GTA, GTG, TAA, TGA}. We have the trinucleotide sets $T = X$, $T_1$ = {AAG, AAT, ACA, TAT, TGG} and $T_2$ = {AGT, ATA, ATG, CAA, GGT}, leading to $T \cap T_1 = \emptyset$ and $T \cap T_2 = \emptyset$. However, as X contains the periodic trinucleotide AAA, X is not circular. Thus, the motif m is single-frame SF but not framing F.
Proposition 4.
A multiple-frame motif MF can be non-framing (not F).
Proof. Take the following motif m = GGACCATACATCCGGACT. The motif m can be generated by the code X = {ACT, ATC, CCA, CGG, GGA, TAC}. We have the trinucleotide sets $T = X$, $T_1$ = {ACA, CAT, GAC, GGA, TCC} and $T_2$ = {ACC, ATA, CAT, CCG, GAC}, leading to $T \cap T_1$ = {GGA} and $T \cap T_2 = \emptyset$. However, as X contains the two permuted trinucleotides ACT and TAC, X is not circular. Thus, the motif m is multiple-frame MF, precisely unidirectional multiple-frame 5′UMF, but not framing F.
Genes which are both framing F and single-frame SF retrieve the reading frame and code for a unique peptide, as the shifted frames would lead to a different peptide product.
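The four proofs above can be checked mechanically. The sketch below assumes the usual graph construction for trinucleotide codes (each trinucleotide b1b2b3 contributes the arrows b1 → b2b3 and b1b2 → b3) and tests circularity through acyclicity, as in Theorem 1. As in the proofs, the code X is taken to be the set T of reading-frame trinucleotides of the motif; the function names are hypothetical.

```python
def code_graph(X):
    """Graph G(X): arrows b1 -> b2b3 and b1b2 -> b3 for every trinucleotide of X."""
    edges = {}
    for b1, b2, b3 in X:
        edges.setdefault(b1, set()).add(b2 + b3)
        edges.setdefault(b1 + b2, set()).add(b3)
    return edges

def is_circular(X):
    """X is circular if and only if G(X) has no directed cycle (depth-first search)."""
    edges = code_graph(X)
    done, on_path = set(), set()

    def has_cycle(u):
        if u in on_path:
            return True
        if u in done:
            return False
        on_path.add(u)
        found = any(has_cycle(v) for v in edges.get(u, ()))
        on_path.discard(u)
        done.add(u)
        return found

    return not any(has_cycle(u) for u in edges)

def frame_sets(m):
    """Trinucleotide sets (T, T1, T2) of the motif m in frames 0, 1 and 2."""
    return tuple({m[i:i + 3] for i in range(f, len(m) - 2, 3)} for f in (0, 1, 2))

def describe(m):
    """Return (framing, single_frame) for a motif, taking X = T(m)."""
    T, T1, T2 = frame_sets(m)
    return is_circular(T), T.isdisjoint(T1 | T2)

if __name__ == "__main__":
    for name, m in [("Proposition 1", "GAACTCCCGATATGGCTC"),
                    ("Proposition 2", "ATTGAGCGAGCCTGTCAG"),
                    ("Proposition 3", "GACAAATAAGTGGTATGA"),
                    ("Proposition 4", "GGACCATACATCCGGACT")]:
        print(name, describe(m))
```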

3.8. A New Class of Theoretical Parameters Relating the Circular Codes and Their Circular Code Motifs

The idea is to define a new class of parameters that measure the intensity I(m) with which a motif m of a circular code retrieves the reading frame. Thus, we have to associate information from the circular code theory with information from words (motifs).
In the circular code theory, the most important and the simplest parameter is the length $l_{max}(X)$ of a longest path (maximal arrow-length of a path) in the associated graph $G(X)$ of a circular code X (see Definition 5). Note that the longest path has a finite length as the graph $G(X)$ is acyclic (Theorem 1). The length $l_{max}(X)$ can classify the circular codes, from the strong comma-free codes with $l_{max}(X) = 1$ and the comma-free codes with $l_{max}(X) = 2$ up to the general circular codes with a maximal longest path $l_{max}(X) = 8$ when $X \subseteq B^3$ (i.e., for the trinucleotide circular codes) [29]. It is also related to the reading frame number $n_X$ of X, i.e., the number of nucleotides needed to retrieve the reading frame. This reading frame number $n_X$ can also be used to classify the circular codes, from the strong comma-free codes with $n_X = 2$ nucleotides and the comma-free codes with $n_X = 3$ nucleotides up to the general circular codes with a maximal number $n_X = 13$ nucleotides when $X \subseteq B^3$ [30]. However, determining $n_X$ requires knowing the structure of the longest path, which is one of four cases: $b_1 d_1 \cdots b_k$, $b_1 d_1 \cdots d_k$, $d_1 b_1 \cdots b_k$ and $d_1 b_1 \cdots d_k$, where the nucleotide $b_i \in B$ and the dinucleotide $d_i \in B^2$ for any i (see Definition 5). In summary, for the circular codes $X \subseteq B^3$, the longest path length belongs to the interval $1 \le l_{max}(X) \le 8$ and the reading frame number belongs to the interval $2 \le n_X \le 13$ nucleotides. The definition of the reading frame number $n_X$ can also be generalized to arbitrary sequences, i.e., sequences not entirely consisting of trinucleotides from X [30]. For these two reasons, i.e., the need to know the structure of the longest path and the generalized definition of $n_X$, the parameter $n_X$, mentioned here for the record, will not be considered further.
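The longest path length can be computed by a memoized depth-first traversal of the graph. The sketch below again assumes the graph construction with arrows b1 → b2b3 and b1b2 → b3 for each trinucleotide b1b2b3; `longest_path_length` is a hypothetical name, and the function raises an error when the graph has a cycle, since the longest path is then undefined.

```python
def code_graph(X):
    """Graph G(X): arrows b1 -> b2b3 and b1b2 -> b3 for every trinucleotide of X."""
    edges = {}
    for b1, b2, b3 in X:
        edges.setdefault(b1, set()).add(b2 + b3)
        edges.setdefault(b1 + b2, set()).add(b3)
    return edges

def longest_path_length(X):
    """Number of arrows on a longest path of G(X); ValueError if G(X) has a cycle."""
    edges = code_graph(X)
    memo, on_path = {}, set()

    def depth(u):
        if u in memo:
            return memo[u]
        if u in on_path:
            raise ValueError("G(X) has a cycle, so X is not circular")
        on_path.add(u)
        d = max((1 + depth(v) for v in edges.get(u, ())), default=0)
        on_path.discard(u)
        memo[u] = d
        return d

    return max(depth(u) for u in edges)

if __name__ == "__main__":
    # The two circular codes used in Propositions 1 and 2
    print(longest_path_length({"ATA", "CCG", "CTC", "GAA", "TGG"}))
    print(longest_path_length({"ATT", "CAG", "CGA", "GAG", "GCC", "TGT"}))
```

With this construction, the code of Proposition 1 has a longest path of 2 arrows and the code of Proposition 2 a longest path of 4 arrows (GC → C → GA → G → AG).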
A motif m of a code, circular or not, can be characterized by its length l(m), given here in trinucleotides for convenience, which measures its expansion, and by the cardinality card(T(m)) of the set T(m) (see Notation 2) of trinucleotides (in reading frame f = 0) of m, which measures its variety (complexity). In the case of a motif m of a trinucleotide circular code $X \subseteq B^3$, $1 \le \mathrm{card}(T(m)) \le 20$.
It is important to stress the following condition: $T(m) \subseteq X$ with a trinucleotide circular code $X \subseteq B^3$. The case $T(m) = X$ is associated with a trinucleotide circular code X constructed from the motif m.
A simple parameter measuring the expansion intensity $I_e(m)$ of reading frame retrieval of a circular code motif m can be defined as follows:
$$I_e(m) = \frac{l(m)}{l_{max}(X)} \qquad (5)$$
where $l(m)$, $l(m) \ge 1$, is the trinucleotide length of the motif m and $l_{max}(X)$, $1 \le l_{max}(X) \le 8$, is the length of a longest path in the associated graph $G(X)$ of a trinucleotide circular code $X \subseteq B^3$. Note that $\frac{1}{8} \le I_e(m) \le l(m)$ and if $l(m) \ge l_{max}(X)$ then $1 \le I_e(m) \le l(m)$.
A second parameter measuring both the expansion and variety intensity $I_{ev}(m)$ of a circular code motif m can also be defined as follows:
$$I_{ev}(m) = \mathrm{card}(T(m)) \times I_e(m) \qquad (6)$$
where $I_e(m)$ is defined in Equation (5) and $\mathrm{card}(T(m))$, $1 \le \mathrm{card}(T(m)) \le 20$, is the cardinality of the set T(m) (Notation 2) of trinucleotides (in reading frame f = 0) of m. Note that $\frac{1}{8} \le I_{ev}(m) \le 20\, l(m)$ and if $l(m) \ge l_{max}(X)$ then $1 \le I_{ev}(m) \le 20\, l(m)$. Thus, for the circular code motifs m of a given trinucleotide length l(m), the intensity $I_{ev}(m)$ of reading frame retrieval increases with their cardinality card(T(m)).
For a sequence s containing several circular code motifs m, the formulas (5) and (6) can be expressed as follows:
$$I_e(s) = \sum_{m \in s} I_e(m) = \frac{\sum_{m \in s} l(m)}{l_{max}(X)}$$
with the hypothesis that $l_{max}(X)$ is identical for the motifs m, a realistic case when the motifs m are obtained from the same studied trinucleotide circular code X, and thus:
$$I_{ev}(s) = \sum_{m \in s} I_{ev}(m) = \frac{\sum_{m \in s} \mathrm{card}(T(m)) \times l(m)}{l_{max}(X)}.$$
Note also that the formulas $I_e(s)$ and $I_{ev}(s)$ can be normalized in order to weight the different lengths of sequences s.
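The two intensities and their sums over a sequence are simple enough to compute directly. In this sketch the longest path length is passed in as a parameter `l_max` (a hypothetical name) instead of being derived from the code; exact fractions are used so that the bounds stated above can be checked without rounding. The value 2 used for the motif of Proposition 1 is an illustrative input.

```python
from fractions import Fraction

def reading_frame_trinucleotides(m):
    """Set T(m) of trinucleotides of the motif m in reading frame 0."""
    return {m[i:i + 3] for i in range(0, len(m) - 2, 3)}

def expansion_intensity(m, l_max):
    """I_e(m) = l(m) / l_max(X), with l(m) the length of m in trinucleotides."""
    return Fraction(len(m) // 3, l_max)

def expansion_variety_intensity(m, l_max):
    """I_ev(m) = card(T(m)) * I_e(m)."""
    return len(reading_frame_trinucleotides(m)) * expansion_intensity(m, l_max)

def sequence_intensities(motifs, l_max):
    """(I_e(s), I_ev(s)) for the motifs found in a sequence s, assuming that all
    motifs come from the same circular code and thus share the same l_max."""
    i_e = sum(expansion_intensity(m, l_max) for m in motifs)
    i_ev = sum(expansion_variety_intensity(m, l_max) for m in motifs)
    return i_e, i_ev

if __name__ == "__main__":
    m = "GAACTCCCGATATGGCTC"  # motif of Proposition 1: 6 trinucleotides, 5 distinct
    print(expansion_intensity(m, 2), expansion_variety_intensity(m, 2))
    print(sequence_intensities([m, "GAACTCCCG"], 2))
```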

3.9. MF Dipeptides

The series of multiple-frame motifs MF starts with the dicodons. We will now focus on the MF dipeptides, i.e., the two consecutive amino acids coded by the MF dicodons. The 16 bidirectional multiple-frame dicodons BMF (Table 1) code 16 BMF dipeptides according to the universal genetic code (Table 5). They include the four obvious BMF dipeptides GlyGly (GGGGGG), LysLys (AAAAAA), PhePhe (TTTTTT) and ProPro (CCCCCC). Fifteen of the 20 amino acids are involved in these 16 BMF dipeptides (Table 6): Ala, Arg, Cys, Glu, Gly, His, Ile, Leu, Lys, Phe, Pro, Ser, Thr, Tyr and Val (all except Asn, Asp, Gln, Met and Trp). Each amino acid occurs once in each position of a BMF dipeptide, except Arg, which occurs twice in each position: ArgAla, ArgGlu, AlaArg and GluArg.
The 96 unidirectional multiple-frame dicodons 3′UMF (Table 2) code 83 3′UMF dipeptides and four pairs (stop codon, amino acid): TAGArg, TAGGly, TGAGlu and TerLys, where Ter can be either of the two stop codons TAA and TGA (Table 7). All the 20 amino acids are involved in the 83 3′UMF dipeptides (Table 8). All the 20 amino acids occur in the first position of 3′UMF dipeptides. Five amino acids, Asn, Asp, Gln, Met and Trp, do not occur in the second position; these are the same five amino acids that are not involved in the BMF dipeptides. In the 83 3′UMF dipeptides, Pro and Gly are involved 20 and 19 times, respectively, while Met and Trp are involved only twice and once, respectively.
The 96 unidirectional multiple-frame dicodons 5′UMF (Table 3) code 40 5′UMF dipeptides and three pairs (amino acid, stop codon): IleTer, where Ter can be either of the two stop codons TAA and TAG; PheTer, where Ter can be any of the three stop codons TAA, TAG and TGA; and ValTGA (Table 9). All the 20 amino acids are involved in the 40 5′UMF dipeptides (Table 10). Five amino acids, Asn, Asp, Gln, Met and Trp, do not occur in the first position of 5′UMF dipeptides; these are the same five amino acids that are not involved in the BMF dipeptides. All the 20 amino acids occur in the second position. In the 40 5′UMF dipeptides, the two amino acids Lys and Phe are involved eight times while Asn is involved only once.
The 114 = 121 − 4 − 3 MF dipeptides among 400, i.e., 28.5%, are coded by 208 = 16 + 2 × 96 MF dicodons (BMF, 3′UMF, 5′UMF) among 4096, i.e., 5.1% (Table 11). As a consequence, 286 SF dipeptides, i.e., 71.5%, are coded by 3888 single-frame dicodons SF, i.e., 94.9%. There is also a strong asymmetry between the numbers of MF dipeptides coded in one direction and in the other: 83 3′UMF dipeptides (Table 7) versus 40 5′UMF dipeptides (Table 9). This asymmetry may be related to the gene translation in the 5′-3′ direction, the 3′UMF dicodons having an unambiguous trinucleotide decoding in the 5′-3′ direction.
Five dipeptides, GlyAla, GlyVal, PheSer, ProLeu and ProArg, are the most strongly coded, each by five MF dicodons (Table 12), e.g., GlyAla is coded by one 3′UMF dicodon GGCGCG (Table 7) and four 5′UMF dicodons GGGGCA, GGGGCC, GGGGCG and GGGGCT (Table 9). The SF and MF dipeptides could have particular spatial structures and biological functions in extant and primitive proteins which remain to be identified.
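The translation of dicodons into dipeptides can be reproduced with the standard genetic code. The sketch below builds the code table from the conventional TCAG ordering and checks the BMF dipeptides and the GlyAla example; it does not attempt to reproduce the counts of Tables 7 to 12. The names `CODE`, `THREE_LETTER` and `dipeptide` are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (TTT, TTC, TTA, TTG, TCT, ...); '*' is a stop codon
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys", "Q": "Gln",
    "E": "Glu", "G": "Gly", "H": "His", "I": "Ile", "L": "Leu", "K": "Lys",
    "M": "Met", "F": "Phe", "P": "Pro", "S": "Ser", "T": "Thr", "W": "Trp",
    "Y": "Tyr", "V": "Val", "*": "Ter",
}

def dipeptide(dicodon):
    """Translate a dicodon into a dipeptide written with three-letter codes."""
    return THREE_LETTER[CODE[dicodon[:3]]] + THREE_LETTER[CODE[dicodon[3:]]]

# The 16 BMF dicodons have the form xyxyxy (Table 1)
BMF_DICODONS = [x + y + x + y + x + y for x in "ACGT" for y in "ACGT"]
BMF_DIPEPTIDES = {dipeptide(d) for d in BMF_DICODONS}

if __name__ == "__main__":
    print(sorted(BMF_DIPEPTIDES))
    print({d: dipeptide(d) for d in ("GGCGCG", "GGGGCA", "GGGGCC", "GGGGCG", "GGGGCT")})
```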

4. Discussion

For the first time to our knowledge, new definitions of motifs in genes are presented. The single-frame motifs SF (unambiguous trinucleotide decoding in the two 5′-3′ and 3′-5′ directions) and the multiple-frame motifs MF (ambiguous trinucleotide decoding in at least one direction) form a partition of genes. Several classes of MF motifs are defined and analysed: (i) unidirectional multiple-frame motifs 3′UMF (ambiguous trinucleotide decoding in the 3′-5′ direction only); (ii) unidirectional multiple-frame motifs 5′UMF (ambiguous trinucleotide decoding in the 5′-3′ direction only); and (iii) bidirectional multiple-frame motifs BMF (ambiguous trinucleotide decoding in the two 5′-3′ and 3′-5′ directions). The distributions of the single-frame motifs SF and the 5′ unambiguous motifs 5′U (unambiguous trinucleotide decoding in the 5′-3′ direction only) are studied without and with constraints.
The proportion of SF motifs drastically decreases with their trinucleotide length. The SF motifs become absent (< 1%) when their length is ≥ 14 trinucleotides, and the number of MF motifs already exceeds the number of SF motifs when their length is ≥ 6 trinucleotides. As expected, the proportion of 5′U motifs decreases more slowly than that of SF motifs. The 5′U motifs become absent (< 1%) when their length is ≥ 20 trinucleotides. Thus, with the 5′U motifs, the trinucleotide decoding is extended by 20 − 14 = 6 trinucleotides.
The proportions of SF and 5′U motifs with initiation and stop codons are lower than those of their respective non-constrained motifs. In contrast, their proportions in motifs without the periodic codons {AAA, CCC, GGG, TTT} are higher than those of their respective non-constrained motifs. The proportions of SF and 5′U motifs with antiparallel complementarity are identical. Antiparallel complementarity increases the proportion of SF motifs but decreases the proportion of 5′U motifs, compared to their respective non-constrained motifs. The proportions of SF motifs with parallel complementarity and 5′U motifs without constraints follow a similar distribution. Finally, parallel complementarity increases the proportions of both SF motifs and 5′U motifs compared to their respective non-constrained motifs. Taken together, these results suggest that the complementarity property involved in the antiparallel (DNA double helix, RNA stem) and parallel sequences could also be fundamental for coding genes with unambiguous trinucleotide decoding, either strictly in the two 5′-3′ and 3′-5′ directions (SF motifs) or conserved in the 5′-3′ direction but relaxed or lost in the 3′-5′ direction (5′U motifs).
The single-frame motifs SF with a property of trinucleotide decoding and the framing motifs F with a property of reading frame decoding could have operated in the primitive soup for constructing the modern genetic code and the extant genes [31]. They could have been involved in the stage without anticodon-amino acid interactions to form peptides from prebiotic amino acids [32]. They could also have been related to the Implicated Site Nucleotides (ISN) of RNA interacting with the amino acids at the primitive step of life (review in [33]). According to a great number of biological experiments, the ISN structure contains nucleotides in fixed and variable positions, as well as an important trinucleotide for interacting with the amino acid (see, e.g., the recent review in [34]). However, the general structure of the aptamers binding amino acids, in particular its nucleotide length, its amino acid binding loop and its nucleotide position, is still an open problem. Similar arguments could hold for the ribonucleopeptides which could be implicated in a primitive T box riboswitch functioning as an aminoacyl-tRNA synthetase and a peptidyl-transferase ribozyme [35]. The single-frame motifs SF and the framing motifs F, with their properties to decode the trinucleotides and the reading frame, could have been necessary for the evolutionary construction of the modern genetic code.

Funding

The author received no funding for this study.

Acknowledgment

I thank Denise Marie Besch for her support.

Conflicts of Interest

The author declares no competing interests.

Abbreviations

SF      single-frame motif (unambiguous trinucleotide decoding in the two 5′-3′ and 3′-5′ directions)
MF      multiple-frame motif
UMF     unidirectional multiple-frame motif
3′UMF   unidirectional multiple-frame motif (ambiguous trinucleotide decoding in the 3′-5′ direction only)
5′UMF   unidirectional multiple-frame motif (ambiguous trinucleotide decoding in the 5′-3′ direction only)
BMF     bidirectional multiple-frame motif (ambiguous trinucleotide decoding in the two 5′-3′ and 3′-5′ directions)
5′U     5′ unambiguous motif (unambiguous trinucleotide decoding in the 5′-3′ direction only)
F       framing motif (also called circular code motif)

References

  1. Gamow, G. Possible relation between deoxyribonucleic acid and protein structures. Nature 1954, 173, 318.
  2. Crick, F.H.C.; Griffith, J.S.; Orgel, L.E. Codes without commas. Proc. Natl. Acad. Sci. USA 1957, 43, 416–421.
  3. Nirenberg, M.W.; Matthaei, J.H. The dependence of cell-free protein synthesis in E. coli upon naturally occurring or synthetic polyribonucleotides. Proc. Natl. Acad. Sci. USA 1961, 47, 1588–1602.
  4. Crick, F.H.C.; Barnett, L.; Brenner, S.; Watts-Tobin, R.J. General nature of the genetic code for proteins. Nature 1961, 192, 1227–1232.
  5. Khorana, H.G.; Büchi, H.; Ghosh, H.; Gupta, N.; Jacob, T.M.; Kössel, H.; Morgan, R.; Narang, S.A.; Ohtsuka, E.; Wells, R.D. Polynucleotide synthesis and the genetic code. Cold Spring Harb. Symp. Quant. Biol. 1966, 31, 39–49.
  6. Nirenberg, M.; Caskey, T.; Marshall, R.; Brimacombe, R.; Kellogg, D.; Doctor, B.; Hatfield, D.; Levin, J.; Rottman, F.; Pestka, S.; et al. The RNA code and protein synthesis. Cold Spring Harb. Symp. Quant. Biol. 1966, 31, 11–24.
  7. Salas, M.; Smith, M.A.; Stanley, W.M.; Wahba, A.J.; Ochoa, S. Direction of reading of the genetic message. J. Biol. Chem. 1965, 240, 3988–3995.
  8. Arquès, D.G.; Michel, C.J. A complementary circular code in the protein coding genes. J. Theor. Biol. 1996, 182, 45–58.
  9. Michel, C.J. The maximal C3 self-complementary trinucleotide circular code X in genes of bacteria, eukaryotes, plasmids and viruses. J. Theor. Biol. 2015, 380, 156–177.
  10. Michel, C.J. The maximal C3 self-complementary trinucleotide circular code X in genes of bacteria, archaea, eukaryotes, plasmids and viruses. Life 2017, 7, 20.
  11. Michel, C.J. A genetic scale of reading frame coding. J. Theor. Biol. 2014, 355, 83–94.
  12. Michel, C.J. An extended genetic scale of reading frame coding. J. Theor. Biol. 2015, 365, 164–174.
  13. Dinman, J.D. Programmed ribosomal frameshifting goes beyond viruses. Microbe 2006, 1, 521–527.
  14. Farabaugh, P.J. Programmed translational frameshifting. Annu. Rev. Genet. 1996, 30, 507–528.
  15. Caliskan, N.; Peske, F.; Rodnina, M.V. Changed in translation: mRNA recoding by −1 programmed ribosomal frameshifting. Trends Biochem. Sci. 2015, 40, 265–274.
  16. Napthine, S.; Ling, R.; Finch, L.K.; Jones, J.D.; Bell, S.; Brierley, I.; Firth, A.E. Protein-directed ribosomal frameshifting temporally regulates gene expression. Nat. Commun. 2017, 8, 15582.
  17. Wang, R.; Xiong, J.; Wang, W.; Miao, W.; Liang, A. High frequency of +1 programmed ribosomal frameshifting in Euplotes octocarinatus. Sci. Rep. 2016, 6, 21139.
  18. El Houmami, N.; Seligmann, H. Evolution of nucleotide punctuation marks: From structural to linear signals. Front. Genet. 2017, 8, 36.
  19. Seligmann, H. Codon expansion and systematic transcriptional deletions produce tetra-, pentacoded mitochondrial peptides. J. Theor. Biol. 2015, 387, 154–165.
  20. Baranov, P.V.; Atkins, J.F.; Yordanova, M.M. Augmented genetic decoding: Global, local and temporal alterations of decoding processes and codon meaning. Nat. Rev. Genet. 2015, 16, 517–529.
  21. Michel, C.J. Circular code motifs in transfer and 16S ribosomal RNAs: A possible translation code in genes. Comput. Biol. Chem. 2012, 37, 24–37.
  22. Michel, C.J. Circular code motifs in transfer RNAs. Comput. Biol. Chem. 2013, 45, 17–29.
  23. Michel, C.J. A 2006 review of circular codes in genes. Comput. Math. Appl. 2008, 55, 984–988.
  24. Fimmel, E.; Strüngmann, L. Mathematical fundamentals for the noise immunity of the genetic code. Biosystems 2018, 164, 186–198.
  25. Luisi, P.L. Prebiotic metabolic networks? Mol. Syst. Biol. 2014, 10, 729.
  26. Ying, J.; Lin, R.; Xu, P.; Wu, Y.; Liu, Y.; Zhao, Y. Prebiotic formation of cyclic dipeptides under potentially early Earth conditions. Sci. Rep. 2018, 8, 936.
  27. Shu, W.; Yu, Y.; Chen, S.; Yan, X.; Liu, Y.; Zhao, Y. Selective formation of Ser-His dipeptide via phosphorus activation. Orig. Life Evol. Biospheres 2018, 48, 213–222.
  28. Wieczorek, R.; Adamala, K.; Gasperi, T.; Polticelli, F.; Stano, P. Small and random peptides: An unexplored reservoir of potentially functional primitive organocatalysts. The case of Seryl-Histidine. Life 2017, 7, 19.
  29. Fimmel, E.; Michel, C.J.; Strüngmann, L. n-Nucleotide circular codes in graph theory. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2016, 374, 20150058.
  30. Fimmel, E.; Michel, C.J.; Starman, M.; Strüngmann, L. Self-complementary circular codes in coding theory. Theory Biosci. 2018, 137, 51–65.
  31. Kun, Á.; Radványi, Á. The evolution of the genetic code: Impasses and challenges. Biosystems 2018, 164, 217–225.
  32. Johnson, D.B.F.; Wang, L. Imprints of the genetic code in the ribosome. Proc. Natl. Acad. Sci. USA 2010, 107, 8298–8303.
  33. Yarus, M. The genetic code and RNA-amino acid affinities. Life 2017, 7, 13.
  34. Zagrovic, B.; Bartonek, L.; Polyansky, A.A. RNA-protein interactions in an unstructured context. FEBS Lett. 2018, 592, 2901–2916.
  35. Saad, N.Y. A ribonucleopeptide world at the origin of life. J. Syst. Evol. 2018, 56, 1–13.
Figure 1. Retrieval of the reading frame of the word w = ...AGGTAATTACCAG... constructed with the circular code X (1). Among the three possible factorizations $w_0$, $w_1$ and $w_2$, only one factorization $w_1$ into trinucleotides of X is possible, leading to ...A,GGT,AAT,TAC,CAG,... (the comma showing the reading frame). Thus, the first letter A of w is the third letter of a trinucleotide of X and the reading frame of the word is retrieved.
Figure 2. (associated with Example 3). The dicodon AAACAA is single-frame SF.
Figure 3. (associated with Example 4). The dicodon AACACA is unidirectional multiple-frame 3′UMF.
Figure 4. (associated with Example 5). The dicodon AAAAAC is unidirectional multiple-frame 5′UMF.
Figure 5. (associated with Example 6). The dicodon ACACAA is unidirectional multiple-frame 5′UMF.
Figure 6. (associated with Example 7). The dicodon AAAAAA is bidirectional multiple-frame BMF.
Figure 7. (associated with Example 8). The dicodon ACACAC is bidirectional multiple-frame BMF.
Figure 8. Decreasing probability PbSFM(n) (Equation (3)) of single-frame n-motifs SF (blue curve) and increasing probability 1 − PbSFM(n) of multiple-frame n-motifs MF (red curve) by varying the length n between 1 and 20 trinucleotides.
Figure 9. Decreasing probability PbSFM(n) (Equation (3)) of single-frame n-motifs SF (blue curve from Figure 8) and decreasing probability Pb5′UM(n) (Equation (4)) of 5′ unambiguous n-motifs 5′U (cyan curve) by varying the length n between 1 and 20 trinucleotides.
Figure 10. Decreasing probability PbSFM(n) (Equation (3)) of single-frame n-motifs SF (blue curve from Figure 8) and decreasing probability Pb5′UM(n) (Equation (4)) of 5′ unambiguous n-motifs 5′U (cyan curve from Figure 9) by varying the length n between 1 and 10 trinucleotides. With initiation and stop codons, decreasing probability PbSFM(n) of n-motifs SF (magenta curve) and decreasing probability Pb5′UM(n) of n-motifs 5′U (orange curve) by varying the length n between 2 and 10 trinucleotides.
Figure 11. Decreasing probability PbSFM(n) (Equation (3)) of single-frame n-motifs SF (blue curve from Figure 8) and decreasing probability Pb5′UM(n) (Equation (4)) of 5′ unambiguous n-motifs 5′U (cyan curve from Figure 9) by varying the length n between 1 and 10 trinucleotides. Without periodic codons {AAA, CCC, GGG, TTT}, decreasing probability PbSFM(n) of n-motifs SF (magenta curve) and decreasing probability Pb5′UM(n) of n-motifs 5′U (orange curve) by varying the length n between 1 and 10 trinucleotides.
Figure 12. Decreasing probability PbSFM(n) (Equation (3)) of single-frame n-motifs SF (blue curve from Figure 8) and decreasing probability Pb5′UM(n) (Equation (4)) of 5′ unambiguous n-motifs 5′U (cyan curve from Figure 9) by varying the length n between 1 and 14 trinucleotides. With antiparallel complementarity, decreasing probabilities PbSFM(n) and Pb5′UM(n) of 2n-motifs SF and 5′U (two identical curves in magenta) by varying the length n between 1 and 7 trinucleotides.
Figure 13. Decreasing probability PbSFM(n) (Equation (3)) of single-frame n-motifs SF (blue curve from Figure 8) and decreasing probability Pb5′UM(n) (Equation (4)) of 5′ unambiguous n-motifs 5′U (cyan curve from Figure 9) by varying the length n between 1 and 14 trinucleotides. With parallel complementarity, decreasing probability PbSFM(n) of 2n-motifs SF (magenta curve) and decreasing probability Pb5′UM(n) of 2n-motifs 5′U (orange curve) by varying the length n between 1 and 7 trinucleotides.
Table 1. The 16 bidirectional multiple-frame dicodons BMF (Definition 13).

| Dicodon | Frame 1 | Frame 2 | Dicodon | Frame 1 | Frame 2 | Dicodon | Frame 1 | Frame 2 | Dicodon | Frame 1 | Frame 2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AAAAAA | AAA | AAA | CACACA | ACA | CAC | GAGAGA | AGA | GAG | TATATA | ATA | TAT |
| ACACAC | CAC | ACA | CCCCCC | CCC | CCC | GCGCGC | CGC | GCG | TCTCTC | CTC | TCT |
| AGAGAG | GAG | AGA | CGCGCG | GCG | CGC | GGGGGG | GGG | GGG | TGTGTG | GTG | TGT |
| ATATAT | TAT | ATA | CTCTCT | TCT | CTC | GTGTGT | TGT | GTG | TTTTTT | TTT | TTT |
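Table 2. The 96 unidirectional multiple-frame dicodons 3UMF (Definition 11), N being any nucleotide. An empty cell means that no trinucleotide is listed for that frame.

| Dicodon | Frame 1 | Frame 2 | Dicodon | Frame 1 | Frame 2 | Dicodon | Frame 1 | Frame 2 | Dicodon | Frame 1 | Frame 2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CAAAAA | AAA | AAA | CCACAC | CAC | | CGAGAG | GAG | | CTATAT | TAT | |
| GAAAAA | AAA | AAA | GCACAC | CAC | | GGAGAG | GAG | | GTATAT | TAT | |
| TAAAAA | AAA | AAA | TCACAC | CAC | | TGAGAG | GAG | | TTATAT | TAT | |
| NCAAAA | | AAA | ACCCCC | CCC | CCC | AGCGCG | GCG | | ATCTCT | TCT | |
| NGAAAA | | AAA | GCCCCC | CCC | CCC | GGCGCG | GCG | | GTCTCT | TCT | |
| NTAAAA | | AAA | TCCCCC | CCC | CCC | TGCGCG | GCG | | TTCTCT | TCT | |
| AACACA | ACA | | NACCCC | | CCC | AGGGGG | GGG | GGG | ATGTGT | TGT | |
| GACACA | ACA | | NGCCCC | | CCC | CGGGGG | GGG | GGG | CTGTGT | TGT | |
| TACACA | ACA | | NTCCCC | | CCC | TGGGGG | GGG | GGG | TTGTGT | TGT | |
| AAGAGA | AGA | | ACGCGC | CGC | | NAGGGG | | GGG | ATTTTT | TTT | TTT |
| CAGAGA | AGA | | CCGCGC | CGC | | NCGGGG | | GGG | CTTTTT | TTT | TTT |
| TAGAGA | AGA | | TCGCGC | CGC | | NTGGGG | | GGG | GTTTTT | TTT | TTT |
| AATATA | ATA | | ACTCTC | CTC | | AGTGTG | GTG | | NATTTT | | TTT |
| CATATA | ATA | | CCTCTC | CTC | | CGTGTG | GTG | | NCTTTT | | TTT |
| GATATA | ATA | | GCTCTC | CTC | | GGTGTG | GTG | | NGTTTT | | TTT |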
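Table 3. The 96 unidirectional multiple-frame dicodons 5UMF (Definition 12), N being any nucleotide. An empty cell means that no trinucleotide is listed for that frame.

| Dicodon | Frame 1 | Frame 2 | Dicodon | Frame 1 | Frame 2 | Dicodon | Frame 1 | Frame 2 | Dicodon | Frame 1 | Frame 2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AAAAAC | AAA | AAA | CACACC | | CAC | GAGAGC | | GAG | TATATC | | TAT |
| AAAAAG | AAA | AAA | CACACG | | CAC | GAGAGG | | GAG | TATATG | | TAT |
| AAAAAT | AAA | AAA | CACACT | | CAC | GAGAGT | | GAG | TATATT | | TAT |
| AAAACN | AAA | | CCCCCA | CCC | CCC | GCGCGA | | GCG | TCTCTA | | TCT |
| AAAAGN | AAA | | CCCCCG | CCC | CCC | GCGCGG | | GCG | TCTCTG | | TCT |
| AAAATN | AAA | | CCCCCT | CCC | CCC | GCGCGT | | GCG | TCTCTT | | TCT |
| ACACAA | | ACA | CCCCAN | CCC | | GGGGGA | GGG | GGG | TGTGTA | | TGT |
| ACACAG | | ACA | CCCCGN | CCC | | GGGGGC | GGG | GGG | TGTGTC | | TGT |
| ACACAT | | ACA | CCCCTN | CCC | | GGGGGT | GGG | GGG | TGTGTT | | TGT |
| AGAGAA | | AGA | CGCGCA | | CGC | GGGGAN | GGG | | TTTTTA | TTT | TTT |
| AGAGAC | | AGA | CGCGCC | | CGC | GGGGCN | GGG | | TTTTTC | TTT | TTT |
| AGAGAT | | AGA | CGCGCT | | CGC | GGGGTN | GGG | | TTTTTG | TTT | TTT |
| ATATAA | | ATA | CTCTCA | | CTC | GTGTGA | | GTG | TTTTAN | TTT | |
| ATATAC | | ATA | CTCTCC | | CTC | GTGTGC | | GTG | TTTTCN | TTT | |
| ATATAG | | ATA | CTCTCG | | CTC | GTGTGG | | GTG | TTTTGN | TTT | |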
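The three classes in Tables 1 to 3 can be reproduced by brute force. The following is a minimal sketch, not the author's program: Definitions 11 to 13 are not restated in this section, so the rule used here is inferred from the tables. It assumes that a dicodon is 5UMF when only its first codon reappears in frame 1 or 2, 3UMF when only its second codon does, and BMF when both do. The names `classify` and `counts` are illustrative.

```python
from itertools import product

BASES = "ACGT"

def classify(dicodon: str) -> str:
    """Return 'SF', 'BMF', '5UMF' or '3UMF' for a 6-nucleotide motif.

    Assumed rule (inferred from Tables 1-3, not quoted from the paper):
    compare the two frame-0 codons with the frame-1 and frame-2 trinucleotides.
    """
    c1, c2 = dicodon[0:3], dicodon[3:6]       # trinucleotides in frame 0
    shifted = {dicodon[1:4], dicodon[2:5]}    # trinucleotides in frames 1 and 2
    hit5, hit3 = c1 in shifted, c2 in shifted
    if hit5 and hit3:
        return "BMF"
    if hit5:
        return "5UMF"
    if hit3:
        return "3UMF"
    return "SF"

# Tally the class of each of the 4^6 = 4096 dicodons.
counts = {}
for letters in product(BASES, repeat=6):
    label = classify("".join(letters))
    counts[label] = counts.get(label, 0) + 1

print(counts)
```

Under this assumed rule the tallies agree with the table sizes (16, 96 and 96), and the dicodon AAACAA quoted in the abstract is classified as SF.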
Table 4. Probability PbSFM(n) (%) of single-frame n-motifs SF for n ∈ {1, …, 6}. Exact and simulated values of PbSFM(n) are identical at 10^−3.

| n | Number of n-motifs (64^n) | PbSFM(n) (%), exact values | PbSFM(n) (%), simulated values |
|---|---|---|---|
| 1 | 64 | 100 | |
| 2 | 4096 | 94.92 | 94.93 |
| 3 | 262,144 | 85.22 | 85.20 |
| 4 | 16,777,216 | 72.35 | 72.37 |
| 5 | 1,073,741,824 | 58.07 | 58.08 |
| 6 | 68,719,476,736 | 44.07 | 44.08 |
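The first rows of Table 4 can be checked by direct enumeration. The sketch below is an illustration under stated assumptions, not the method of Equation (3): it assumes that the 64^n motifs are equiprobable and that a motif is single-frame when no frame-0 trinucleotide occurs anywhere in frame 1 or 2 of the same motif. The function names `is_single_frame`, `exact_probability` and `simulated_probability` are hypothetical.

```python
import random
from itertools import product

def is_single_frame(motif: str) -> bool:
    """True if no frame-0 trinucleotide also occurs in frame 1 or frame 2."""
    frame0 = {motif[i:i + 3] for i in range(0, len(motif), 3)}
    shifted = {motif[i:i + 3] for i in range(len(motif) - 2) if i % 3 != 0}
    return frame0.isdisjoint(shifted)

def exact_probability(n: int) -> float:
    """Proportion of SF motifs among all 64^n motifs (practical for small n only)."""
    hits = sum(is_single_frame("".join(m)) for m in product("ACGT", repeat=3 * n))
    return hits / 4 ** (3 * n)

def simulated_probability(n: int, samples: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate from uniformly drawn motifs (assumed sampling scheme)."""
    rng = random.Random(seed)
    hits = sum(
        is_single_frame("".join(rng.choice("ACGT") for _ in range(3 * n)))
        for _ in range(samples)
    )
    return hits / samples

for n in (1, 2):
    print(n, round(100 * exact_probability(n), 2))
```

For n = 2 the enumeration gives 3888/4096, which rounds to the 94.92% of Table 4. Larger n can be compared with the table in the same way, although the enumeration grows as 64^n and sampling becomes the practical option.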
Table 5. The 16 BMF dipeptides coded by the 16 bidirectional multiple-frame dicodons BMF (Definition 13, Table 1).

| Dipeptide | Name | Dicodon | Dipeptide | Name | Dicodon | Dipeptide | Name | Dicodon | Dipeptide | Name | Dicodon |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AR | AlaArg | GCGCGC | GG | GlyGly | GGGGGG | LS | LeuSer | CTCTCT | SL | SerLeu | TCTCTC |
| CV | CysVal | TGTGTG | HT | HisThr | CACACA | PP | ProPro | CCCCCC | TH | ThrHis | ACACAC |
| ER | GluArg | GAGAGA | IY | IleTyr | ATATAT | RA | ArgAla | CGCGCG | VC | ValCys | GTGTGT |
| FF | PhePhe | TTTTTT | KK | LysLys | AAAAAA | RE | ArgGlu | AGAGAG | YI | TyrIle | TATATA |
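Table 6. Occurrence number of the 15 amino acids in the 1st and 2nd positions of the 16 BMF dipeptides (Table 5).

| Site | A (Ala) | C (Cys) | E (Glu) | F (Phe) | G (Gly) | H (His) | I (Ile) | K (Lys) | L (Leu) | P (Pro) | R (Arg) | S (Ser) | T (Thr) | V (Val) | Y (Tyr) | Sum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1st site | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 16 |
| 2nd site | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 16 |
| Sum | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 4 | 2 | 2 | 2 | 2 | 32 |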
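Tables 5 and 6 follow from translating the 16 BMF dicodons with the standard genetic code. A minimal sketch, assuming the standard code (written on the DNA alphabet, with `*` marking a stop codon) and the XYXYXY shape that all 16 dicodons of Table 1 share; the names `CODE`, `bmf`, `dipeptides`, `first` and `second` are illustrative.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# The 16 BMF dicodons of Table 1 all have the form XYXYXY.
bmf = [x + y + x + y + x + y for x in "ACGT" for y in "ACGT"]

# One-letter dipeptide coded by each dicodon (first codon, then second codon).
dipeptides = sorted(CODE[d[:3]] + CODE[d[3:]] for d in bmf)

# Occurrences of each amino acid in the 1st and 2nd positions.
first = Counter(p[0] for p in dipeptides)
second = Counter(p[1] for p in dipeptides)

print(dipeptides)
print(first["R"], second["R"])
```

Arg is the only amino acid counted twice at each site because two BMF dicodons start with an Arg codon (AGAGAG, CGCGCG) and two end with one (GAGAGA, GCGCGC).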
Table 7. The 83 3UMF dipeptides and the four pairs (stop codon, amino acid) coded by the 96 unidirectional multiple-frame dicodons 3UMF (Definition 11, Table 2).

| Dipeptide | Name | Dicodons | Dipeptide | Name | Dicodons | Dipeptide | Name | Dicodons |
|---|---|---|---|---|---|---|---|---|
| AF | AlaPhe | GCTTTT | IS | IleSer | ATCTCT | RV | ArgVal | CGTGTG |
| AG | AlaGly | GCGGGG | KG | LysGly | AAGGGG | SA | SerAla | AGCGCG |
| AH | AlaHis | GCACAC | KR | LysArg | AAGAGA | SF | SerPhe | AGTTTT, TCTTTT |
| AK | AlaLys | GCAAAA | LC | LeuCys | CTGTGT, TTGTGT | SG | SerGly | TCGGGG |
| AL | AlaLeu | GCTCTC | LF | LeuPhe | CTTTTT | SH | SerHis | TCACAC |
| AP | AlaPro | GCCCCC | LG | LeuGly | CTGGGG, TTGGGG | SK | SerLys | TCAAAA |
| CA | CysAla | TGCGCG | LK | LeuLys | CTAAAA, TTAAAA | SP | SerPro | AGCCCC, TCCCCC |
| CF | CysPhe | TGTTTT | LP | LeuPro | CTCCCC | SR | SerArg | TCGCGC |
| CP | CysPro | TGCCCC | LY | LeuTyr | CTATAT, TTATAT | SV | SerVal | AGTGTG |
| DF | AspPhe | GATTTT | MC | MetCys | ATGTGT | TerE | TerGlu | TGAGAG |
| DI | AspIle | GATATA | MG | MetGly | ATGGGG | TerG | TerGly | TAGGGG |
| DP | AspPro | GACCCC | NF | AsnPhe | AATTTT | TerK | TerLys | TAAAAA, TGAAAA |
| DT | AspThr | GACACA | NI | AsnIle | AATATA | TerR | TerArg | TAGAGA |
| EG | GluGly | GAGGGG | NP | AsnPro | AACCCC | TF | ThrPhe | ACTTTT |
| EK | GluLys | GAAAAA | NT | AsnThr | AACACA | TG | ThrGly | ACGGGG |
| FP | PhePro | TTCCCC | PF | ProPhe | CCTTTT | TK | ThrLys | ACAAAA |
| FS | PheSer | TTCTCT | PG | ProGly | CCGGGG | TL | ThrLeu | ACTCTC |
| GA | GlyAla | GGCGCG | PH | ProHis | CCACAC | TP | ThrPro | ACCCCC |
| GE | GlyGlu | GGAGAG | PK | ProLys | CCAAAA | TR | ThrArg | ACGCGC |
| GF | GlyPhe | GGTTTT | PL | ProLeu | CCTCTC | VF | ValPhe | GTTTTT |
| GK | GlyLys | GGAAAA | PR | ProArg | CCGCGC | VG | ValGly | GTGGGG |
| GP | GlyPro | GGCCCC | QG | GlnGly | CAGGGG | VK | ValLys | GTAAAA |
| GV | GlyVal | GGTGTG | QK | GlnLys | CAAAAA | VP | ValPro | GTCCCC |
| HF | HisPhe | CATTTT | QR | GlnArg | CAGAGA | VS | ValSer | GTCTCT |
| HI | HisIle | CATATA | RE | ArgGlu | CGAGAG | VY | ValTyr | GTATAT |
| HP | HisPro | CACCCC | RF | ArgPhe | CGTTTT | WG | TrpGly | TGGGGG |
| IF | IlePhe | ATTTTT | RG | ArgGly | AGGGGG, CGGGGG | YF | TyrPhe | TATTTT |
| IK | IleLys | ATAAAA | RK | ArgLys | AGAAAA, CGAAAA | YP | TyrPro | TACCCC |
| IP | IlePro | ATCCCC | RP | ArgPro | CGCCCC | YT | TyrThr | TACACA |
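Table 8. Occurrence number of the 20 amino acids in the first and second positions of the 83 3UMF dipeptides and the four pairs (stop codon, amino acid) (Table 7).

| Site | A (Ala) | C (Cys) | D (Asp) | E (Glu) | F (Phe) | G (Gly) | H (His) | I (Ile) | K (Lys) | L (Leu) | M (Met) | N (Asn) | P (Pro) | Q (Gln) | R (Arg) | S (Ser) | T (Thr) | V (Val) | W (Trp) | Y (Tyr) | Ter | Sum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1st site | 6 | 3 | 4 | 2 | 2 | 6 | 3 | 4 | 2 | 6 | 2 | 4 | 6 | 3 | 6 | 8 | 6 | 6 | 1 | 3 | 4 | 87 |
| 2nd site | 3 | 2 | 0 | 3 | 14 | 13 | 3 | 3 | 12 | 3 | 0 | 0 | 14 | 0 | 6 | 3 | 3 | 3 | 0 | 2 | 0 | 87 |
| Sum | 9 | 5 | 4 | 5 | 16 | 19 | 6 | 7 | 14 | 9 | 2 | 4 | 20 | 3 | 12 | 11 | 9 | 9 | 1 | 5 | 4 | 174 |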
Table 9. The 40 5UMF dipeptides and the three pairs (amino acid, stop codon) coded by the 96 unidirectional multiple-frame dicodons 5UMF (Definition 12, Table 3).

| Dipeptide | Name | Dicodons | Dipeptide | Name | Dicodons |
|---|---|---|---|---|---|
| AR | AlaArg | GCGCGA, GCGCGG, GCGCGT | KN | LysAsn | AAAAAC, AAAAAT |
| CV | CysVal | TGTGTA, TGTGTC, TGTGTT | KR | LysArg | AAAAGA, AAAAGG |
| ER | GluArg | GAGAGG | KS | LysSer | AAAAGC, AAAAGT |
| ES | GluSer | GAGAGC, GAGAGT | KT | LysThr | AAAACA, AAAACC, AAAACG, AAAACT |
| FC | PheCys | TTTTGC, TTTTGT | LS | LeuSer | CTCTCA, CTCTCC, CTCTCG |
| FF | PhePhe | TTTTTC | PH | ProHis | CCCCAC, CCCCAT |
| FL | PheLeu | TTTTTA, TTTTTG | PL | ProLeu | CCCCTA, CCCCTC, CCCCTG, CCCCTT |
| FS | PheSer | TTTTCA, TTTTCC, TTTTCG, TTTTCT | PP | ProPro | CCCCCA, CCCCCG, CCCCCT |
| FTer | PheTer | TTTTAA, TTTTAG, TTTTGA | PQ | ProGln | CCCCAA, CCCCAG |
| FW | PheTrp | TTTTGG | PR | ProArg | CCCCGA, CCCCGC, CCCCGG, CCCCGT |
| FY | PheTyr | TTTTAC, TTTTAT | RA | ArgAla | CGCGCA, CGCGCC, CGCGCT |
| GA | GlyAla | GGGGCA, GGGGCC, GGGGCG, GGGGCT | RD | ArgAsp | AGAGAC, AGAGAT |
| GD | GlyAsp | GGGGAC, GGGGAT | RE | ArgGlu | AGAGAA |
| GE | GlyGlu | GGGGAA, GGGGAG | SL | SerLeu | TCTCTA, TCTCTG, TCTCTT |
| GG | GlyGly | GGGGGA, GGGGGC, GGGGGT | TH | ThrHis | ACACAT |
| GV | GlyVal | GGGGTA, GGGGTC, GGGGTG, GGGGTT | TQ | ThrGln | ACACAA, ACACAG |
| HT | HisThr | CACACC, CACACG, CACACT | VC | ValCys | GTGTGC |
| ITer | IleTer | ATATAA, ATATAG | VTer | ValTer | GTGTGA |
| IY | IleTyr | ATATAC | VW | ValTrp | GTGTGG |
| KI | LysIle | AAAATA, AAAATC, AAAATT | YI | TyrIle | TATATC, TATATT |
| KK | LysLys | AAAAAG | YM | TyrMet | TATATG |
| KM | LysMet | AAAATG | | | |
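Table 10. Occurrence number of the 20 amino acids in the first and second positions of the 40 5UMF dipeptides and the three pairs (amino acid, stop codon) (Table 9).

| Site | A (Ala) | C (Cys) | D (Asp) | E (Glu) | F (Phe) | G (Gly) | H (His) | I (Ile) | K (Lys) | L (Leu) | M (Met) | N (Asn) | P (Pro) | Q (Gln) | R (Arg) | S (Ser) | T (Thr) | V (Val) | W (Trp) | Y (Tyr) | Ter | Sum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1st site | 1 | 1 | 0 | 2 | 7 | 5 | 1 | 2 | 7 | 1 | 0 | 0 | 5 | 0 | 3 | 1 | 2 | 3 | 0 | 2 | 0 | 43 |
| 2nd site | 2 | 2 | 2 | 2 | 1 | 1 | 2 | 2 | 1 | 3 | 2 | 1 | 1 | 2 | 4 | 4 | 2 | 2 | 2 | 2 | 3 | 43 |
| Sum | 3 | 3 | 2 | 4 | 8 | 6 | 3 | 4 | 8 | 4 | 2 | 1 | 6 | 2 | 7 | 5 | 4 | 5 | 2 | 4 | 3 | 86 |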
Table 11. Multi-frame dipeptide boolean matrix. The 114 = 121 − 4 − 3 MF dipeptides, the four pairs (stop codon, amino acid) and the three pairs (amino acid, stop codon) coded by the 208 = 16 + 2 × 96 multiple-frame dicodons BMF (Definition 13, Table 1), 3UMF (Definition 11, Table 2) and 5UMF (Definition 12, Table 3). The rows and columns are associated with the first and second amino acid, respectively, in the dipeptide. The value of 1 means an MF dipeptide coded by at least one multiple-frame dicodon MF (MF true). The value of 0 stands for an SF dipeptide coded by a single-frame dicodon SF (MF false). For example, the value of AlaCys is 0 (absent in Table 5, Table 7 and Table 9) and the value of CysAla is 1 (7th row in Table 7).

| 1st \ 2nd | A (Ala) | C (Cys) | D (Asp) | E (Glu) | F (Phe) | G (Gly) | H (His) | I (Ile) | K (Lys) | L (Leu) | M (Met) | N (Asn) | P (Pro) | Q (Gln) | R (Arg) | S (Ser) | T (Thr) | V (Val) | W (Trp) | Y (Tyr) | Ter | Sum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A (Ala) | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 7 |
| C (Cys) | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 4 |
| D (Asp) | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 4 |
| E (Glu) | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 4 |
| F (Phe) | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 8 |
| G (Gly) | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 8 |
| H (His) | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 4 |
| I (Ile) | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 6 |
| K (Lys) | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 8 |
| L (Leu) | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 7 |
| M (Met) | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| N (Asn) | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 4 |
| P (Pro) | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 8 |
| Q (Gln) | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| R (Arg) | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 8 |
| S (Ser) | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 9 |
| T (Thr) | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 8 |
| V (Val) | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 9 |
| W (Trp) | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| Y (Tyr) | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 5 |
| Ter | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| Sum | 4 | 4 | 2 | 3 | 15 | 14 | 4 | 5 | 13 | 5 | 2 | 1 | 15 | 2 | 8 | 6 | 5 | 4 | 2 | 4 | 3 | 121 |
Table 12. Multi-frame dipeptide occurrence matrix. The 114 = 121 − 4 − 3 MF dipeptides, the four pairs (stop codon, amino acid) and the three pairs (amino acid, stop codon) coded by the 208 = 16 + 2 × 96 multiple-frame dicodons BMF (Definition 13, Table 1), 3UMF (Definition 11, Table 2) and 5UMF (Definition 12, Table 3). The rows and columns are associated with the first and second amino acid, respectively, in the dipeptide. The values between 1 and 5 give the number of multiple-frame dicodons MF that code an MF dipeptide. The value of 0 stands for an SF dipeptide coded by a single-frame dicodon SF. For example, the value of AlaCys is 0 (absent in Table 5, Table 7 and Table 9), the value of CysAla is 1 (7th row in Table 7) and the value of AlaArg is 4 (one occurrence: 1st row in Table 5, and three occurrences: 1st row in Table 9).

| 1st \ 2nd | A (Ala) | C (Cys) | D (Asp) | E (Glu) | F (Phe) | G (Gly) | H (His) | I (Ile) | K (Lys) | L (Leu) | M (Met) | N (Asn) | P (Pro) | Q (Gln) | R (Arg) | S (Ser) | T (Thr) | V (Val) | W (Trp) | Y (Tyr) | Ter | Sum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A (Ala) | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 10 |
| C (Cys) | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 7 |
| D (Asp) | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 4 |
| E (Glu) | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 6 |
| F (Phe) | 0 | 2 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 5 | 0 | 0 | 1 | 2 | 3 | 18 |
| G (Gly) | 5 | 0 | 2 | 3 | 1 | 4 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 22 |
| H (His) | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 7 |
| I (Ile) | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 2 | 8 |
| K (Lys) | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3 | 2 | 0 | 1 | 2 | 0 | 0 | 3 | 2 | 4 | 0 | 0 | 0 | 0 | 18 |
| L (Leu) | 0 | 2 | 0 | 0 | 1 | 2 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 4 | 0 | 0 | 0 | 2 | 0 | 14 |
| M (Met) | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| N (Asn) | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 4 |
| P (Pro) | 0 | 0 | 0 | 0 | 1 | 1 | 3 | 0 | 1 | 5 | 0 | 0 | 4 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 22 |
| Q (Gln) | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| R (Arg) | 4 | 0 | 2 | 3 | 1 | 2 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 16 |
| S (Ser) | 1 | 0 | 0 | 0 | 2 | 1 | 1 | 0 | 1 | 4 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 14 |
| T (Thr) | 0 | 0 | 0 | 0 | 1 | 1 | 2 | 0 | 1 | 1 | 0 | 0 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 10 |
| V (Val) | 0 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 10 |
| W (Trp) | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| Y (Tyr) | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 7 |
| Ter | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| Sum | 11 | 7 | 4 | 7 | 17 | 19 | 7 | 9 | 17 | 13 | 2 | 2 | 19 | 4 | 18 | 15 | 11 | 11 | 2 | 7 | 6 | 208 |
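The occurrence matrix of Table 12 can be rebuilt by translating every multiple-frame dicodon and counting the resulting pairs. The sketch below is an illustration, not the author's program: it assumes the standard genetic code (with `*` for a stop codon) and takes a dicodon to be multiple-frame when one of its two frame-0 codons also occurs in frame 1 or 2, as stated in the abstract. The names `is_multiple_frame`, `occurrence` and `with_stop` are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def is_multiple_frame(d: str) -> bool:
    """True if a frame-0 codon of the dicodon also occurs in frame 1 or frame 2."""
    return bool({d[0:3], d[3:6]} & {d[1:4], d[2:5]})

# occurrence[(first, second)] = number of MF dicodons coding that pair.
occurrence = Counter()
for letters in product("ACGT", repeat=6):
    d = "".join(letters)
    if is_multiple_frame(d):
        occurrence[(CODE[d[:3]], CODE[d[3:]])] += 1

# Pairs that involve a stop codon on either side.
with_stop = [pair for pair in occurrence if "*" in pair]

print(sum(occurrence.values()), len(occurrence), len(with_stop))
print(occurrence[("A", "R")], occurrence[("C", "A")], occurrence[("A", "C")])
```

The nonzero entries of `occurrence` correspond to the cells equal to 1 in the boolean matrix of Table 11, and the three printed pair counts correspond to the AlaArg, CysAla and AlaCys examples in the caption of Table 12.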

© 2019 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).