Discovery of Proteomic Code with mRNA Assisted Protein Folding

The 3x redundancy of the Genetic Code is usually explained as a necessity to increase the mutation resistance of the genetic information. However, recent bioinformatical observations indicate that the redundant Genetic Code contains more biological information than previously known, information that is additional to the 64/20 definition of amino acids. It might define the physico-chemical and structural properties of amino acids, the codon boundaries, the amino acid co-locations (interactions) in the coded proteins and the free folding energy of mRNAs. This additional information, which seems to be necessary to determine the 3D structure of coding nucleic acids as well as the coded proteins, is known as the Proteomic Code and mRNA Assisted Protein Folding.


Introduction
Mapping between messages in the nucleic acid and protein alphabets is a fascinating story, a story that is still unfolding. It is about understanding the rules of information transfer between DNA and proteins. First of all, it is not only a biochemical puzzle; many of the early methods for devising codes came from combinatorics and information theory. Four and 20 (the numbers of bases and amino acids) seem to be magical numbers with amazingly many possible mathematical connections between them [1].
The existence of a Genetic Code became obvious immediately after the discovery of DNA structure [2,3]. The first suggestion for a code came from George Gamow, not even a biologist, but a physicist who became most famous as the chief proponent of the Big Bang theory in cosmology. He proposed a Diamond Code [4][5][6] where DNA acted directly as a template for assembling amino acids into proteins.

The 3D structure of mRNA
Evolution often occurs in stages: one step is followed by a plateau before the next step can be taken. This is true even for the development of scientific thinking and understanding. The 1950s and 1960s were very fertile for biology, and the foundation of present-day molecular biology was laid. It took 30+ years to recognize that the discovery of DNA was not the discovery of the secret of Life; Life has many more secrets. Some early ideas were wrong, and some still are, very wrong. The most embarrassing mistake of modern genomics was probably the concept of sense and non-sense DNA strands. This concept was deeply rooted in the minds of even the most brilliant scientists. The possibility of whole genome sequencing finally opened the gates (in the late 1990s) to getting rid of this nonsense idea and beginning to see that both DNA strands are equally important for protein expression. It is no longer tenable that one strand is expressed while the other (complementary) strand serves only as a reproductive template (backup). Complementarity of bases is fundamental for helix formation in dsDNA, but more than that, it makes possible the formation of much more complex and sophisticated 3D structures than the familiar monotonous helix. The unique signature structure of tRNA is well recognized and accepted, whereas the possibility and significance of mRNA 3D structure formation is widely denied [11]. The mRNA passes through the translation machinery on the surface of ribosomes, codon after codon, like a tape passing the magnets of a tape recorder. A more linear mRNA is expected to be better suited for translation because a structured mRNA only slows down the machinery.
The number of mRNA structures in the Nucleic Acid Structure Database (NDB) is very limited [12]. Fortunately, the known rules of base pairing make it possible to predict the thermodynamically most likely structures of any nucleic acid [13]. (Such tools and predictions are not yet available for proteins.) It is widely accepted today that mRNAs have secondary structure, even if there are numerous methodological considerations regarding how to measure the folding energy content of nucleic acids [14][15][16].
The functional significance of the mRNA secondary structure is not known, therefore wobble base replacement is a widely used and accepted method to eliminate the folding energy of mRNA and achieve higher translational efficiency [17][18].

Physico-chemical definition of codon boundaries
Frame-shift is a major concern in translation. Nirenberg's Genetic Code does not seem to give any protection against the possible occurrence of frame-shifts, even if it promises to reduce the catastrophic consequences of wrong codon readings. However, a second look at the base composition of the codons (all 64 as they are), or of the usage-weighted variants from different Species Specific Codon Usage Tables [19], reveals that it is not completely random. The first and third codon positions contain significantly more G and C bases than the second (middle) codon position. This bias has remarkable consequences. There are three H bonds between C and G (dG = -1524 kcal/1k bases) but only two between A and T (dG = -365 kcal/1k bases). Therefore the GC content is the major determinant of folding energy and thereby of the mRNA's 3D structure. The higher GC content indicates that the 1st and 3rd codon residues have a significantly larger effect on the mRNA structure than the 2nd one. Indeed, wobble bases were found to be the most important codon residues in determining mRNA structure [14].
This codon-related periodic variation of GC content means that there is a periodic pattern of folding energy (dG) along the mRNA, which distinguishes the central codon base from the 1st and 3rd and forms a physico-chemical barrier or boundary between the codons. This is a statistical rule that does not apply to every single codon, but it still shows a general tendency: there is some potential protection against frame-shifts in Nirenberg's Genetic Code [14,20,21]. Manipulation of wobble bases to eliminate mRNA secondary structures (and speed up translation) will destroy this energy pattern and thereby increase translation errors. Native synonymous codon usage (and codon bias) seems to reflect selection for translational accuracy rather than velocity [22][23][24][25].
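The positional GC bias described in this section can be checked on any in-frame coding sequence. The following is a minimal sketch, not the method used in the cited studies; the function name `gc_by_codon_position` and the toy sequence are illustrative assumptions.

```python
def gc_by_codon_position(cds):
    """Return the GC fraction at codon positions 1, 2 and 3.

    `cds` is assumed to be an in-frame coding sequence (DNA or RNA letters).
    Trailing bases that do not fill a complete codon are ignored.
    """
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    if usable == 0:
        raise ValueError("sequence is shorter than one codon")
    fractions = []
    for pos in range(3):
        # Every third base, starting at this codon position.
        bases = seq[pos:usable:3]
        gc = sum(1 for b in bases if b in "GC")
        fractions.append(gc / len(bases))
    return fractions


# Hypothetical toy sequence (codons GCC GAG CTG AAG GCC), not real data.
toy_cds = "GCCGAGCTGAAGGCC"
first, second, third = gc_by_codon_position(toy_cds)
print(f"GC at position 1: {first:.2f}, 2: {second:.2f}, 3: {third:.2f}")
```

For a real analysis the same function would be applied to many coding sequences or to usage-weighted codon tables; a single short sequence says nothing about the general tendency.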

The Common Periodic Table of Codons and Amino Acids
There is yet another, completely separate line of evidence suggesting that codon positions are different and that the central codon position has a very special role. There has always been an effort to connect codons to their coded amino acids. The wobble base lost its importance because of its interchangeability. Most scientific efforts focused on finding stereo-chemical compatibility (spatial fitting) between the atomic geometry defined by 2 or 3 nucleic acid bases and the corresponding geometry defined by the residue of the coded amino acid [26,27]. Crick furiously attacked these efforts, stating that any connection between codons and amino acids is only accidental and that there is no underlying chemical rationale [28]. Carl Woese, on the other hand, argued that the Genetic Code developed in a way that was very closely connected to the development of the amino acid repertoire, and that this close biochemical connection is fundamental to specific protein-nucleic acid interactions [29].
A common regularity in an arrangement of codons and amino acids provides a strong support for the evolutionary connection between mRNA and coded proteins.
A Periodic Table of Codons (Table 1a) has been designed in which the codons are in regular locations. The Table has four fields (16 places in each), one for each of the four nucleotides (A, U, G, C) in the central codon position. Thus, AAA (lysine), UUU (phenylalanine), GGG (glycine) and CCC (proline) are positioned in the corners of the fields as the main codons (and amino acids). They are connected to each other by six axes (2 horizontal, 2 vertical, 2 axial). The resulting nucleic acid periodic table shows perfect axial symmetry for codons. The corresponding Amino Acid Table (Table 1b) also displays periodicity regarding the biochemical properties (charge and hydropathy) of the 20 amino acids and the positions of the stop signals. Table 1 emphasizes the importance of the central nucleotide in the codons and predicts that purines control the charge while pyrimidines determine the polarity of the amino acids [30]. In addition to this correlation between the codon sequence and the physico-chemical properties of the amino acids, there is a correlation between the central residue and the chemical structure of the amino acids. A central uridine correlates with the functional group -C(C)2-; a central cytosine correlates with a single carbon atom in the C1 position; a central adenine coincides with the functional groups -CC=N and -CC=O; and finally a central guanine coincides with the functional groups -CS, -C=O and C=N, and with the absence of a side chain (glycine) (Table 2).
Although a fairly strong correlation can be seen between codons and amino acid properties (an interesting finding as such), one has to be cautious in saying that this correlation has functional significance. Such significance could exist (e.g. in determining the structure of mRNA), but more evidence is needed. Until then, this correlation can still be a fruitful hypothesis for further research.
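The four fields of the table can be reproduced by grouping the 64 codons by their central base. The sketch below only illustrates this grouping; it does not reconstruct the layout or axes of Table 1. It uses the standard genetic code written in the DNA alphabet (T in place of U), and the helper name `group_by_central_base` is an assumption made for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def group_by_central_base(table):
    """Split a codon table into four fields keyed by the middle codon base."""
    fields = {base: {} for base in BASES}
    for codon, amino_acid in table.items():
        fields[codon[1]][codon] = amino_acid
    return fields


fields = group_by_central_base(CODON_TABLE)
for base in BASES:
    # Each field holds 16 codons; list the distinct amino acids it encodes.
    print(base, len(fields[base]), "".join(sorted(set(fields[base].values()))))
```

In this grouping the field with a central T (U) contains only F, L, I, M and V, while the charged residues D, E, K and R all fall into the fields with a central purine (A or G).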

Visualization of specific nucleic acid -protein interactions
The strong connection between codon structure and the physicochemical properties of the coded amino acids (the existence of the Common Periodic Table of Codons and Amino Acids) suggests that specific interactions between codons and coded amino acids might occur. There is no doubt that specific interaction between nucleic acids and proteins is an absolute necessity for many vital functions, for example the regulation of gene expression. Why should the codon / coded amino acid interaction be the only forbidden possibility for accomplishing this function?
The interactions between restriction enzymes (RE) and their recognition sequences (RS) are known to be very specific, and fortunately numerous such interactions have been visualized and are available from the PDB. A review of the seven available crystallographic studies [31] showed that the amino acids coded by codons that are subsets of recognition sequences were often located close to the RS itself, and in many cases they were directly adjacent to the codon-like triplets in the RS. Fifty-five examples of this codon-amino acid co-localization were found and analyzed, which represents 41.5% of the total of 132 amino acids localized within 8 Å of the C1' atoms in the DNA. The average distance between the closest atoms in the codons and amino acids is 5.5 +/- 0.2 Å (mean +/- S.E.M., n = 55), while the distance between the nitrogen and oxygen atoms of the co-localized molecules is significantly shorter (3.4 +/- 0.2 Å, p < 0.001, n = 15) when positively charged amino acids are involved. This indicates that a direct interaction between nucleic acids and amino acids might occur. We interpret these results in favor of Woese and suggest that the Genetic Code is "rational" and that there is a stereo-specific relationship between the codons and the coded amino acids (Figure 1). How well this finding supports Woese is still open, however. Although the finding that amino acids and their codons are closely located in restriction enzyme complexes can give ideas for studying the genetic code from the same perspective, it is not sufficient evidence that the same occurs in the genetic code in general. Nevertheless, the close location of codons and coded amino acids in restriction enzymes is an indirect indication of the possible existence of specific affinity between coding (mRNA) and coded (protein) sequences, which has a special functional meaning for the concept of RNA-assisted protein folding.

Partial Complementary Coding of Co-locating Amino Acids
An idea was published in the early 1980s [32][33][34][35][36][37] suggesting that specifically interacting proteins are coded by complementary nucleic acids. The idea was, of course, rejected because it proposed that even the "non-sense" DNA strand might be expressed and make some sense, which was an absurdity at that time. An early effort to confirm this theory, using (rather undeveloped) bioinformatical methods, failed [38]. There was a short comeback of this idea (today called the Proteomic Code), and several research groups confirmed that proteins derived from complementary nucleic acid strands have specific, high-affinity attraction to each other [39][40][41][42][43]. However, it turned out that there was a problem with the consistency of the results: the method sometimes worked and sometimes did not.
We constructed a bioinformatics tool to collect data on co-locating amino acids from known protein structures listed in the PDB for statistical analyses [44]. (Immediate neighbors on the same peptide chain were not counted as co-locations.) These analyses provided some rather novel observations. 1) Co-locating amino acids are physico-chemically compatible with each other, i.e. large and small, positive and negative, and hydrophobic and hydrophobic amino acids are preferentially co-located with each other. The novelty of this observation is that physico-chemical rules apply already at the residue level and do not necessarily require large, complex interfaces of interacting proteins [45].
2) Co-locating amino acids are preferentially coded by partially complementary codons, where the 1st and 3rd bases are complementary but the 2nd bases may be, but are not necessarily, complementary to each other (Figure 2). The propensities for the 20x20=400 possible amino acid pairs were monitored in 81 different protein structures with the SeqX tool. The tool detected a co-location when two amino acids were within 6 Å of each other (neighbors on the same strand were excluded). The total number of co-locations was 34,630. Eight different complementary codes were constructed for the codons (2 optimal and 6 suboptimal). In the two optimal codes, all three codon residues (123) were complementary (C) or reverse complementary (RC) to each other. In the suboptimal codes, only two of the three codon residues were C or RC to each other (12, 13, 23), while the third was not necessarily complementary (X). (For example, Complementary Code RC_1X3 means that the first and third codon letters are always complementary, but not necessarily the second, and the possible codons are read in reverse orientation.) The P/N (positive/negative) ratio indicates the proportion of co-locating amino acids coded according to the defined codon complementarity rules. These observations lead us to conclude that there are significant functional and structural connections between codons and coded amino acids in addition to those described by Nirenberg and known as the Genetic Code. We formulated the recent concept of the Proteomic Code to describe this additional connection (for review see [46,47]).
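The eight complementary codes can be expressed as a single check with two parameters: which codon positions must pair, and whether the second codon is read in reverse. The sketch below is one possible reading of the C/RC and 123/12/13/23 notation, not the SeqX implementation; the function name `codons_match` and its parameters are assumptions made for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def codons_match(x, y, positions=(1, 2, 3), reverse=True):
    """Test two codons against a (partial) complementary code.

    positions: 1-based codon positions that must be complementary;
               (1, 3) corresponds to the '1X3' codes, (1, 2, 3) to '123'.
    reverse:   True for reverse complementary (RC) codes, where codon y is
               read in the opposite orientation; False for C codes.
    Assumption: in RC codes, position p of x pairs with position 4 - p of y.
    """
    x = x.upper().replace("U", "T")
    y = y.upper().replace("U", "T")
    if len(x) != 3 or len(y) != 3:
        raise ValueError("codons must have exactly three letters")
    partner = y[::-1] if reverse else y
    return all(COMPLEMENT[x[p - 1]] == partner[p - 1] for p in positions)


# AAG (Lys) and CTT (Leu) are perfect reverse complements (RC_123).
print(codons_match("AAG", "CTT"))
# AAG and CGT pair at positions 1 and 3 only (RC_1X3), matching the
# 5'>ANG>3' / 3'<TNC<5' pattern in which the middle base is free.
print(codons_match("AAG", "CGT", positions=(1, 3)))
```

Counting, over all co-locating residue pairs, how often at least one pair of their synonymous codons satisfies a given code would give the kind of proportion that is compared between codes.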

The Role and Predictability of Wobble bases
The 64/20 Genetic Code is redundant, mainly because the 3rd codon bases, in most codons, are interchangeable without any consequence for the sequence of the coded protein. The only information expected to pass from the DNA to protein synthesis is the coding of amino acids, because it is believed that the only information necessary for correct protein folding is the correct amino acid sequence itself.
This belief is based on Anfinsen's thermodynamic principle [48], which states that the amino acid sequence contains all the information necessary for correct and unambiguous protein folding or, in other words, that folding/refolding is in principle a reversible, thermodynamically driven process. In practice, however, the reversibility of denaturation is not observed in all cases. Most proteins in an environment close to their physiological conditions (neutral pH, ~150 mM ionic strength) cannot fold back after complete or partial unfolding. This is particularly true for large proteins [49]. Folding and refolding might be visualized as the navigation of an ensemble of unfolded molecules through a bumpy energy landscape in search of the native state [50]. A fraction of the molecules reaches the native state directly, whereas the remaining fraction gets kinetically trapped in metastable conformations. Therefore the general validity of Anfinsen's thermodynamic principle is the subject of recurring criticism (like the Levinthal paradox [52], which refers to kinetic aspects of protein folding and does not disprove the existence or the stability of the native state). However, even if no protein has yet been identified that cannot fold spontaneously under permissive conditions, most protein chemists agree that, in many cases, additional molecular information, provided by chaperones, is necessary to guide or facilitate protein folding under physiological (neutral) conditions. The annealing action of these (mainly protein) chaperones is described as providing supplemental information for systems that do not otherwise have a definite ground state. Extensive research into chaperonin-assisted protein folding [50,51,53,54,55] indicates that chaperonins rescue misfolded proteins from kinetic traps and that the native state of the substrate protein is not altered by the chaperonin.
A chaperone, for example, interacts with its denatured substrate protein (SP), removes it from its kinetic trap by stretching it [51], and stabilizes the elongated chain. Correct folding in the cytosol is achieved either on controlled chain release from these primary chaperones or after transfer of newly synthesized proteins to downstream chaperones, such as the chaperonins [54]. Protein chaperones are highly conserved, and the coexistence of these chaperones in the same cytosol suggests that certain chaperone-cochaperone interactions are also permitted.
In other words, there is some 3x excess of information before translation, and there seems to be a shortage of information after translation. It is logical to assume that folding information is stored in the redundant codons, more concretely in the wobble bases. The literature is actually rather rich in observations connecting the wobble bases to some structural feature of the coded proteins [56][57][58][59][60][61]. Even a wobble-base-centric sequence-structure database has been constructed [62].
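The size of this excess can be put into numbers in two ways: as the ratio of codons to amino acids, or as the difference in information content per symbol. The short calculation below is only an illustration and assumes, as an idealization not made in the text, that all 64 codons and all 20 amino acids are equally likely; stop signals are ignored.

```python
import math

codons, amino_acids = 64, 20

# Ratio of codons to amino acids: the "3x" redundancy of the 64/20 code.
redundancy = codons / amino_acids

# Information per symbol under the equal-probability assumption.
bits_per_codon = math.log2(codons)
bits_per_amino_acid = math.log2(amino_acids)
excess_bits = bits_per_codon - bits_per_amino_acid

print(f"codons per amino acid: {redundancy:.1f}")
print(f"bits per codon: {bits_per_codon:.2f}")
print(f"bits per amino acid: {bits_per_amino_acid:.2f}")
print(f"unused capacity per codon: {excess_bits:.2f} bits")
```

Under these assumptions a codon can carry 6 bits while specifying an amino acid needs about 4.32 bits, so roughly 1.68 bits per codon remain available for other information.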
The preferential coding of co-locating amino acids by partially complementary nucleic acids (for example by the 5'>ANG>3'/3'<TNC<5' pattern) immediately suggests a role for the wobble bases. They are integral parts of the codons, defining amino acid co-locations. They are not randomly chosen but logically selected: if two amino acids (Xa and Ya) are co-locating, the wobble base of codon Xc (coding for amino acid Xa) is defined by the first residue of codon Yc (coding for amino acid Ya) and, vice versa, the 3rd residue of Yc is defined by the 1st base of codon Xc (A defines T & U, G defines C).
Protein structures contain many amino acid co-locations (immediate neighbors on the same chain are excluded). Suppose that preferential partial complementarity coding of amino acids is not a rarity, but it is a rule. In that case the signs of non-randomness of wobble base selection should be seen not only in a small subset of proteins but even in very large data sets, like the species specific codon usage frequency tables.
Statistical analyses of the A, T, G and C frequencies at the 1st, 2nd and 3rd codon positions in 113 species-specific Codon Usage Frequency Tables and 87 protein structures showed a strongly significant internal correlation between the frequencies of nucleic acid bases at the different codon positions. This strong relationship made it possible to predict the frequency of all possible wobble bases in all 64 codons in all 113 species (p<1.3E-64, n=113) and all 87 proteins (p<1.1E-28, n=87) [63].
These strong correlations wouldn't be possible with random selection of wobble bases. Therefore we concluded, that synonymous codons are not interchangeable with each other without disturbing the internal order of bases in integrated codon systems like a native mRNA or a species specific Codon Usage Frequency Order.

Integrated Codon Systems
There is more than the observations provided by theoretical and computational biology to indicate that native, natural proteins, as well as their coding sequences, are much more than the sequential collection of their building blocks. They are an integrated, interconnected system. 1) Wobble base mutations are expected to be "silent", without any consequences for biological functions or phenotypes. Often they are not. "Silent" polymorphisms or mutations affect a) substrate specificity [64], b) drug pharmacokinetics and multidrug resistance in human cancer cells [65], c) mRNA stability and synthesis of the receptor [66], d) splicing [67,68], e) different functions [69][70][71].
2) It has recently become clear that the classical notion of the random nature of mutations does not hold for the distribution of mutations among genes: most collections of mutants contain more isolates with two or more mutations than predicted by the mutant frequency on the assumption of a random distribution of mutations. Excesses of multiples are seen in a wide range of organisms, including riboviruses, DNA viruses, prokaryotes, yeasts, and higher eukaryotic cell lines and tissues. In addition, such excesses are produced by DNA polymerases in vitro. These "multiples" appear to be generated by transient, localized hypermutations rather than by heritable mutator mutations. The components of multiples are sometimes scattered at random and sometimes display an excess of smaller distances between mutations [72,73].
3) A compensatory mutation occurs when the fitness loss caused by one mutation is remedied with a second mutation at a different site in the genome.
Often it occurs in the same gene, alters the protein sequence [74,75] but saves the protein's secondary and tertiary structure. Compensatory mutations (in the same genes) seem to work through restoring the protein's structure (allosterism) which saves the protein's function [76].
The uneven concentration of mutations at smaller distances and their compensatory character are further indications of the integration and interconnectivity of codons in the same gene and, consequently, of the conservation of structurally critical amino acid connections (but not of the amino acids themselves) in the coded proteins.
However, it should be noted that the distribution of mutations in directed evolution experiments is such a complex phenomenon that it cannot be used directly to support the Proteomic Code idea.

Figure caption (beginning missing): [47]). Residue contact maps (RCM) were obtained from the PDB files of the protein structures using the SeqX tool (left triangles). Energy dot plots (EDP) for the coding sequences were obtained using the mfold tool (right triangles). The two maps were aligned along a common left diagonal axis to facilitate visual comparison between the different possible representations. The black dots in the RCMs indicate amino acids that are within 6 Å of each other in the protein structure. The colored (grass-like) areas in the EDPs indicate the energetically most likely RNA interactions (color code in increasing order: yellow, green, red, black). The coordinates indicate the number of the amino acid and the corresponding nucleic acid residues.
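A residue contact map of the kind used in these comparisons marks every pair of residues that lie within a distance cutoff, excluding immediate chain neighbors. The sketch below is not the SeqX implementation: it assumes one representative 3D coordinate per residue (for example the Cα atom), which the text does not specify, and the function name `contact_pairs` and the toy coordinates are hypothetical.

```python
import math


def contact_pairs(coords, cutoff=6.0, min_separation=2):
    """Return index pairs (i, j), i < j, of residues within `cutoff` Å.

    coords:         one (x, y, z) tuple per residue, in chain order.
    min_separation: smallest allowed |i - j|; the default of 2 skips
                    immediate neighbors on the same chain.
    """
    pairs = []
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                pairs.append((i, j))
    return pairs


# Hypothetical five-residue hairpin with 3.8 Å spacing, not real data.
toy_chain = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (7.6, 3.8, 0.0),
    (3.8, 3.8, 0.0),
]
print(contact_pairs(toy_chain))
```

Plotting the returned pairs as dots in an i-versus-j grid gives the triangular map; the nested loop is quadratic in chain length, which is acceptable for single proteins.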

The RNA-assisted Protein Folding
The preferential partial complementarity coding of co-locating amino acids (the Proteomic Code) suggests the possibility that mRNAs and coded proteins may share some common structural features. Side-by-side comparison of 2D projections of mRNAs and coded proteins seems to confirm this possibility (Figures 3 and 4). Biological rules are, of course, always statistical rules, probabilities and tendencies. Nucleic acids as well as proteins have many possible configurations, of which one or a few are expected to dominate and define the main, characteristic configuration. Coding and coded sequences might have their own range of more or less different folding potentials. However, when a protein is generated on the surface of the ribosome, the coding and coded sequences are very close to each other. This temporary, intimate closeness gives coding sequences an opportunity to transfer folding information to coded proteins, information that is additional to what these proteins already have in their amino acid sequences. Some ideas on how this is possible are sketched in Figures 5 and 6. RNA-assisted protein folding is an interesting theory that was derived recently [77] from the much older theory of the Proteomic Code [46] and is mainly based on bioinformatical observations. Laboratory evidence is still missing. Mutagenesis (for example) should in principle be able to shed some light on this topic. It remains to be seen how the structural features of the translational machinery allow this kind of mechanism. The involvement and role of transfer RNA (tRNA) should, of course, not be forgotten.

Evolutionary Aspects of the Redundant Genetic and Proteomic Codes.
The Common Periodic Table of Codons and Amino Acids is convincing evidence for the nonrandomness of codon / coded amino acid connection. This connection is further emphasized by the preferential complementary coding of co-locating amino acids. The predictability of wobble base frequency is another indicator of non-random wobble base selection and a functional network between codons. Order is traditionally seen as the result of development from the chaotic to organized, from simple to complex. This process is called evolution.

Figure caption (beginning missing): -a', b-b', c-c', which fold the mRNA into a T-like shape. During the translation process the mRNA unfolds on the surface of the ribosome, but subsequently refolds, accompanied by its translated and lengthening peptide (red dotted line, B-F). The result of translation is a temporary ribonucleotide complex, which dissociates into two T-shape-like structures: the original mRNA and the properly folded protein product (G). The red circles indicate the specific, temporary attachment points between the RNA and protein (for example a basic amino acid) while the blue circles indicate amino acids with exceptionally high affinity for the attachment points (for example acidic amino acids); these capture the amino acids at the attachment point and dissociate the ribonucleoprotein complex. Transfer-RNAs are of course important participants in translation, but they are not included in this scenario. (Figure is reproduced from [77]).
The theory of Ikehara [78,79] about the origins of genes, the genetic code, proteins and life is especially interesting with regard to the Proteomic Code. Ikehara suggests (and supports with experimental evidence) that the "gene-protein" system, comprised of 64 codons and 20 amino acids developed successfully during -proteins are able to represent the 6 major (and characteristic) protein moieties/indices (hydropathy, α-helix, β-sheet and β-turn forms, acidic amino acid content and basic amino acid content) which are necessary for appropriate three-dimensional structure formation of globular, water-soluble proteins on the primitive earth. The [GADV]-proteins (even randomized ones) have catalytic properties and are able to facilitate the synthesis of other [GADV]-proteins (also random).
The primeval genetic code continued to develop toward a more complex SNS-type primitive genetic code (S: G or C) containing 16 codons and encoding 10 amino acids (L, P, H, Q, R, V, A, D, E, G) before the recent 64 codon/20 amino acid-type genetic code became established.
Furthermore, Ikehara concluded from the analysis of microbial genes that newly-born genes are products of nonstop frames (NSF) on antisense strands of microbial GC-rich genes [GC-NSF(antisense)] and from SNS repeating sequences [(SNS)n] similar to the GC-NSF(antisense).
The similarity between GNC/SNS-type primitive codons (which are expressed even from the reverse-complement strands as GC-rich non-stop genes) and the Proteomic Code is obvious. Both concepts suggest and agree with each other regarding a) the connection between 2nd codon residue and the fundamental physicochemical properties of the coded amino acids, b) the importance of 1st and 3rd codon letters in determining the nucleic acid (as well as protein) structure, c) the importance of compositional difference between 1st, 3rd and central codon residues (to emphasize the codon boundaries), d) the importance of complementarity (even in the mRNA) in development of protein structure and function, e) the importance of GC at the 1st and 3rd codon positions (as the source of lower Gibbs energy (dG), than central codon positions have, where even AT are permitted). I think that the concept of GNC/SNS-type primitive codons and the Proteomic Code are convergent ideas, both reflecting the same fundamental aspects of the connection between nucleic acid and protein structure and function.

Conclusions
Nirenberg's Code seems to have another face that has been in the shadow for some 40+ years. We are starting to see that the Genetic Code might actually contain all the characteristics that were expected and promised by the early theoretical code models.
1) Codon boundaries are physico-chemically defined to a certain degree, which theoretically should give some protection against frame-shifts.
2) Codon residues are not randomly assigned, but there is a connection between codon architecture and the physicochemical properties of the coded proteins.
3) Amino acids preferentially interact with their codons (studied in restriction endonucleases).
4) Co-locating amino acids are preferentially coded by partially complementary codons, which create inter-connectivity between structurally important amino acids.
5) Wobble bases are not randomly assigned at all; their frequency is statistically well predictable from the frequency of bases at other codon positions.
6) Wobble base redundancy makes possible the development of codon integration in coding sequences, which might be used for compensatory mutations. This is the second line of defense against mutations (after the known tolerance provided by the coding redundancy).
7) The internally inter-connected and integrated system of codons makes it possible that coding sequences provide a mold for structure forming of coded proteins and function as nucleic acid chaperons providing the missing molecular information to correct protein folding.
It is concluded that the redundant Genetic Code contains biological information which is additional to the 64/20 definition of amino acids. This additional information is used to define the 3D structure of coding nucleic acids and coded proteins and is called the Proteomic Code.