4. Discussion
Although the universal genetic code is a major informatics factor governing the development and functioning of all organisms, there is no rational argument suggesting that it is the only form of natural code. For proper functioning, many life processes require other universal identifiers that enable error-free recognition. A three-letter (three-position) universal genetic code that is sufficient for translational reading may not be sufficient for charging tRNA with the appropriate amino acid via aminoacyl-tRNA synthetase. Preliminary numeric experiments indicate that, for the assignment of a tRNA strand to a given amino acid class, the third position of the anticodon, position 34 of the tRNA strand (Figure 1), is not as important as the dominating anticodon tandem, namely positions 36 and 35 (Figure 3). We propose the name “tandem” because the nucleotides at these two positions usually occur together in discussions of nucleotide importance. Position 34 takes only the 41st place out of 77 in the correlation ranking (Table 2), which has practical meaning in machine learning. For example, a classifier that must choose correctly between two amino acids coded by the same anticodon tandem (e.g., Gln and His) will not prefer the nucleot(s)ide in position 34, as is the rule in the translation of the universal genetic code. It will instead recognize other, statistically more important components. Thus, if we assume that the training of a classifier mirrors the features of the natural enzymatic process of tRNA recognition, and that the final single result of the classification corresponds to the enzymatic loading of an amino acid, it follows that, in the process of tRNA charging, the empty tRNA transporter may also effectively expose nucleot(s)ide(s) other than the anticodon to aminoacyl-tRNA synthetase. The marginalization of the third position of the anticodon during amino acid loading, a position that is very important in translation, may be related to the specificity of the nucleotide–protein interactions (H-bonds, salt bridges, and the hydrophobic effect) in the area of the anticodon site during tRNA attachment to the synthetase. Regardless of the loosening caused by the “wobble” effect, complementary specificity in position 34 has to be maintained for future translation in the ribosome; however, in some tRNA classes, it may not be guaranteed by the local interaction of the synthetase with the single nucleotide in this position. In such cases, other well-defined nucleotides, even outside the anticodon loop and stem, have to be used. They may also speed up the recognition process. Thus, it is expected that the so-called “identity nucleotides” are a small set of nucleotides determining the identity of a tRNA; more precisely, they carry chemical groups that often interact with amino acids of the synthetases.
Another example of the importance of the identity nucleotides is provided by the 6-fold degenerate amino acids (Leu, Ser, and Arg), whose tRNAs may expose two different leading nucleotide tandems in the anticodon. In this case, identity nucleotides may reduce the uncertainty level in positions 35 and 36 and prevent charging errors, in effect also speeding up the binding process.
On the other end of the tRNA strand, the proper fit of the tRNA acceptor site to the catalytic center of the synthetase may also require additional identifying nucleotides, which may be especially essential in recognizing the tRNA for the same coded amino acid by synthetases of different classes (LysI and LysII).
The strand positions less important than 34, e.g., 8, 20, 33, and 74–76 (Figure 3), may be related to the universal third-order structure of the tRNA strand and the acceptor site CCA-3′.
As nucleotide-specific interactions between aminoacyl-tRNA synthetases and their cognate tRNAs ensure accurate RNA recognition and prevent the binding of noncognate substrates, reducing further translational errors, the above examples highlight the importance of identity nucleotides.
Machine learning algorithms are computational models that allow computers to automatically improve their findings through the experience gained while analyzing training data sets. Their advantages are the fast processing of big data and objectivity, which is especially important in analyzing biological data. Machine learning classifiers are algorithms that automatically assign data points to classes. As such, they are well suited to modeling the processes of recognition decisions. A simple logistic classifier uses logistic regression to predict a binary variable (0, 1) assigned to a decision. This technique assumes that the relationship between the natural log of the odds and the measurement variables, described by the so-called predictor function, is linear. In the discussed case, the predictor function quantitatively sums the presence of selected nucleotides, assigning them appropriate weights that express their importance.
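The predictor function described above can be illustrated with a minimal sketch. The weights, the bias, and the example strand below are hypothetical values chosen only for illustration; they are not taken from the trained model, and the helper names `predictor` and `probability` are assumptions of this sketch.

```python
import math

# Hypothetical weights for illustration only; not values from the paper.
# Keys are (position, nucleotide) pairs; a positive weight supports the class.
WEIGHTS = {(35, "U"): 2.0, (36, "G"): 1.5, (73, "A"): 0.5, (44, "A"): -1.0}
BIAS = -3.0  # hypothetical intercept of the predictor function


def predictor(strand: dict) -> float:
    """Linear predictor: bias plus the weights of the nucleotides present."""
    return BIAS + sum(
        w for (pos, nuc), w in WEIGHTS.items() if strand.get(pos) == nuc
    )


def probability(strand: dict) -> float:
    """Logistic link: the predictor equals the natural log of the odds."""
    return 1.0 / (1.0 + math.exp(-predictor(strand)))


# Hypothetical strand, given as position -> nucleotide.
strand = {35: "U", 36: "G", 73: "A", 44: "C"}
f = predictor(strand)               # -3.0 + 2.0 + 1.5 + 0.5
p = probability(strand)
log_odds = math.log(p / (1.0 - p))  # recovers f up to rounding
```

The last line shows the linearity assumption in practice: taking the log of the odds of the predicted probability returns the value of the linear predictor.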
The Weka SimpleLogistic classifier was verified in the preliminary numeric experiments as the best classifying algorithm in 10-fold cross-validation and 66% split tests and was finally chosen among others (Table 3). A useful feature of this classifier is the automatic, explicit indication of the most important attributes (nucleotides) and their values (weights). The number of algorithmically selected nucleotides depends on the stopping criterion of the LogitBoost iterations. The chosen option of minimizing the training misclassification error results in smaller sets of selected attributes than the AIC or cross-validation options.
The final classification task with the full training set indicates the positions in the tRNA strand that are important in recognizing the proper class. Their usage (Figure 4a,b) correlates (cc = 0.63) with the ranks assigned by CorrelationAttributeEval (data partially presented in Table 2). The most useful are the anticodon loop area (34–37) and positions 2, 21, 48, and 73, localized in the Acc-stem, D-loop, V-loop, and 3′ end regions, respectively. It is assumed that the revealed contents of these positions represent the nucleotides that are necessary for full identification. Thus, they were named identity positions and nucleotides.
The revealed identity positions and nucleotides are presented in Table 4. The shown class cases include 2–8 positions, filled with 14 different nucleot(s)ides, i.e., A, B, C, D, G, K, M, P, Q, U, 6, 8, and # (for symbol meanings, see Supplementary Materials, Table S1). They also contain empty positions (-) and exclusion rules (~) indicating unfavorable fillings. The presented attributes (positions) with their values (nucleotides) are theoretically the most representative group of features properly determining the values of the linear learners (predictors) for the correct determination of the predicted classes. In some cases, two or three different fillings were proposed at position 34, which was indicated in only eight classes.
The identity positions use the most representative components of the consensus strand in the class of tRNAs loaded by a given amino acid (Table 5). Only 13/129 (10%) of the positions do not match the consensus. A global consensus, as in positions 74–76, was not useful for specific tRNA^aa recognition.
The identity nucleotides are not always all present together in a single real adaptor molecule, but they are at least partially visible in the strands of a given tRNA class (Figure 5). Full sets of the predicted class identity nucleotides occurred in at least one strand in 67% (16/24) of tRNA classes (Figure 6). In the other classes, the maximum occurrence was at least 71% (5/7, Glu) of the possible identity positions per strand.
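As a minimal sketch of how the per-strand occurrence of identity nucleotides could be counted, the snippet below compares one strand with a class identity set. The identity set and the strand are hypothetical and serve only to reproduce a 5/7 case; the function name `coverage` is an assumption of this sketch.

```python
# Hypothetical identity set for one class: position -> expected nucleotide.
# Positions and nucleotides are made up for illustration.
IDENTITY = {2: "G", 21: "A", 34: "U", 35: "U", 36: "C", 48: "C", 73: "G"}


def coverage(strand: dict, identity: dict) -> float:
    """Fraction of identity positions carrying the expected nucleotide."""
    hits = sum(1 for pos, nuc in identity.items() if strand.get(pos) == nuc)
    return hits / len(identity)


# Hypothetical strand that differs from the identity set at positions 34 and 48.
strand = {2: "G", 21: "A", 34: "C", 35: "U", 36: "C", 48: "U", 73: "G"}
cov = coverage(strand, IDENTITY)  # 5 of 7 identity positions match

# Fraction of classes with a full identity set in at least one strand (from the text).
full_set_fraction = 16 / 24
```

Exclusion rules (~) and empty positions (-) are deliberately left out of this sketch; handling them would require the symbol definitions from Table S1.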
Some ensembles of identity nucleotides, wider than the anticodon tandem, can serve as universal tRNA^aa class markers, i.e., they are entirely present in all strands of only one tRNA^aa class (Table 6). The above findings raise questions about the direction of the evolution of these sites and the possible informational role and importance of specially marked amino acid transporters.
Some ensembles of identity nucleotides, even without those from positions 35 and 36, exhibit high class specificity. A total of 16 such ensembles, in 76 occurrences within 511 strands, were observed as unique, i.e., observed in only one class (Table 7), although not in every strand of that class. They contain 2–6 positions written with an 11-letter alphabet, which also includes empty positions and attributes with the opposite impact. They may be a large fraction of the identity nucleot(s)ides of a given class (Figure 5), but they may not be common to all strands. Such unique extra-anticodon ensembles may conserve the formerly predicted classification in the case of a modified anticodon tandem. This mechanism could have functioned during the evolution of the genetic code, producing its extra degeneration. There may be some traces of this phenomenon, e.g., position 21A in Arg and Arg2 and position 48~- in Leu and Leu2 (Table 4). This may also be a source of translation errors.
If a classifier based on a machine learning algorithm correctly classifies representatives of classes occurring in nature, the probability of its decisions must be consistent with the probability of occurrence of the real physicochemical processes that naturally determine these classes. In analyzing the formation of tRNA classes, we assumed that these are the thermodynamic processes of overcoming the potential barrier during the fitting of the tRNA strand to the corresponding ligase. Thus, the central assumption of the proposed theoretical model of the machine learning simulation of tRNA binding to aminoacyl-tRNA synthetase is the correspondence between the biologically probable diversity of tRNA^aa strands revealed by the AI classifier and the thermodynamic probability. This allows for the mathematical alignment of the logarithm of odds for the classification task with the logits for thermodynamically driven tRNA and synthetase binding (Equation (8)), i.e., ln(odds'_i) = logit(p_ii), where odds'_i is the measure of success in the numeric experiment and p_ii is the probability of binding described by the Boltzmann distribution (Equation (4)). This allows the change in the free energy of binding, ΔG_ii, to be expressed by the predictor function f_i, which may be treated as a dimensionless energy (Equation (9)). Thus, the predictor, which includes the presence of selected nucleotides, becomes a mathematical model of the free energy change related to each such nucleotide, namely, its interaction with aaRS.
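The alignment between log-odds and binding free energy can be sketched numerically. Equations (4), (8), and (9) are not reproduced in this section, so the snippet assumes a simple two-state Boltzmann model in which the unbound state is the zero of energy and the free energy change is expressed in units of k_BT; under that assumption, the predictor equals minus the dimensionless free energy change. The function names are hypothetical.

```python
import math


def binding_probability(delta_g_kbt: float) -> float:
    """Two-state Boltzmann probability of the bound state (energy in k_BT)."""
    return 1.0 / (1.0 + math.exp(delta_g_kbt))


def logit(p: float) -> float:
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


def delta_g_from_predictor(f: float) -> float:
    """Assumed mapping: the free energy change in k_BT is minus the predictor."""
    return -f


f = 2.0                         # hypothetical predictor value
dg = delta_g_from_predictor(f)  # negative value, i.e., attraction
p = binding_probability(dg)     # logit(p) returns f again
```

With this sign convention, a larger predictor value corresponds to a more negative free energy change, which matches the interpretation of negative energy inputs as attraction used in this section.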
The overall picture, for all classes, of the energetic input ΔG_ijk of a given tRNA position to the total free energy change is presented in Figure 7a. It shows the energetically rescaled values (Equation (10)) of the coefficients (par_ijk) of the linear predictor functions (Equation (1)), representing the relative effect of a given filling on the value of the predictor of a given class. The maximal value of a given predictor at a given nucleotide content leads to assigning the classified tRNA strand to the corresponding class. In Figure 7b, a detailed example is shown for a single predictor of the tRNA^Gln class. It includes five positions with a negative energy input (attraction) and three positions with a positive input (repulsion) if filled with the indicated nucleotides. In this work, the positive values of energy input are interpreted as exclusion rules, e.g., ΔG_ijk > 0 for nucleotide “A” in position 44 implies the rule “44~A”, i.e., the presence of “A” in this position testifies against the tRNA^Gln class; all other nucleot(s)ides are neutral.
The nucleotide-independent part of the free energy change, ΔG_0ii (Figure 8), calculated from the bias of the predictor function (Equation (1)) and the theoretical formula (Equation (11)), was always positive, which corresponds to repulsion. This term represents the part of the energy not explained strictly by the nucleotide attributes of the simple logistic model. On the other hand, a simple analysis showed its moderate correlation (cc = 0.46) with the net negative charge of the amino acids of aminoacyl-tRNA synthetases at a pH of 7.0 (Figure 18). As seen in Figure 9a,b, tRNA exposes its negatively charged sugar–phosphate backbone toward an overwhelmingly negative enzyme, which may result in electrostatic repulsion. This repulsion is reduced in the cases of tRNA with crucial third anticodon nucleotides, which have to bind more strongly to be properly recognized (see Figure 11). Thus, it is reasonable to assume that ΔG_0ii describes the electrostatic repulsion between the tRNA backbone and the amino acids of the synthetase.
The average change in the free energy of binding, ΔG_ii, was estimated at −8.6 to −4.0 k_BT (Figure 10). A decreasing trend in the binding energy with increasing molecular mass of the amino acid was observed up to LysII, followed by a slight increase. Its comparison (Figure 11) with the energy barrier for the attachment, ΔG_forii, approximated by ΔG_0ii (Equation (13)), led to an estimate of the strength of binding, i.e., the energy required to reverse the process, ΔG_revii = ΔG_ii − ΔG_forii. The strongest attraction was determined for the tRNA^Gly class, ΔG_revii = −15.2 k_BT. The weakest attraction, ΔG_revii = −6.6 k_BT, was found for the tRNA^LysII class.
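The relation between the binding energy, the forward barrier, and the reversal energy can be written as a short calculation. The class names and energy values below are hypothetical placeholders in units of k_BT, not results from the figures; only the subtraction itself follows the definition given above.

```python
def reversal_energy(delta_g: float, delta_g_for: float) -> float:
    """Energy required to reverse binding: dG_rev = dG - dG_for (in k_BT)."""
    return delta_g - delta_g_for


# Hypothetical (dG, dG_for) pairs for two made-up classes, in k_BT.
examples = {"classA": (-5.0, 3.0), "classB": (-4.0, 1.0)}

rev = {
    name: reversal_energy(dg, dg_for)
    for name, (dg, dg_for) in examples.items()
}

# The most negative reversal energy marks the strongest attraction.
strongest = min(rev, key=rev.get)
```

Because the forward barrier is positive (repulsion), subtracting it makes the reversal energy more negative than the binding energy itself.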
Some simultaneous variations in the free energy change with the increase in the number of identity positions, N_ip (Figure 12), were observed. The reversal energy limited to the anticodon tandem alone, ΔG_revii(tan), increases (the attraction of the tandem decreases), whereas the rest of the reversal energy, without the anticodon tandem, ΔG_revii|tan, and the forward part of the energy, ΔG_forii, decrease (the attraction increases). This counterplay of the factors outside the anticodon cannot fully balance the variation in the tandem energy, so the total free energy change slightly increases (Figure 13) with the number of identity positions.
An unambiguous interpretation of the sequence-independent energy component ΔG_0ii as the long-range term ΔG_forii (Equation (13)) in the “crossing the energy barrier” model (Figure 2) may, however, cause some issues. There could be other sequence-independent interactions that are short ranged (but do not contribute to the sequence-specific recognition process). As the value of ΔG_forii, calculated as ΔG_forii = ΔG_0ii, decreases with the number of identity positions N_ip (Figure 12), this may suggest that, for an appropriately large number of identity nucleotides, the parameter b_i (Equation (11)) and the corresponding repulsion fall to zero independently of the increasing proximity of the molecules. Thus, to a first approximation, short-range repulsion can be neglected, and ΔG_forii mainly reflects the long-range interaction energy.
The “strong” nucleotides (G and C), which can theoretically form three hydrogen bonds in interactions with other RNA or protein, bind aminoacyl-tRNA synthetases in positions 35 and 36 more strongly than the “weak” nucleotides (A and U) (Figure 14). This may suggest an important role of hydrogen bonding. The average data in the other positions for all analyzed classes were insufficient. The dependence of the total energy of attraction on the actual number of possible hydrogen bonds, N_HB, in different identity ensembles of the tRNA^Glu class (Figure 15) seems to confirm the above findings. At a constant anticodon tandem, the attraction of the other identity nucleotides increases with the actual number of hydrogen bonds, so the actual ΔG_revii’ decreases.
The examples of recalculations for glutaminyl-tRNA synthetase (Figure 16a), histidyl-tRNA synthetase (Figure 16b), and lysyl-tRNA synthetase (Figure 16c) using consensus tRNA^aa strands suggest that, for proper recognition, the change in binding free energy should fall below a certain level characteristic of a given class. More weakly bound or repelled tRNAs are not recognized. The energies of the least strongly bound real strands, max ΔG_ii, for different classes are shown in Figure 17. They provide estimates of the minimal binding energy levels.
Misacylation as a source of mistranslation occurs approximately ten times less frequently than misreading [19]. Nevertheless, it should also be investigated as a potential source of translational errors. The data in Figure 16c show that tRNA^Asn is the best candidate to be misrecognized by the class II lysyl-tRNA synthetase.
Generally, the energies presented in a holistic approach in Figure 16a–c and Figure 17 can be skewed, and thus misleading, due to overrepresentation or underrepresentation. To avoid this effect, duplicate records were removed from the data set, and only the unique strands representing a wide spectrum of possible tRNA occurrences in nature were considered (see Section 2.1). In the theoretical model (Equation (3)), it was assumed that the [Aminoacyl-AMP aaRS] complexes for all coded amino acids are equally available and that the classes in the machine learning classifier are equally accessible. These conditions, reminiscent of the principle of equal a priori probabilities in statistical mechanics, may avoid additional skewness. The entropic component is not addressed here, but at the assumed constant temperature, it does not influence the change in the Gibbs free energy.
The coverage of identity nucleot(s)ides, i.e., the ratio of those that occur to those that could occur in a given class (see the example in Figure 5), varies with the maximal number of identity positions, N_IP (Figure 19). It decreases as the number of identity positions grows, which raises the question of whether this is an evolutionary trend. The coverage, similar to the abundance [20], might be a useful parameter in the biophysical modeling of biological processes.
When simultaneously analyzing the field of the two parameters possibly related to the evolutionary history of the genetic code, i.e., the molecular weight of the charging amino acid, MW, and N_IP, a specific pattern was found in which the tRNA^aa classes of aa belonging to the same metabolic families cover the area of the discussed values (Figure 20).
According to this picture, the tRNA^aa for aa from the serine and pyruvate families are characterized by a lower range of MW and N_IP values. On the other hand, the histidine and aromatic families contain MW and N_IP from the ranges of higher values. The aspartate and glutamate families cover the range of moderate MW and a wide spectrum of N_IP. One may conclude that the serine and pyruvate families, with their low aa mass and weakly identified tRNA, emerged completely at the beginning of the evolution of the code, much earlier than the final histidine and aromatic families. In this scheme, the other families were created throughout the entire period of the evolution.
Moreover, the average free energy change in a given tRNA^aa class increases with the molecular weight of the corresponding amino acid (Figure 21). Two weight groups of amino acids were distinguished: below and above 150 Da. The initial trend for Gly-Met (blue) results in fragile binding above the mass of methionine, which might stop the aminoacylation of tRNA. It is likely that this trend was evolutionarily changed to the stronger binding trend His-Trp (red) through small mutations in the previously used, more strongly bound tRNA strands, even amplifying their binding. The consensus strand (Table 5) of histidine (H) is the most similar to that of glutamic acid (E) and vice versa. These consensus strands are the closest in the entire analyzed set. The tRNA for histidine is the only one that includes a strongly attracting guanosine (G) at position −1 (Table 4, Figure 7a).
It is reasonable to expect that the tRNAs of the amino acids whose genetic code originated later have more identity positions, whereas the classes whose code originated earlier are more completely covered and transport lighter amino acids [21]. The dependence of the coverage on the number of identity positions (Figure 19) and the specific distribution of aa families on the parameter plane MW × N_IP (Figure 20) seem to support these expectations. As a result, N_IP can be postulated as a determinant of evolutionary progress, and anticodons alone or very short ensembles of identity nucleotides may have played a role only at the beginning of the evolution, i.e., in a two-letter code evolution. However, their informative role became too low and came to be supported by bigger sets of spatially distributed elements. This is seen especially clearly in the anticodons of pairs such as Asp-Glu, Arg2-Ser2, Ile-Met, and Cys-Trp, where the third position is essential for proper translation and should be correctly recognized during amino acid loading. These pairs could have evolved from two-letter coded amino acids loaded onto strands containing two separate subsets of the extra-anticodon unique identity nucleot(s)ides, which, at some stage of evolution, came to be processed by two different synthetases. Unique extra-anticodon positions could then prevent docking errors at the third position of the anticodon site. This enabled the evolutionary differentiation of the third position of the anticodon and of the charging amino acids while the first and second positions of the anticodon tandem remained fixed. A similar mechanism, involving entire sets of identity nucleot(s)ides, could also have permitted the differentiation of the lysyl-tRNA synthetase classes (LysI and LysII). There are probably many other consequences of the existence of the identity nucleot(s)ides and their unique subsets within tRNA strands, e.g., in determining the third-order structure, which require a detailed analysis in the future. This supports a positive answer to the central research question of this paper regarding the importance of identity nucleotides beyond the anticodon.
The identity positions determined by the presented model (Figure 22) cover 62% (60/97) of the positions conserved in the three domains of life, as reported in a recent review [22]. A further 55 positions predicted across all classes are not cited in this review, and 37 reported positions are not predicted. This may be due to the limitations of the literature study and the performance of the trained model (which might not always be the highest) at the assumed parameters. Both factors can be improved in future research.
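The overlap figures quoted above can be checked with simple arithmetic. The sketch below assumes that the three counts (shared, predicted only, reported only) refer to the same kind of position entry, so that they can be added; the "precision" line is a derived quantity that is not stated in the text.

```python
shared = 60          # predicted and reported in the review
predicted_only = 55  # predicted but not cited in the review
reported_only = 37   # reported in the review but not predicted

reported_total = shared + reported_only    # positions reported in the review
predicted_total = shared + predicted_only  # positions predicted by the model

# Share of reported positions recovered by the model (the 62% in the text).
recall = shared / reported_total

# Share of predicted positions confirmed by the review (derived, assumption-based).
precision = shared / predicted_total
```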
Independent results for the Gibbs free energy of tRNA-aaRS interactions are not known to the authors. The only known estimate of long-range electrostatic attractions, presented in [23], is of the order of a few k_BT, which is similar in magnitude to the results of our models. In the authors’ opinion, the presented application of the machine learning tool and the thermodynamic approach leads to interesting results that are hard to obtain using other methods, which is why they are worth publishing.
The discussed importance of identity nucleotides makes the presented findings, such as universal markers, unique ensembles, or Gibbs free energy of tRNA-aaRS binding, important to the broader field of molecular biology or tRNA research.
While the present paper focuses on tRNA, some evolutionary aspects, among others the number of identity positions as an important parameter related to evolution, are a good basis for the future study of the coevolution of tRNA and aaRS. Although one may expect that, due to transcriptional and translational mechanisms, the synthetases evolve more slowly than the corresponding tRNA strands, so that their evolution could not be observed, the observed divergence of tRNA-ligase classes (LysI and LysII) is the first gate to this area.