A Suggestion of Converting Protein Intrinsic Disorder to Structural Entropy Using Shannon’s Information Theory

We propose a framework to convert the protein intrinsic disorder content to structural entropy (H) using Shannon’s information theory (IT). The structural capacity (C), which is the sum of H and the structural information (I), is equal to the amino acid sequence length of the protein. The structural entropy of the residues spans a continuous spectrum, ranging from 0 (fully ordered) to 1 (fully disordered), consistent with Shannon’s IT, which scores the fully determined state 0 and the fully uncertain state 1. The intrinsically disordered proteins (IDPs) in a living cell may participate in maintaining its high-energy, low-entropy state. In addition, under this framework, the biological functions performed by proteins and associated with the order or disorder of their 3D structures could be explained in terms of information gains or entropy losses, or the reverse processes.


Introduction
Protein structures have played a central role in molecular biology ever since the "lock-and-key" model for enzymatic catalysis [1], along with the sequence-structure-function dogma [2] and the protein folding landscape theory [3]. However, many proteins from different organisms have been found to lack stable, well-folded tertiary structures in their native states. These proteins are termed intrinsically disordered proteins (IDPs) or intrinsically disordered protein regions (IDPRs) [4]. The term "intrinsic" here means that the quantity is encoded in the protein primary amino acid sequence, on the same grounds as the IDP denomination [5]. The abundance of IDPs and IDPRs, and the vital roles they play in cellular organisms and viruses, have been appreciated in a plethora of publications during the past two decades [6,7]. Despite the extensive discussion of IDPs in the literature, "paradoxes" seem to be present in IDP structures and functions, including reduced structural content versus enhanced structural heterogeneity, simplified folding landscapes versus complexified functionalities, etc. [8]

The intrinsic disorder content of each residue in a protein can be predicted from its primary amino acid sequence. It is now well understood that the IDPs lack the hydrophobic cores necessary for successful folding [9] and that the charged residues (R, K, H, E, and D) are disorder-promoting [10]. An early study showed an association between the ratio of the number of charged to hydrophobic residues and the disorder content for a relatively small set of proteins [11]; however, this oversimplified approach was not predictive for larger protein sets. As representative present-day algorithms, the Predictor Of Natural Disordered Regions (PONDR) family of predictors uses artificial neural networks to integrate attributes including the amino acid composition, sequence complexity, hydropathy, and net charge, and they yield reasonably accurate intrinsic disorder predictions [12].
Developing accurate intrinsic disorder predictors is an endless path [13], considering the astronomical size of the protein world (e.g., 20^100 different proteins of 100 amino acids) and the crowded and dynamic cellular environments surrounding the proteins. Quality assessments have been performed among different protein disorder predictors [14,15], which adopt the same standard for the intrinsic disorder score, representing the probability of being disordered: a fully ordered residue scores 0 (0%) and a fully disordered residue scores 1 (100%). A unified guideline for reporting protein intrinsic disorder has also been called for [16]. In intrinsic disorder interpretations, ambiguities exist for both amino acids and proteins. For example, if an amino acid has an ID score of 0.5, is it an ordered or disordered residue? Similarly, for a protein with 50% disordered residues, is it an ordered or disordered protein?
In the present work, we suggest a conversion of the intrinsic disorder contents of the residues in proteins into entropy contents, similar to the treatment developed by Shannon over 70 years ago [17]. In information theory (IT), the Shannon entropy uses a scoring system very similar to that of the ID predictors: a fully determined state scores 0 and a fully uncertain state scores 1. It is clear that the intrinsic disorder contents constitute part of the information that the protein carries. However, a closer examination is required to determine whether the intrinsic disorder content can be treated as the disorder entropy, or structural entropy. Below we show that the protein intrinsic disorder (D) cannot be regarded as the structural entropy (H) directly. In addition, we propose a means for converting D to H that provides a conceptual framework quantitatively defining the structural information carried by a protein in vitro.
Hereafter, we will use structural entropy, or simply entropy, in accordance with Shannon's insight that information and entropy serve as measures of each other: information is negative entropy, or negentropy (Equation (2) and see below). Note that the term "structural" assigns a meaning to the information; however, the semantic aspect of information was originally discarded by Shannon [17] and later by some molecular biologists [18]. Still, other biologists argued that meanings or purposes are crucial to biological information [19,20]. Discussions of syntactics versus semantics in IT can be found in other publications [21], but are beyond the scope of this paper.
A fundamental difference between D and H is that the structural entropy, H, is an additive quantity, whereas the ID score, D, is not (see the Appendix in the supplementary material (SM)). That is, adding up structural entropies of all residues leads to the structural entropy of the entire protein, whereas we cannot infer the ID score of a protein by adding up the ID scores of all its residues. We suggest that the (intrinsic) structural entropy either of a single residue in the protein or of the entire protein can be directly calculated from a disorder predictor. Our suggestion indicates that it is feasible to quantitatively compare structural entropies of two proteins, regardless of their respective disorder percentages, which are rough and qualitative measures of the protein disorder.
Our suggestion may also help to understand protein structures and functions in the cells quantitatively and provide an easy answer to the paradoxes mentioned above: Reduction in information is equivalent to an increase in entropy or complexity. Under this framework, the processes in a living cell might be quantitatively viewed as the procedures for maintaining the low entropy levels.

Shannon's Equation: Structural Entropy (H), Information (I), and Capacity (C)
Shannon's equation [17] implies:

H(X) = -K Σ_{i=1}^{n} p_i log p_i,  0 ≤ H(X) ≤ H_max(X) = C(X),  (1)

where H(X) is defined as the entropy, or uncertainty, of X, which is any source of information with n states, and p_i is the probability of the i-th state of X. K is an arbitrary scaling factor, which means that the base of the logarithm can vary depending on the unit of the entropy. For example (setting K = 1), a base of two corresponds to the unit of the binary digit (bit), and the natural base (ln) leads to the natural unit (nat). The maximum value of H(X), H_max(X), determines the upper limit, or capacity, of X, C(X), which Shannon considered the channel capacity for communications [17]. The second half of Equation (1), though not explicitly stated by Shannon, indicates that, for a given capacity C(X), the entropy H(X) can serve as a measure of the information I(X) that one has received, that is,

I(X) = C(X) - H(X).  (2)

As a consequence, the information change ΔI was termed negentropy (negative entropy) [22], which is what Maxwell's demon receives to generate order out of disorder [23].
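As a minimal illustration (not part of the original analysis), the entropy and negentropy bookkeeping above can be sketched in a few lines, assuming K = 1 and base-2 logarithms (units of bits):

```python
import math

def shannon_entropy(probs, base=2.0):
    """Shannon entropy H = -K * sum(p_i * log p_i), with K = 1 and 0*log 0 taken as 0."""
    h = 0.0
    for p in probs:
        if p > 0.0:
            h -= p * math.log(p, base)
    return h

def information(capacity, entropy):
    """Information as negentropy: I = C - H."""
    return capacity - entropy

# A fair two-state source is fully uncertain: H = 1 bit, so I = 0 for C = 1 bit.
H = shannon_entropy([0.5, 0.5])
print(H, information(1.0, H))

# A fully determined source carries zero entropy.
print(shannon_entropy([1.0]))  # 0.0
```

A uniform four-state source gives H = 2 bits, matching the log2 n capacity of the source.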
To connect with the intrinsic disorder contents obtained by regular disorder predictors, the structural entropy of a protein X with a length of L is defined as:

H(X) = -Σ_{i=1}^{L} [x_i log2 x_i + (1 - x_i) log2(1 - x_i)],  with x_i = d_i/2,  (3)

where d_i is the intrinsic disorder content of the i-th residue.

In the fully disordered state (d_i = 1 for all residues), the protein exhibits the maximal entropy, which defines the structural capacity:

C(X) = H_max(X) = L,  (4)

and which carries zero structural information I(X). When all residues are fully ordered (d_i = 0 for all residues), H(X) = 0, since lim_{x→0} x log x = 0, and I(X) reaches the maximum of L. Normally, 0 ≤ d_i ≤ 1 and 0 ≤ H(X) [or I(X)] ≤ L. The criteria for using the logarithmic measure suggested by Shannon [17] then hold. In Equation (3), each residue is treated as a two-state system, with probability x_i in one state and (1 - x_i) in the other, so the total number of states is n = 2^L. More details and potential alternatives to this approach are summarized in the Appendix in the Supplementary Material (SM).

Equation (4) indicates that the structural capacity (C) of a protein is its amino acid (AA) sequence length (L). In addition, it has been reported that C follows an exponential or a gamma distribution [24] (protein length L was used in the original report, but since it is equivalent to C, only C is used hereafter to prevent confusion). We used three models, the exponential, gamma, and power-law distributions, to fit the structural capacities arranged in hierarchical ranks from the smallest to the largest (Table S1). The exponential model provides the best overall fit to the structural capacities in all proteomes studied in the present work. The gamma model in some cases produces a better fit than the exponential model, but it fails to recapture the small-C regions for the two animal proteomes (human and fruit fly). The power-law model fits the small-C regions but fails in the large-C regions. Therefore, the exponential model is the best choice here to describe the protein structural capacity distributions, as exemplified in Figure 1A,C. Fittings of all proteomes studied here are summarized in Table S1 in the SM.
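A minimal sketch of this bookkeeping, assuming the per-residue disorder scores d_i are already available from a predictor, and assuming the two-state mapping x_i = d_i/2 (so that a fully ordered residue contributes 0 bits and a fully disordered residue contributes 1 bit):

```python
import math

def binary_entropy(x):
    """Entropy (in bits) of a two-state system with probabilities x and 1 - x."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -(x * math.log2(x) + (1.0 - x) * math.log2(1.0 - x))

def structural_entropy(disorder):
    """H(X): sum over residues of the binary entropy at x_i = d_i / 2."""
    return sum(binary_entropy(d / 2.0) for d in disorder)

def structural_information(disorder):
    """I(X) = C - H(X), with the capacity C equal to the sequence length L."""
    return len(disorder) - structural_entropy(disorder)

# A toy 4-residue "protein": two fully ordered and two fully disordered residues.
d = [0.0, 0.0, 1.0, 1.0]
H, I = structural_entropy(d), structural_information(d)
print(H, I)  # 2.0 2.0  (and H + I = C = L = 4)
```

Note that the additivity of H over residues is explicit here: the entropy of the whole protein is simply the sum of the per-residue entropies, which is not true of the raw ID scores.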
An exponential fit of C implies a linear fit of log C (with the Cs ranked hierarchically), which may therefore be a better measure of the protein-capacity distributions in proteomes [24]. Interestingly, with a bin size of Δlog2 C = 1, Gaussian distributions can be observed in log C for all proteomes, as exemplified in Figure 1B,D. The underlying mechanisms of this Gaussian distribution have not been discussed in the literature; however, the distribution density can also be understood as a potential of mean force, or free energy, using the relationship:

G_i = -RT ln ρ_i,  (5)

where R is the gas constant, T is the temperature, and ρ_i is the density of proteins in the i-th histogram bin of protein capacities (G_i is defined up to an additive constant). A preferential free-energy well (insets in Figure 1B,D) may be present in the proteome, confining the protein structural capacity distributions. The locations and depths of this well vary among organisms: for the human proteome, the well is 3.16 kcal/mol deep and centered at C = 592, whereas for JCVI-Syn3.0 it is 1.70 kcal/mol deep and centered at C = 401.
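The density-to-free-energy conversion described above can be sketched as follows; the histogram densities used here are purely illustrative, not data from any proteome studied in this work:

```python
import math

R_GAS = 1.98720425e-3  # gas constant in kcal/(mol*K)

def pmf_from_density(densities, temperature=300.0):
    """Potential of mean force G_i = -RT ln(rho_i), shifted so the deepest well sits at 0."""
    g = [-R_GAS * temperature * math.log(rho) for rho in densities]
    g_min = min(g)
    return [gi - g_min for gi in g]

# Hypothetical normalized densities over log2(C) bins (illustrative only).
rho = [0.02, 0.10, 0.40, 0.30, 0.15, 0.03]
pmf = pmf_from_density(rho)

# The most populated bin defines the bottom of the free-energy well...
assert pmf[rho.index(max(rho))] == 0.0
# ...and the well depth is the height of the least populated bin relative to it
# (about 1.8 kcal/mol for this toy histogram at 300 K).
depth = max(pmf)
```

The shift by g_min reflects that only free-energy differences between bins are physically meaningful here.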
Figure 1. Distribution of the structural capacity C (left) and Gaussian distributions of log2 C (right) in the proteomes of (A,B) Homo sapiens (blue) and (C,D) JCVI-Syn3.0 (red). In the left figures, the brown and green dots are the longest and shortest proteins of the proteomes, respectively. The red-dashed lines are the exponential fittings. The insets in B and D convert the probability distributions of proteins to potentials of mean force (in kcal/mol) at 300 K using Equation (5). Distributions of C using different fitting models are summarized in Figure S4 and Table S1 of the Supplementary Material (SM).

Entropy-to-Information Ratio, R
The extremely dynamic environment in the cell makes it difficult to detect or predict changes in the structural entropy and structural information. However, the sum of the two quantities equals the structural capacity, which is the same as the AA sequence length, so an information gain always equals an entropy loss and vice versa. Hence, it is useful to know the relationship between the intrinsic structural entropy and information, which can be derived (Equation (3)) from the residue disorder contents, which, in turn, can be predicted from the protein amino acid sequence.
Here, we define the entropy-to-information ratio, R, as R = H:I. R ranges from 0 (for a fully ordered protein with H = 0 and I = C) to ∞ (for a fully disordered protein with H = C and I = 0). When R > 1, the protein is entropy rich (H > I), and when R < 1, the protein is information rich (H < I).
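A small sketch of the ratio R = H:I; the structural-entropy calculation again assumes the two-state mapping x_i = d_i/2 introduced for Equation (3), and fully disordered proteins (I = 0) are reported as R = ∞:

```python
import math

def structural_entropy(disorder):
    """H(X) with x_i = d_i / 2; the 0*log 0 terms are taken as 0."""
    h = 0.0
    for d in disorder:
        x = d / 2.0
        if 0.0 < x < 1.0:
            h -= x * math.log2(x) + (1.0 - x) * math.log2(1.0 - x)
    return h

def entropy_to_information_ratio(disorder):
    """R = H : I, where I = C - H and C = L; returns inf for a fully disordered protein."""
    H = structural_entropy(disorder)
    I = len(disorder) - H
    return math.inf if I == 0.0 else H / I

print(entropy_to_information_ratio([0.0, 0.0, 0.0]))  # 0.0 -> information-rich limit
print(entropy_to_information_ratio([1.0, 1.0, 1.0]))  # inf -> fully disordered
```

A protein that is half fully ordered and half fully disordered lands exactly at R = 1, the boundary between the entropy-rich and information-rich regimes.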
We noticed no apparent correlation between C and R in any of the proteomes; e.g., the Pearson correlation coefficient between C and R is -0.03 for both H. sapiens and Arabidopsis thaliana, and a similar trend can be found in the other proteomes studied in the present work. This lack of correlation permits the construction of a protein space with two attributes, C and R: the former represents the sum and the latter the ratio of the structural entropy (H) and information (I), respectively. This space is termed the protein CR-space hereafter. In the CR-space, C_i sets the upper limit and R_i gives the intrinsic ratio between the total H_i and I_i of the i-th protein P_i. Based on the Gaussian distributions in both log2 C and log2 R, the protein distributions in the CR-space are shown in Figure 2 for proteins expressed in representative prokaryotes (bacteria and archaea, Figure 2A), eukaryotes (Figure 2B), and viruses (Figure 2C), and for datasets collected from the DisProt database [25] and the Protein Data Bank (PDB) [26] (Figure 2D).

Figure 2. Distributions of proteins in the CR-space from proteomes of (A) prokaryotes, (B) eukaryotes, (C) viruses, and (D) protein sets from DisProt and PDB [25,26]. Each protein i is represented by a black dot. The horizontal axis is C_i = H_i + I_i and the vertical axis is the ratio R_i = H_i/I_i. The horizontal dashed line (red) corresponds to R_i = 1; proteins above this line are entropy-rich (H_i > I_i) and those below it are information-rich (H_i < I_i). The vertical dashed line serves as a reference at the median capacity of the human proteome, C_i = 417. The total H-to-I ratios (ΣH:ΣI) are given in the maps. Note that the boundaries of the C and R axes vary across different proteomes and/or protein sets.

Table 1 summarizes the protein distributions in the CR-space for all proteomes and datasets; further discussion follows below. All data are represented using (C, R), where C is the structural capacity (horizontal axis in Figures 2 and 4) and R is the H-to-I ratio (vertical axis in Figures 2 and 4). The data for proteins with extreme C's or R's are given. For those with multiple entries, a range of C or R and the total number of proteins is listed. The infinite (∞) values of R come from totally disordered proteins with I = 0 and H = C. N represents the total number of different proteins in the proteome or dataset; for eukaryotic proteomes, only the primary protein at each gene locus is counted. The last column gives the total H divided by the total I of all proteins in the proteome or dataset. No average R's are given because proteins with R = ∞ occur in all eukaryotes, in the datasets, and in the archaea Lokiarchaeum and Nanoarchaeum equitans. Details of the names of organisms and datasets can be found in the Methods section.

Figure 2A,B show the protein distributions of prokaryotic and eukaryotic proteomes, respectively. The total numbers of proteins vary from the smallest synthesized bacterium, JCVI-Syn3.0, which has only 438 proteins [27], to the monocot plant Oryza sativa (rice), which has 48,782 proteins [28,29]. A significant difference between prokaryotes and eukaryotes is the overall R (ΣH:ΣI) of the proteomes: values for prokaryotes range from 1.29 (Nanoarchaeum equitans) to 1.57 (Ignicoccus hospitalis), whereas values for eukaryotes range from 2.10 (Amborella trichopoda) to 2.53 (H. sapiens).

Prokaryotes vs. Eukaryotes
Table 1 also shows the extreme C and R of all proteomes. The most distinct difference between prokaryotes and eukaryotes is the number of totally disordered proteins with R_max = ∞, which results from I = 0 (H = C). Among the prokaryotes, only one such protein can be found, in Lokiarchaeum [30], which is a strong candidate for representing the ancestral host archaeon that engulfed a bacterium (Rickettsiale being a potential relative [31]) that turned into a symbiont and eventually evolved into the mitochondria of eukaryotic cells. All eukaryotic proteomes, however, contain fully disordered proteins, from two in Saccharomyces cerevisiae to 37 in H. sapiens.
It should be mentioned that there are 25 selenoproteins in the human proteome, which contain the unusual residue selenocysteine (one-letter code U, three-letter code Sec); the disorder contents of these proteins cannot be predicted by the current version of the PONDR predictors (see the SM for a list of the human selenoproteins). These proteins were not included in the final analysis of the current study.

Figure 2C shows the protein distributions of two giant DNA viruses (giruses) and two RNA viruses. The R's of the viruses vary significantly, from 0.96 (human coronavirus) to 2.09 (Pandoravirus).

Viruses
Although the RNA viruses generally contain fewer than 10 protein-encoding genes, the protein intrinsic disorder (i.e., structural entropy) may reflect immune evasion and virus-host interactions, especially for the Ebola virus, in which all proteins are entropy-rich with R > 1 [32]. On the other hand, the giruses have proteomes similar to those of some cellular organisms in both total protein number and distribution in the CR-space. For example, ΣH:ΣI of Mimivirus is 1.56, close to that of the prokaryotes, and ΣH:ΣI of Pandoravirus is 2.09, close to that of some eukaryotes, such as the basal angiosperm A. trichopoda. Moreover, Pandoravirus has a fully disordered protein with a capacity of 29.

Datasets
The present work focuses on the intrinsic disorder and omits environmental cues, which certainly have a great impact on the determination of structural information and entropy. In contrast, the proteins collected in the datasets (DisProt [25] and PDB [26]) have been validated (and are therefore somewhat biased) experimentally. The protein distributions of the selected datasets are shown in Figure 2D.
Not surprisingly, the proteins from the DisProt database exhibit the highest ΣH:ΣI of 2.99. However, several proteins (8 out of 803, or 1%) have R < 1. This difference may be caused either by the protein disorder prediction methods used here or by inherent limits of the validation protocol of the database. The eight information-rich proteins from the DisProt database, as well as their sequences, are listed in the SM. The human Titin protein [33] (C = 34,350, R = 2.52) appears to be collected in DisProt; however, a closer look finds that the DisProt entry points to the nuclear magnetic resonance (NMR) structure (PDB entry 1BPV) of the type-1 module of Titin (C = 112, R = 2.30). Nevertheless, the Titin protein does have an extremely long IDPR based on our calculation (Figure S1 in the SM).
We further analyzed the protein sequence sets from the PDB. After removing redundant entries, each sequence in the following datasets is unique: the NMR set contains sequences of structures solved by solution NMR (8,476 entries), the X-ray 1.5 set contains sequences from X-ray crystallographic structures with resolution better than 1.5 Å (5,842 entries), and the X-ray 3.0 set contains sequences from X-ray structures with resolution worse than 3.0 Å (10,636 entries). The NMR method largely limits the size of the proteins, with the majority of Cs smaller than 200. However, the structures solved by NMR generally have higher flexibility, leading to larger R's (ΣH:ΣI = 2.65). The X-ray 1.5 and X-ray 3.0 sets have ΣH:ΣI of 1.78 and 1.73, respectively, with no significant difference between the two datasets. It seems that the structural entropies of protein structures deposited in the PDB are sensitive to the method but are not determined by the crystallographic resolution.
It is worth noting that many structures deposited in the PDB (especially those solved by X-ray crystallography) have missing residues; here, however, we are interested in their sequences, which contain the missing residues. We also noticed that fully disordered (i.e., R = ∞) proteins occur in all datasets from the PDB (Table 1). It may not be surprising that highly disordered proteins are solved by NMR, especially relatively small ones. However, even the X-ray 1.5 set contains fully disordered proteins. We list the fully disordered proteins with C > 20 in both the X-ray 1.5 and X-ray 3.0 sets in the SM (Tables S2 and S3). First, these proteins are relatively short, with the longest (1JCD, X-ray 1.5 set) having 52 residues. It is possible that the PONDR predictor did not perform well on these sequences, particularly the collagen sequences (Tables S2 and S3). On the other hand, all the listed proteins either are subunits of large protein complexes or form oligomers (mostly homo-trimers), suggesting that coupled folding and binding may help these IDPs attain their folded structures [34]. The mutual folding induced by binding (MFIB) database has collected 205 IDPs from the PDB (both X-ray and NMR structures) whose folding is induced by binding [35]. None of the proteins in Tables S2 and S3 has been collected in the MFIB database, but several entries listed in Table S3 have been collected in the intrinsically disordered proteins with extensive annotations and literature (IDEAL) database [36]. Our findings may help expand the collection of the MFIB, IDEAL, and similar databases.

Random Sequences
We randomly generated 500 protein sequences with random capacities in the range [50, 800]. Interestingly, this random set has a ΣH:ΣI ratio of 1.020 (Figure S5 in the SM), indicating that the total entropy and information are roughly 1:1. This ratio is smaller than those of the cellular proteomes shown in Figure 2 and is close to that of the viral proteome of human coronavirus (Figure 2C), suggesting that the relative excess of total structural entropy over total structural information observed in cellular organisms is not random.

Evolution of Protein Structural Entropies and Information
Based on the analysis of cellular organisms and viruses, the ΣH:ΣI ratio seems to set a boundary at approximately 1.6 to 2.0 separating prokaryotes from eukaryotes. If such a boundary exists, the two giant DNA viruses (giruses) studied here would fall on different sides of it, with Mimivirus close to or within the prokaryotic branch and Pandoravirus close to or within the eukaryotic branch.
Previously, by using the protein length L (structural capacity C) and the disorder content D, protein distributions in the space with attributes L and D have been used to reconstruct a phylogenetic tree that reveals intriguing evolutionary dynamics associated with the variations of L and D in proteomes [37]. A similar approach is used here to study protein distributions in the CR-space.
The CR-space is split into 5 × 4 blocks (Figure 3A and Table 2). The gene densities [38] in the different blocks are then calculated, and the distance between two organisms A and B is defined by the Euclidean equation:

D_AB = [Σ_{i=1}^{5} Σ_{j=1}^{4} (A_ij - B_ij)^2]^{1/2},  (6)

where D_AB is the distance between organisms A and B, and X_ij (X = A or B) is the gene density in the ij-th (1 ≤ i ≤ 5, 1 ≤ j ≤ 4) block. The calculated distance matrix is converted to a phylogenetic tree, as shown in Figure 3B. The RNA viruses are excluded from this tree because both have fewer than 10 proteins and are therefore not statistically significant.

As expected, the phylogenetic tree clearly separates eukaryotes from prokaryotes. The tree also correctly separates the animals (H. sapiens and Drosophila melanogaster) and the plants (A. thaliana and O. sativa), whereas the basal angiosperm (A. trichopoda) and the moss (Physcomitrella patens) sit at the basal positions of the eukaryotic branch. Mimivirus is located on the prokaryotic branch and Pandoravirus on the eukaryotic branch, next to the moss P. patens. This trend is also expected from the ΣH:ΣI ratios. However, similar to our previous study [37], the phylogeny failed to separate bacteria from archaea without introducing other attributes. Further investigations, such as considering the domain components of the proteins [39], will be required to understand the evolution of protein entropy.
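A minimal sketch of this distance calculation, assuming each organism's gene densities over the 5 × 4 CR-space blocks are given as a nested list (the grids below are hypothetical, purely for illustration):

```python
import math

def cr_space_distance(a, b):
    """Euclidean distance between two 5 x 4 gene-density grids (one per organism)."""
    assert len(a) == len(b) == 5 and all(len(row) == 4 for row in a + b)
    return math.sqrt(sum((a[i][j] - b[i][j]) ** 2
                         for i in range(5) for j in range(4)))

# Two hypothetical density grids (illustrative values only).
grid_a = [[0.1, 0.2, 0.0, 0.0]] * 5
grid_b = [[0.0, 0.2, 0.1, 0.0]] * 5
print(cr_space_distance(grid_a, grid_b))  # sqrt(0.1) ≈ 0.316
```

Computing this distance for every pair of organisms yields the symmetric distance matrix that is then fed to a standard tree-building method.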
The phylogenetic tree presented in Figure 3B (tree of proteomes (ToP)) is by no means intended to capture the evolution of species, which is usually derived from multi-sequence alignments (MSAs) of genes. However, this ToP might be viewed as depicting the evolution of structural entropies. It seems that higher organismal complexity favors higher entropic content, especially higher entropy-to-information ratios, in proteins.

Structural Entropy of Regulatory Proteins
Two categories of proteins, the transcription factors (TFs) and the kinases, are further examined for three model organisms: Escherichia coli, A. thaliana, and H. sapiens. Both categories are abundant in the organisms and play regulatory roles in the cells. Whilst regulation by the TFs often directly affects the expression or translation of genes or transcripts through recognition of, and/or interaction with, the nucleic acids, the kinases perform post-translational regulation by catalyzing phosphorylation of target AAs of proteins (often Ser, Thr, or Tyr) using the adenosine triphosphate (ATP) cofactor.

Figure 4. Distributions of the TFs and kinases of E. coli, A. thaliana, and H. sapiens in the CR-space. As in Figure 2, each protein is represented by a black dot in the maps; the horizontal axis is the structural capacity C_i and the vertical axis is the H:I ratio R_i. ΣH:ΣI is given for each group of proteins.

Figure 4 shows that the eukaryotes (the plant A. thaliana and the animal H. sapiens) possess higher ΣH:ΣI ratios than the prokaryote (E. coli), which is consistent with the trend in Figure 2. Moreover, the regulatory mechanisms may influence the H:I ratios (R's) of the proteins: recognition of, and interaction with, the DNA or RNA targets by the TFs may involve larger information gains (entropy losses) than the enzymatic catalysis by the kinases via a lock-and-key-like model. Therefore, we propose that more intrinsic structural entropy is required for regulation by the TFs than by the kinases, or possibly by other enzymes catalyzing different reactions in the cells. These differences are more significant in the eukaryotes (H. sapiens and A. thaliana) than in the prokaryote (E. coli). Moreover, the A. thaliana TFs have a larger ΣH:ΣI ratio than those of H. sapiens, indicating that the plant TFs may possess overall higher structural entropies than the animal TFs.

Discussion
Many algorithms have been developed to predict the probabilities of being disordered for amino acids in proteins, which are presented as the disorder contents of the residues; however, it is not straightforward to quantify how "disordered" a protein is using these approaches. Here, we suggest converting the disorder content to structural entropy, which is continuous and additive and therefore makes it feasible to quantitatively compare two proteins in terms of the degrees of structural entropy and/or information they possess.
As Erwin Schrödinger noted, "life feeds on negative entropy" [40]. The nature of negative entropy (negentropy) is, in fact, information. Our results suggest that proteins and other biomolecules also feed on information. The capacity of every protein is fixed (Equation (4)); therefore, the more information it receives, the lower the entropy it possesses. A recent paper further indicates that the IDPs in the cell are either globally or partially structured [41]. In other words, the structural entropy carried by the IDPs is fully or partially offset by structural information from the crowded and dynamic intracellular environment of the living cell.
It might also be expected that the structural entropy/information proposed here has implications for biological processes in which structural information gain/entropy loss (or the reverse) occurs, such as gene transcription, translation, protein folding pathways, protein-protein or protein-ligand interactions, and enzymatic reactions, to name a few. A recent work [42] reported that the general transcriptional coactivator CBP binds two different proteins, the negative feedback regulator CITED2 (target T1) and the hypoxia-inducible factor HIF-1α (target T2). The binding affinities of CBP for the two substrates are very close; however, even at a modest concentration, T1 rapidly displaces T2 in binding to CBP. In addition, it was shown that T1 exhibits a higher disorder content, and therefore a higher structural entropy, than T2. Under our approach, the higher specificity of T1 can be understood as a larger information gain for CBP upon binding T1 than upon binding T2.
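Under the fixed-capacity view (C = H + I), the maximum information a target can gain upon folding against a partner is bounded by the structural entropy it carries beforehand. The sketch below illustrates the comparison with hypothetical, uniform disorder profiles; the values 0.9 and 0.4, the equal lengths, and the p = d/2 mapping are all illustrative assumptions, not measured data for CITED2 or HIF-1α:

```python
import math

def binary_entropy(p):
    """Shannon entropy (bits) of a two-state source with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def structural_entropy(disorder_profile):
    # Assumed mapping p = d/2 (d = 1 -> 1 bit, d = 0 -> 0 bits).
    return sum(binary_entropy(d / 2.0) for d in disorder_profile)

# Hypothetical disorder profiles for two targets of equal length (40 residues):
t1 = [0.9] * 40   # stand-in for the more disordered target (T1, CITED2)
t2 = [0.4] * 40   # stand-in for the less disordered target (T2, HIF-1alpha)

# Upper bound on the information gained upon binding-induced folding
# = structural entropy available to be lost:
gain_t1 = structural_entropy(t1)
gain_t2 = structural_entropy(t2)
assert gain_t1 > gain_t2  # higher disorder -> larger potential information gain
```

The point of the sketch is only qualitative: for two targets of the same capacity, the one with the higher disorder content has more entropy to trade for information upon binding.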
This approach might be extended to DNA, RNA, and other biomolecules whose entropies and information can be estimated. It should be emphasized that quantifying information or entropy may require specifying the meaning of that information; this may especially be the case for biological information. For instance, the meaning of the information discussed in the present work is the structure of proteins, or, more precisely, the order-to-disorder continuum in the protein world.

Methods
Proteomes of organisms from all three domains of life have been surveyed in the present work. The three bacteria are the alphaproteobacterium Rickettsiales bacterium Ac37b [43], which may be a close precursor of eukaryotic mitochondria; the cyanobacterium Synechococcus elongatus PCC 7942 [44], which may represent the precursor of plant plastids; and the recently synthesized JCVI-Syn3.0 [27], the smallest known free-living organism. The three archaea are the smallest known free-living archaeon I. hospitalis; its parasite N. equitans, the smallest known archaeon [45]; and Lokiarchaeum (sp. GC14_75) [30]. The latter was reported to possess eukaryotic genes, such that a close relative may have served as the host that accommodated an obligate parasitic bacterium, with the archaeon and bacterium accidentally merging to produce the first eukaryotic cell some 2 billion years ago. The eukaryotic organisms include three plant species: the basal angiosperm Amborella trichopoda [46], the monocot O. sativa [29] (rice), and the eudicot model plant A. thaliana [47]; two animal species: H. sapiens [48,49] (human) and D. melanogaster [50] (fruit fly); the moss P. patens [51]; and the yeast S. cerevisiae [52]. Several viral proteomes were also studied, including two RNA viruses with fewer than 10 genes, the Ebola virus and a human coronavirus (which causes the common cold), plus two giant DNA viruses (giruses), the Mimivirus [53] and Pandoravirus (P. salinus) [54]. In addition, protein sequences from DisProt (v7.0) [25], the database of experimentally confirmed disordered proteins, and from the Protein Data Bank [26] (PDB, up to December 16, 2016) were also analyzed.
Redundant sequences from the PDB were removed so that each entry represents a unique protein sequence in the analysis: for each protein set, if there are multiple identical copies of a sequence, such as the identical chains of a homo-dimer or other oligomer, only one unique sequence is retained for further analysis.
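This deduplication step amounts to keeping the first occurrence of each distinct sequence. A minimal sketch (the record format and the PDB chain identifiers are placeholders, not the actual parsing code used in the paper):

```python
def unique_sequences(entries):
    """Keep the first occurrence of each distinct amino acid sequence.

    `entries` is an iterable of (identifier, sequence) pairs, e.g. parsed from
    FASTA records; identical chains of a homo-dimer or other oligomer
    collapse to a single entry."""
    seen = set()
    unique = []
    for ident, seq in entries:
        if seq not in seen:
            seen.add(seq)
            unique.append((ident, seq))
    return unique

# Hypothetical example: a homo-dimer (chains A and B share one sequence).
chains = [("1ABC_A", "MKV"), ("1ABC_B", "MKV"), ("1ABC_C", "GAT")]
# unique_sequences(chains) keeps only chains A and C.
```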
The PONDR-VSL2 algorithm [55] was applied to predict the intrinsic disorder content of all residues in a protein. The obtained disorder content was then converted to the probabilities associated with the structural entropy using Equation (3). A neighbor-joining method from the T-REX web server [56] was used to convert the distance matrix into phylogenetic trees. More details on converting the disorder contents to structural entropy, and on fitting the structural capacities with different models, can be found in the supplementary materials (SM).
GitHub repository: https://github.com/haoboguo/Protein-Structural-Entropy. Code written in R (random.seq.R) was used to generate 500 random sequences with random capacities in the range [50,800]. A shell script (entropy.csh) performs the protein structural entropy/information/capacity evaluations using Equation (3). The PONDR-VSL2 predictor was downloaded from http://www.dabi.temple.edu/disprot/predictor.php; Java is required to run the disorder predictions. See the descriptions in the repository above.
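For readers without R, the random-sequence generation step can be sketched as follows; the uniform length distribution, the fixed seed, and the use of the 20 standard residues are assumptions about what random.seq.R does, not a transcription of it:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def random_proteins(n=500, cmin=50, cmax=800, seed=0):
    """Generate n random amino acid sequences whose capacities (lengths)
    are drawn uniformly from [cmin, cmax], mirroring the role of random.seq.R."""
    rng = random.Random(seed)  # seeded for reproducibility
    return [
        "".join(rng.choice(AMINO_ACIDS) for _ in range(rng.randint(cmin, cmax)))
        for _ in range(n)
    ]

seqs = random_proteins()  # 500 sequences, lengths in [50, 800]
```

Such a random set provides the null distribution against which the ΣH:ΣI ratios of real proteomes can be compared (cf. Figure S5).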
Supplementary Materials: The following are available online at http://www.mdpi.com/1099-4300/21/6/591/s1. Figure S1: Profile of the structural entropy of the residues in the giant human Titin protein (C = 34350); Appendix on the derivation of the equations that convert the disorder contents to the probabilities of states (with Figures S2 and S3); Figure S4: The exponential, gamma, and power law fittings of the structural capacities of the human and JCVI-Syn3.0 proteomes; Table S1: Summaries of the exponential, gamma, and power law fittings of the protein structural capacities of the proteomes studied in this paper; a list of 25 selenoproteins in the human (H. sapiens) proteome, whose disorder contents cannot be predicted by PONDR; a list of eight information-rich proteins from the DisProt database (v7.0) and their sequences; Table S2: X-ray structures from the PDB with resolutions < 1.5 Å, R = ∞ (fully disordered), and C > 20 in sequences; Table S3: X-ray structures from the PDB with resolutions > 3.0 Å, R = ∞ (fully disordered), and C > 20 in sequences; Figure S5: Distribution of proteins in the CR-space for 500 randomly built protein sequences with capacities randomly chosen in the range [50,800]; the ΣH:ΣI ratio is 1.020 for this random set; the vertical dashed line represents the median capacity of 417 from the H. sapiens proteome, and the horizontal dashed line is at R = 1.