While intrinsically disordered proteins and regions (IDPs/IDRs, hereafter IDP/Rs) compose a significant part of the proteome, their nature and their disorder-based mechanism of function are long-overlooked biochemical paradigms. Their occurrence increases with organism complexity, which is mirrored in the portfolio of IDP/R activities, such as signaling, recognition, and regulation of translation and transcription. In addition, IDP/Rs are considered more promiscuous and evolvable than globular proteins [1].
The range of IDP/R properties and functions is so broad that the scientific community has struggled to establish a simple classification strategy. Different classification schemes have been based on function, functional features, sequence motifs or biophysical properties [2]. The simplest structural distinction can be made between (i) IDP/Rs that fold upon binding or in response to a specific environment and (ii) IDP/Rs that, to the best of our knowledge, remain unfolded. These two groups are referred to as “induced fold” and “unfoldable” IDPs, respectively, throughout this report.
It is generally agreed that all the distinct functional and structural properties of IDP/Rs are rooted in their amino acid sequences [3]. Soon after the first collection of IDP/Rs was assembled, a marked bias was noticed in both their amino acid compositions and their sequences [3]. First, IDP/Rs are depleted in the “order-promoting” residues (CWYFILHVNM) and enriched in “disorder-promoting” residues (KEQSPRGA), which combines a relatively high net charge with low mean hydrophobicity. Second, the bias is often exacerbated by low complexity regions in IDP/Rs containing multiple repeats [5]. Because they lack structural constraints, IDP/Rs are considered freer to explore a larger sequence space [5]. This could imply that the biophysical properties of IDPs are easier to maintain than those of globular proteins. To what extent these properties result from the amino acid composition of globular and intrinsically disordered proteins, and to what extent they are optimized by the order of amino acids in the protein chains, is not known.
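The compositional bias described above can be made concrete with a small calculation. The following is a minimal sketch, not the pipeline used in this study: it takes the two residue classes exactly as listed in the text and adds a crude net charge per residue. The function names and the charge convention (K and R as +1, D and E as -1, histidine and termini ignored) are assumptions made for illustration.

```python
# Residue classes exactly as listed in the text.
ORDER_PROMOTING = set("CWYFILHVNM")
DISORDER_PROMOTING = set("KEQSPRGA")


def class_fractions(seq):
    """Return (order-promoting fraction, disorder-promoting fraction) of seq."""
    n = len(seq)
    order = sum(1 for aa in seq if aa in ORDER_PROMOTING) / n
    disorder = sum(1 for aa in seq if aa in DISORDER_PROMOTING) / n
    return order, disorder


def net_charge_per_residue(seq):
    """Crude net charge per residue.

    Assumption: K and R count as +1, D and E as -1;
    histidine and the chain termini are ignored.
    """
    positive = sum(1 for aa in seq if aa in "KR")
    negative = sum(1 for aa in seq if aa in "DE")
    return (positive - negative) / len(seq)
```

Under these assumptions, a sequence dominated by K, E, P, S, G and A would score a high disorder-promoting fraction regardless of the order in which those residues appear, which is the sense in which this measure depends on composition only.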
In this study, we address these issues by analyzing physicochemical properties and performing an amino acid sequence permutation experiment. Basic properties of intrinsically disordered and structured globular protein datasets are compared using a set of bioinformatic analyses at the level of their native sequences and their sequence permutations. Because the amino acid composition is preserved during the permutations, the contributions of composition versus sequence to selected protein properties can be addressed directly. Here, we report how protein composition determines the spectrum of basic biophysical properties that a sequence can adopt. The most general classes of IDPs (induced fold and unfoldable) can be distinguished based on these properties. On the other hand, the spectrum of properties for a fixed composition appears to be quite narrow for aggregation, while sequence rearrangements seem to have a slightly more pronounced impact on fine-tuning secondary structure elements.
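The logic of the permutation experiment can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the procedure used in this study: residue order is shuffled while composition is held fixed, and an order-dependent property is compared between the native sequence and its permutations. The property used here, `longest_hydrophobic_run`, is a hypothetical toy stand-in for the secondary structure and aggregation predictors; its residue set, the number of permutations and the seed are likewise assumptions.

```python
import random
from collections import Counter


def permute_sequence(seq, rng):
    """Shuffle residue order; the amino acid composition is unchanged."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def longest_hydrophobic_run(seq, hydrophobic="AVILMFWC"):
    """Toy order-dependent property (hypothetical stand-in for a predictor)."""
    best = current = 0
    for aa in seq:
        current = current + 1 if aa in hydrophobic else 0
        best = max(best, current)
    return best


def native_vs_permuted(seq, prop, n_perm=100, seed=0):
    """Return the native value of prop and its mean over n_perm permutations."""
    rng = random.Random(seed)
    permuted = [prop(permute_sequence(seq, rng)) for _ in range(n_perm)]
    return prop(seq), sum(permuted) / n_perm
```

Because `Counter(permute_sequence(seq, rng)) == Counter(seq)` holds for every permutation, any difference between the native value and the permuted mean can be attributed to residue order rather than composition.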
The goal of this study is to quantify the importance of amino acid composition versus sequence for the physicochemical properties of IDPs and globular proteins. As expected, IDPs have, on average, lower predicted secondary structure content, lower aggregation propensity and a biased amino acid composition. This is in accordance with earlier studies [4]. However, IDPs exhibit a broad range of these properties. We can conclude that IDPs that are foldable upon external triggering (induced fold) have compositions and secondary structure/aggregation propensities similar to those of folded proteins, which distinguishes them from unfoldable IDPs. Amino acid composition seems to be the major determinant of this distinction. Unfoldable IDPs are, on average, more enriched in, e.g., glutamate and lysine (located on the PC2 axis in Figure 2). Some other representatives of IDPs show a high content of glycine, proline and serine (on the PC1 axis in Figure 2). These findings recapitulate the fact that there are different flavors of disorder, which are distinguishable by amino acid composition and biological function [20].
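To illustrate how a principal component analysis can separate proteins by composition, the following sketch extracts the first principal component of 20-dimensional composition vectors. It is a minimal standard-library implementation based on power iteration, written so that it runs without dependencies; it is not the analysis behind Figure 2, and a real analysis would more likely use a numerical library. The function names, the seed and the iteration count are assumptions.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(seq):
    """Fraction of each of the 20 standard amino acids in seq."""
    n = len(seq)
    return [seq.count(aa) / n for aa in AMINO_ACIDS]


def first_principal_component(rows, n_iter=200, seed=0):
    """Return (unit loading vector, per-row scores) for the first component.

    Uses power iteration on the covariance matrix of mean-centered rows.
    A random start vector is used because composition vectors sum to one,
    so a uniform start would be annihilated by the covariance matrix.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # all rows identical: no variance to explain
            break
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores
```

Applied to a toy set of K/E-rich and G/P/S-rich sequences, the scores of the two groups would be expected to fall on opposite sides of zero. The sign of a principal component is arbitrary, so only the separation is meaningful, not which group is positive.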
Previous studies using polymer theory and biophysical measurements have defined three distinct compositional classes based on the fraction of charged versus polar residues [21]. One of these classes (polar tracts) was reported to be enriched in polar amino acids, which probably corresponds to the PC1 component of our analysis (Figure 2). In further agreement, the PC2 component of our analysis (enriched in charged residues) corresponds to the classes of polyampholytes/polyelectrolytes, which would have an intrinsic tendency to populate expanded conformations [22]. Our study is therefore in accordance with the IDP compositional bias reported previously. In addition, we show that compositional analysis and secondary structure prediction can further differentiate the induced fold IDP class, which is very similar to globular proteins in these characteristics and would not fit easily into the categories described by polymer theory.
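The two coordinates on which such a charged-versus-polar classification rests can be computed directly from a sequence. The sketch below is only illustrative: the residue sets (D, E, K, R as charged; S, T, N, Q, G, H as polar) and the function names are assumptions, and no class boundaries are applied because the text does not state them.

```python
CHARGED = set("DEKR")    # assumption: residues charged at neutral pH
POLAR = set("STNQGH")    # assumption: uncharged polar residues


def fraction_in(seq, residues):
    """Fraction of positions in seq occupied by any residue in `residues`."""
    return sum(1 for aa in seq if aa in residues) / len(seq)


def charge_polar_coordinates(seq):
    """Return (fraction charged, fraction polar) for one sequence."""
    return fraction_in(seq, CHARGED), fraction_in(seq, POLAR)
```

Both values depend on composition alone, so every permutation of a sequence maps to the same point in this plane.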
The sequence permutation experiment was used to resolve the contributions of sequence versus composition to the properties of globular proteins and IDPs. Upon sequence randomization within a given composition, IDP sequences increased in beta-sheet content, whereas folded proteins decreased moderately in secondary structure content. Sequence rearrangements appeared to have a smaller impact on aggregation propensity. However, it has been recognized that there is a tradeoff between folding and aggregation propensity, and that the two cannot be optimized simultaneously [23]. While highly aggregating sequence stretches are avoided, some degree of aggregation propensity is tolerated, especially as the cost of achieving protein structure [24]. Loss of secondary structure content upon sequence randomization of folded proteins is suggestive of this tradeoff, in which aggregation propensity can be counteracted by folding. Similarly, suppression of beta-sheet content in IDPs could minimize their risk of aggregation. This is in agreement with a previous study reporting that IDP compaction is preferentially achieved by forming alpha-helices [17]. Interestingly, randomized IDP and PDB proteins have been reported to have an increased propensity for amyloidogenesis [17]. For a given composition, protein sequence is therefore probably an important variable in structural compaction [25].
We can conclude that changes in properties after sequence randomization are not dramatic, as observed in Figure 3. This is in agreement with the observation that the secondary structure and aggregation propensity of random sequences are overall similar to those of biological proteins, and that even de novo genes have aggregation propensities similar to those of existing proteins [26]. Moreover, our results are not surprising, as compositions have been reported to be conserved among orthologs of IDPs even if their sequences are poorly conserved, and IDP composition has been suggested to define an IDP functional classification scheme based on these findings [22].
However, it is important to keep in mind that the consensus secondary structure and aggregation predictions used in this study both have an accuracy of 70–80% (Table S2). It is possible that some of the observed phenomena were biased by the predictions.
Overall, our study suggests that amino acid composition is the most important determinant of behavior within the IDP class. Based on bioinformatic predictions, sequence rearrangements have limited effects on the physicochemical properties, both aggregation propensity and secondary structure content. Nevertheless, further experimental verification of this outcome is needed.