Revisiting Chameleon Sequences in the Protein Data Bank

The steady growth of the Protein Data Bank (PDB) makes it worthwhile to periodically repeat searches for chameleon sequences, i.e., sequences that form different secondary structures in different protein structures. This paper presents a fast (n log(n)) algorithm for such searches and reports the results on all protein structures in the PDB. The longest such sequence found consists of 20 residues.


Introduction
It is the accepted wisdom of structural biology that the amino acid sequence of a protein determines its three-dimensional structure. This is rather remarkable since individual peptides are quite flexible, as can be deduced from the fact that large areas of the Ramachandran plot [1] represent actually observed conformations, which is confirmed by many simulations [2]. This apparent conundrum can be resolved by assuming that a sequence has to be long enough to determine the structure. The first attempt at clarifying this issue originated from the Paracelsus challenge of Creamer and Rose [3], who posed the challenge of changing less than 50% of the sequence to cause a change in the fold (i.e., the tertiary structure). Not only was the challenge soon met, but recent work found examples in which changing only 10% of the sequence still changed the fold [4]. The complexity of the relationship between sequence and structure is further highlighted by the discovery of misfolded proteins; the importance of understanding misfolding, in some cases combined with (or possibly helped by) aggregation, stems from the fact that misfolded proteins are often involved in serious diseases like mad-cow disease [5], Alzheimer's [6], and Parkinson's [7]. Another issue that somewhat muddies the sequence-structure relationship is the discovery of intrinsically disordered proteins (IDPs), which highlights the important role the protein's environment plays in forming its structure.
However, the prediction of the protein structure is a 'Hard' problem [8]. Thus, initial attention has been focused on the problem of predicting the secondary structure from the sequence [9]. Clearly, successful prediction of the secondary structure is a necessary, but not sufficient, condition of successfully predicting the tertiary structure. The usefulness of the ability to predict secondary structure from sequence is further highlighted by the hierarchical nature of protein folding [10], i.e., during folding, secondary structures form locally as the first step. Given the known flexibility of peptides (vide supra), it is not surprising that the reliability of such prediction is limited [11]. One limit on the reliability of sequence-based structure prediction is the existence of chameleon sequences, i.e., sequences that can form different secondary structures in different environments. In particular, the more reliable such predictions are, the less likely it is that a given peptide sequence is a chameleon. Given that conformational promiscuity is often involved in diseases, the characterization of chameleon sequences can contribute both to the field of structure prediction and to the fight against misfolding-related diseases.
The Protein Data Bank (PDB) [12] has been searched for chameleons several times, starting with the work of Kabsch and Sander [13]; as the number of structures in the PDB grew, several repeat searches were conducted (including one by this author) [14–19]. The longest chameleon sequences found involved 10 residues.
As the number of structures in the PDB grew, the searches were restricted [17,19] to non-redundant subsets like the Structural Classification of Proteins (SCOP) [20] or the set at NCBI (https://blast.ncbi.nlm.nih.gov/Blast.cgi). This sped up the search and also eliminated the problem of giving large weight to proteins whose structural variants (and their homologs) occur repeatedly. It has the drawback of possibly missing chameleons that only occur in a variant/homolog that was not included in the non-redundant set. The aim of this paper is to present a fast search algorithm, apply it to the proteins in the current PDB, and analyze the resulting chameleons. The chameleon search software will be available from the author's website at the URL http://inka.mssm.edu/~mezei/cham.

Materials and Methods
The chameleon search requires the list of sequences and the identification of the secondary structure elements in each. The search can use either of two sources. The first is the annotated list of sequences in the PDB, a file that can be downloaded from the PDB website (https://cdn.rcsb.org/etl/KabschSander/ss.txt.gz), with secondary structure annotations made with the DSSP (Define Secondary Structure of Proteins) algorithm [21]. The second is the set of actual full structure files, either in the legacy PDB format or in the more general PDBx/mmCIF format, where the secondary structure element annotations are made by the author(s) submitting the structure. Note that the DSSP and authors' annotations may differ in some instances. Following the procedure used in the author's earlier publication [15], chameleons are required to be in all-helix conformation in one structure and in all-sheet conformation in another; this is different from the convention that was used, e.g., in [16]. Note that when the annotated sequence file is used, helix includes α helix, 3/10 helix, and π helix.
The algorithm used searches for chameleons of a given length, referred to as L. The algorithm consists of the following steps.

1. For each PDB entry, extract all helix (H) and sheet (S) sequences as L-character strings (using the 1-character amino acid symbols) and save the PDBid, chain id, and starting residue number for each.

2. For each PDB entry that includes more than one chain, sort the list of sequences and eliminate sequences that occur repeatedly in different chains (as it is assumed that such duplicates are the result of chains repeated in polymeric proteins).

3. Add the character 'H' or 'S' (indicating helix or sheet) at the (L+1)th position of each remaining sequence.

4. Once all PDB entries are read, sort the L-residue peptide sequences using all L+1 characters.

5. Partition the sorted list into segments of identical sequences. If the first and last members of a partition have different characters at the (L+1)th position, a chameleon is found.
The list of chameleons obtained with this procedure will generally include chameleons that are actually parts of a longer chameleon. To filter out such chameleons, the following algorithm was used.

1. Perform the chameleon search starting with the longest possible length, L_max. It is set to 24 in the current implementation to be larger than the longest chameleon found, but it can be raised with minimal change in the code if needed (at the expense of using more memory).

2. For all chameleon lengths L, L_min ≤ L ≤ L_max, create an L-character list consisting of: (a) all chameleons just found of length L; and (b) all L-character substrings of the already found chameleons. L_min can be any positive number ≤ L_max.

3. Sort this combined list.

4. Scan the sorted list. Whenever more than one identical string is found, and one of these was from the just found L-residue chameleon list, that chameleon should be dropped as it is part of a longer chameleon.
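A possible rendering of this filtering step in Python is sketched below. The function name `keep_maximal` and the input format (a dictionary from length to the set of chameleon strings found at that length) are assumptions made for the example, not part of Cham.

```python
from itertools import groupby


def keep_maximal(found_by_length, l_min, l_max):
    """found_by_length: {L: set of L-residue chameleon strings} (hypothetical
    input format). Returns {L: sorted list of chameleons of length L that are
    not substrings of any longer chameleon}."""
    maximal = {}
    longer = []  # chameleons already found at greater lengths
    for length in range(l_max, l_min - 1, -1):
        found = sorted(set(found_by_length.get(length, ())))
        # (a) chameleons just found, tagged True;
        # (b) L-character substrings of longer chameleons, tagged False.
        tagged = [(s, True) for s in found]
        tagged += [(c[i:i + length], False)
                   for c in longer
                   for i in range(len(c) - length + 1)]
        tagged.sort()
        kept = []
        for string, run in groupby(tagged, key=lambda t: t[0]):
            run = list(run)
            # A just-found chameleon survives only if nothing else equals it.
            if len(run) == 1 and run[0][1]:
                kept.append(string)
        maximal[length] = kept
        longer.extend(found)
    return maximal
```

Substrings are taken from all longer chameleons found so far; since every non-maximal chameleon is itself a substring of a maximal one, this gives the same result as using only the maximal ones.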
However, sequences that were found to be parts of longer chameleons may not be extendable to longer chameleons in every protein in which they occur. Searching for such chameleons can be performed by the following algorithm.

6. For all chameleon lengths L < L_max, create an (L+1)-character list consisting of: (a) two copies of every occurrence of the chameleons just found of length L, one extended by the preceding residue and the other by the following residue of the protein in which that occurrence was found; and (b) all (L+1)-character substrings of the already found chameleons.

7. Sort this combined list.

8. Scan the sorted list. Whenever a string occurs only once and it came from an L-residue chameleon, that occurrence is a non-extendable one.
This algorithm was not implemented into the chameleon search program.
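Since this step is not part of Cham, the following Python sketch is only one possible reading of it. It treats an occurrence as non-extendable when neither of its one-residue extensions matches an (L+1)-residue substring of a longer chameleon, and it uses set membership in place of the sort-and-scan described above. Like that description, it compares strings only and ignores the secondary structure of the added residue. The function name and the input format are hypothetical.

```python
def non_extendable(occurrences, longer_chameleons, length):
    """occurrences: list of (protein_sequence, start), with 0-based `start`
    of an L-residue chameleon occurrence (hypothetical input format).
    longer_chameleons: chameleon strings longer than `length`.
    Returns the occurrences with no matching one-residue extension."""
    # All (L+1)-character substrings of the longer chameleons.
    targets = {c[i:i + length + 1]
               for c in longer_chameleons
               for i in range(len(c) - length)}
    result = []
    for sequence, start in occurrences:
        extensions = []
        if start > 0:  # extend by the preceding residue
            extensions.append(sequence[start - 1:start + length])
        if start + length < len(sequence):  # extend by the following residue
            extensions.append(sequence[start:start + length + 1])
        if not any(ext in targets for ext in extensions):
            result.append((sequence, start))
    return result
```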
The chameleon search was implemented in the Fortran77 program Cham, available at the author's website http://inka.mssm.edu/~mezei/cham. The string handling and sorting code were imported from the program Simulaid [22]. The sorting was performed with the merge-sort algorithm, which scales as n log(n). Since the complexity of every other step is linear in the database size, the complexity of the chameleon search is also n log(n).
The chameleon search algorithm is based on the character string representation of sequences; thus, it works equally well if a lower-resolution characterization of the sequence is used. The program Cham can read a mapping of the 20 residues onto a smaller character set and perform the chameleon search on that basis. Since the search is quite fast (minutes), it is easy to explore different mappings.
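As an illustration of such a mapping, the Python sketch below translates one-letter sequences into a three-class alphabet before the search. The class membership shown is a hypothetical grouping chosen for the example; the text does not specify which residues Cham assigns to each class, and the mapping file format of Cham is not reproduced here.

```python
# Hypothetical three-class grouping (assumption, not taken from Cham):
# H = hydrophobic, P = polar, C = charged.
HPC_CLASSES = {
    "H": "AVLIMFWC",
    "P": "GSTNQYPH",
    "C": "DEKR",
}


def build_table(classes):
    """Build a str.translate table from {class_label: member_residues}."""
    return str.maketrans({residue: label
                          for label, members in classes.items()
                          for residue in members})


def reduce_sequence(sequence, table):
    """Map a one-letter amino acid sequence onto the reduced alphabet."""
    return sequence.translate(table)
```

The reduced strings can then be passed to the same sort-and-partition search unchanged, since that search only compares characters.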
It is also of interest to see the chameleon-forming propensities of the different residues. There are several different ways that such propensities can be characterized:
(1) The number of occurrences of residue i in chameleons of length L, $N_{L,i}$, divided by $L \times NC_L$, where $NC_L$ is the number of chameleons of exactly L residues: $P_1(L,i) = N_{L,i}/(L \times NC_L)$ (or multiplied by 100 to obtain percentages).
(2) $P_1(L,i)$ normalized by a measure of the overall propensity of residue i. There are different options:
(2.1) $P_2(L,i) = P_1(L,i)/0.05$ (or 5 if percentages were used). This measure ignores the different probabilities of occurrences of the different amino acids. It was the measure used in [15].
(2.2) $P_3(L,i) = P_1(L,i)/P(i)$, where $P(i)$ is the overall probability of occurrence of residue i.
(2.3) $P_4(L,i) = P_1(L,i)/P_{HS}(i)$, where $P_{HS}(i)$ is the overall probability of occurrence of residue i in either a helix or in a sheet.
The program Cham calculates these propensities for each length and averages them cumulatively over all the lengths the chameleon search was performed.
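The four measures for a single length L can be computed as in the following Python sketch. The function name and the inputs (a list of exactly-L-residue chameleon strings and two dictionaries of residue probabilities given as fractions) are assumptions for illustration; the cumulative averaging over lengths performed by Cham is not shown.

```python
from collections import Counter


def propensities(chameleons, overall, helix_sheet):
    """chameleons: non-empty list of chameleon strings, all of length L.
    overall[i]: P(i), overall probability of residue i (fraction).
    helix_sheet[i]: P_HS(i), probability of residue i in helix or sheet.
    Returns {residue: {"P1": ..., "P2": ..., "P3": ..., "P4": ...}}."""
    length = len(chameleons[0])
    counts = Counter("".join(chameleons))   # N_{L,i}
    total = length * len(chameleons)        # L x NC_L
    result = {}
    for residue, p_i in overall.items():
        p1 = counts[residue] / total
        result[residue] = {
            "P1": p1,
            "P2": p1 / 0.05,                  # uniform 1/20 reference
            "P3": p1 / p_i,                   # normalized by P(i)
            "P4": p1 / helix_sheet[residue],  # normalized by P_HS(i)
        }
    return result
```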

Results
The program Cham was run on the file ss.txt downloaded from the PDB. The current analysis examined 138,870 PDBids and 394,364 chains. The number of sequences examined and the number of chameleons with L > 4 found are shown in Table 1 for chameleon searches at three different resolutions: (1) 20 residues; (2) 4 types only (H, P, +, − for Hydrophobic, Polar, positive, negative, respectively); and (3) 3 types only (H, P, C for Hydrophobic, Polar, Charged, respectively). Chameleons longer than eight residues are listed in Table 2. They were all examined visually using VMD (Visual Molecular Dynamics) [23] or PyMOL [24]. Note that in that process, several annotations were found to be incorrect; for now those PDBids are excluded from consideration by the program Cham. The residue numbers in Table 2 refer to the residue positions in the ss.txt file. They may differ from the residue numbers in the full PDB file (usually because parts of the protein's structure could not be determined). Table 2 also lists the number of structures in the PDB where this sequence is found. Multiple occurrences of long chameleons are likely to occur in homologous structures.
All chameleons longer than 10 residues were found in different structures of the same protein.
Chameleons involving seven or eight residues are listed in the Supplementary Materials. Chameleons shorter than seven residues, as well as the list of all occurrences of the chameleons found, can be generated by running the program Cham, available at the URL http://inka.mssm.edu/~mezei/cham.
Besides searching for chameleons, the program Cham also gathers some statistics on the full set: the distribution of helix and sheet lengths and the overall probability of an amino acid occurring in a protein, in helix, or in sheet conformation. It should be emphasized that these data are much more biased than statistics generated on non-redundant sets, this being one downside of using the full PDB. Since the main purpose of this work was the search for chameleons, using the full PDB was the only way to ensure that none were missed.
The distributions of helix and sheet lengths are shown in Figure 1. While the lengths of most helices and sheets fall in the [4,10] range, it is clear that longer secondary structure elements are more likely to be helices.
Table 3 presents various amino acid propensities in terms of percent occurrences. The overall percentages of amino acids are given both based on the data set used in this study and on the amino acid composition in the UniProtKB/Swiss-Prot data bank [25], read from the Expasy server at the URL https://web.expasy.org/protscale/pscale/A.A.Swiss-Prot.html. Table 3 also shows the percent of each residue in helix and in sheet conformation, respectively, as well as the ratio of the helix and sheet percentages to their overall percentages, factoring out the effect of different overall propensities. Since it can be assumed that the distribution of amino acids is less biased both in the Expasy set and in the non-redundant sets, comparison of the overall percentages calculated on the full PDB and on the Expasy data set gives a measure of bias introduced by not using a non-redundant set. The differences are mostly small; the largest are for valine and leucine (0.85 and 0.64, respectively).
Table 4 presents data characterizing the chameleon propensities of each amino acid. In accordance with earlier work, leucine, valine, and alanine were found to feature prominently in chameleons. The chameleon propensities calculated with the reduced-resolution residue sets HPC and HP+− are shown in Tables 5 and 6. In the HPC set, residues are clustered into Hydrophobic, Polar, and Charged sets, while in the HP+− set the clusters are Hydrophobic, Polar, Positive, and Negative.

Discussion
The algorithm presented for the chameleon search is fast enough to run easily on the full set of sequences in the PDB instead of just a representative, i.e., non-redundant, set. This choice led to the discovery of chameleons significantly longer than those previously found. There is, however, one drawback to using the full set: since many proteins have several homologues and many protein structures have several (nearly) identical domains, the data obtained will have an unspecified bias. This bias will not affect the distribution of unique chameleons, but will affect both the number of occurrences of chameleons and the amino acid frequency statistics. However, the comparison of the amino acid percentages calculated on the PDB and on the Expasy set shows only small differences; thus, the data set bias is small enough not to have a major effect on conclusions.
In any event, the reliability of the chameleon search hinges upon the reliability of the secondary structure annotation. Once a particular annotation is established, the proposed algorithm is guaranteed to find all chameleons.
The existence of such long chameleons also raises an intriguing problem. In [15] it was argued, assuming uniform distribution of amino acids, that chameleons with more than seven residues (the longest ones found in that work) are unlikely simply because the chance of longer sequences occurring more than once (never mind their conformation) is vanishingly small. The fact that, contrary to this suggestion, well over two hundred chameleons with more than seven residues have been found since that time suggests that there may be hidden organizing principles at work selecting protein sequences that fold into well-defined conformations. This suggestion is supported by the fact that a protein with a randomly selected sequence is unlikely to fold [26].
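The flavor of the counting argument can be seen from a rough estimate. This is an illustrative sketch under the assumption of independent, uniformly distributed residues, not a reconstruction of the calculation in [15]. The symbol N, denoting the number of L-residue segments compared, is introduced here for illustration. The expected number of pairs of identical segments is then

```latex
\[
  E[\text{identical pairs}]
  \;=\; \binom{N}{2}\, 20^{-L}
  \;\approx\; \frac{N^{2}}{2 \cdot 20^{L}} ,
\]
```

so each additional residue lowers the expectation by a factor of 20; for L = 8 the denominator already contains 20^8 ≈ 2.56 × 10^10. Real segments overlap and amino acid frequencies are not uniform, so this is at best an order-of-magnitude guide.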
Searching for residue patterns in chameleons is facilitated by the speed of the algorithm, since it makes it easy to perform the chameleon search with a variety of low-resolution residue definitions.It only takes the specification of the residue mapping and about 10 minutes on a laptop computer to generate the data.
The earlier observation that alanine, leucine, and valine are frequently seen in chameleons still holds in the current results. Interestingly, the effect diminishes as the propensities are normalized by overall amino acid propensities and further weakens if propensities to form secondary structure elements are used for normalization. This observation generalizes to the low-resolution chameleon propensities, where the hydrophobic residues were found to be dominant. Not surprisingly, no significant difference was found between the positive and negative residues in this respect.
There is also a counterintuitive observation about the helix and sheet propensities of amino acids. One would expect that a tendency for chameleon forming implies roughly equal propensities for helix and sheet formation. However, the two large differences observed were for leucine and valine, two of the three prominent chameleon-forming amino acids.

Figure 1. Distribution of helix length (full line) and sheet length (broken line) in the Protein Data Bank (PDB); length is the number of residues.

Table 1. $N_L(\mathrm{cham})$ is the number of chameleons of length L, and $N_{L;X}(\mathrm{cham})$ is the number of chameleons that are exactly of length L, i.e., are not part of a longer chameleon.

Table 2. Chameleons longer than 8 residues. PDBid1, Ch1, and Residue1 are the PDB id, chain id, and starting residue number of the protein where the sequence is in helix conformation, respectively; and PDBid2, Ch2, and Residue2 are the corresponding quantities for the protein where the chameleon is in the sheet conformation. The last column (Number) gives the number of proteins in which this chameleon occurs.

Table 3. Overall residue propensities. %(aa) and %(aa,Expasy) are the percent of the data set used in this study and of the Expasy data set, respectively, that is residue aa. %(helix) and %(sheet) are the percent of helix and sheet occurrences, respectively, of residue aa. %