DeepCNF-D: Predicting Protein Order/Disorder Regions by Weighted Deep Convolutional Neural Fields

Intrinsically disordered proteins or protein regions are involved in key biological processes including regulation of transcription, signal transduction, and alternative splicing. Accurately predicting order/disorder regions ab initio from the protein sequence is a prerequisite step for further analysis of functions and mechanisms for these disordered regions. This work presents a learning method, weighted DeepCNF (Deep Convolutional Neural Fields), to improve the accuracy of order/disorder prediction by exploiting the long-range sequential information and the interdependency between adjacent order/disorder labels and by assigning different weights for each label during training and prediction to solve the label imbalance issue. Evaluated by the CASP9 and CASP10 targets, our method obtains 0.855 and 0.898 AUC values, which are higher than the state-of-the-art single ab initio predictors.


Introduction
Protein sequence determining structure and therefore function is a long-standing paradigm in biology. However, this paradigm has been challenged ever since the discovery of the first functional protein featuring no stable structure over 50 years ago [1]. Since then, scientists have been discovering more and more proteins and/or protein regions with no unique structure that are involved in key biological processes, such as regulation of transcription, signal transduction, cell cycle control, post translational modifications, ligand binding, protein interaction, and alternative splicing [2,3]. Formally, these proteins and/or protein regions are defined as Intrinsically Disordered Proteins (IDP) or Intrinsically Disordered Regions (IDR).
Since the late 1990s, more interest has been focused on IDP and IDR due to additional evidence from experimental approaches such as NMR [4]. Due to the laborious and expensive nature of these experiments, numerous computational approaches have been proposed to predict the order/disorder regions [5]. Briefly, these methods could be categorized into two groups, (a) single ab initio methods that require only the protein sequence as input, and results are predicted based on a single method, such as IUpred [6], DISOPRED [7], SPINE-D [8], and DNdisorder [9]; (b) hybrid or consensus methods that exploit more information such as templates or output from other algorithms. Examples include PrDOS [10], PONDR [11], POODLE [12], and MetaDisorder [13]. For a more detailed introduction for these and other prediction methods, please refer to excellent reviews written by He et al. and Deng et al. [5,14].
Despite the good performance of currently available hybrid and consensus methods for order/disorder prediction [15], an accurate single ab initio method is still valuable since the improved results may contribute to other consensus methods, and then in turn to increase overall prediction accuracy. However, there are three challenges that block the advancement of single methods: (a) How to exploit the interdependency of adjacent order/disorder states (or labels), as well as the long-range sequential information that determines the order/disorder state at a certain position. In fact, adjacent order/disorder states are interdependent. In other words, the order/disorder state in one position influences adjacent states, as shown by the occurrence of long consecutively disordered regions. PrDOS-CNF [15] and OnD-CRF [16] exploit this interdependency between adjacent order/disorder labels through a linear-chain Conditional Random Field structure [17], but they are unable to describe the long-range sequence information; (b) How to solve the label imbalance issue. As the frequencies of order/disorder labels are strongly imbalanced (only ~6% of residues are disordered), using the conventional training method would bias the model towards fitting ordered states while performing poorly for disordered states; (c) How to incorporate additional features besides traditional ones. In short, it has been demonstrated that charged and hydrophilic residues, such as P, E, S, Q and K are more likely to be in disordered states, whereas neutral and hydrophobic residues, such as C, W, I, Y, F, L, M and H, are order-promoting residues [2]. It has also been suggested that evolutionary conservation largely determines order/disorder states [18]. Besides these amino acid and evolution related features, we need to find additional relevant and/or complementary features that could contribute to the order/disorder prediction accuracy. This paper presents a machine learning model, called weighted DeepCNF (Deep Convolutional Neural Fields), for order/disorder region prediction to overcome the three challenges described above.
(a) Weighted DeepCNF combines the advantages of both conditional neural fields (CNF) [19] and deep convolutional neural networks [20]. It not only captures medium-and long-range sequence information, but also describes interdependency of order/disorder labels among adjacent residues; (b) To solve the label imbalance issue, the reciprocal of the occurrence frequencies of order/disorder labels were used for weighting during training and prediction; (c) Besides the traditional amino acid and evolution related features that are relevant to order/disorder discrimination, a strong tendency has been shown for disordered regions to exist in the coiled state [18] and exposed to solvent [2]. Inspired by this phenomenon, additional structural properties, such as predicted secondary structure [21] and predicted solvent accessibility [22], were incorporated into our weighted DeepCNF model. Finally, experiments have shown that our method outperforms the state-of-the-art methods on publically available Critical Assessment of protein Structure Prediction (CASP) [15] dataset in terms of Area Under the ROC Curve (AUC) value.

Results
In this section we describe our experimental verification on the prediction for disordered regions. The first part describes the datasets we use, the programs to be compared with and the evaluation measures. The second part shows how we train and validate our DeepCNF model using a small Disorder723 dataset [23] by 10 cross validations to find out the best combination of the model parameters. In the third part, based on the best parameters determined from the Disorder723 dataset, we trained a model using a large but non-redundant UniProt90 dataset [24], and evaluated our method DeepCNF-D (here "D" stands for Disorder) with the other state-of-the-art programs on two publicly available CASP datasets [15,25].

Training, Validation, and Evaluation Dataset
Our weighted DeepCNF model for order/disorder prediction is trained and validated using Disorder723 dataset. This dataset, built by Cheng et al. [23] in May 2004, consists of 723 non-redundant chains which span at least 30 amino acids and were solved by X-ray diffraction with a resolution of around 2.5 Å.
A ten-fold cross-validation on this Disorder723 dataset was performed to determine the model parameter. In particular, the original dataset was randomly partitioned into 10 equal-sized subsamples. Among the 10 subsamples, a single subsample is retained as the validation data for testing the model, while the remaining nine are used as training data. The list of all these training/validating proteins, as well as the corresponding features for our model, could be found in Supplemental Data.
We further trained our weighted DeepCNF model using UniProt90 dataset for release purpose and evaluated it with two publicly available CASP datasets, i.e., CASP9 [25] and CASP10 [15]. CASP9 dataset contains 117 sequences while CASP10 contains 94. The sequences from UniProt90 dataset were downloaded from MobiDB [24], followed up by a 25% sequence identity filter to those sequences from the two CASP datasets. The number of remaining sequences for training our model is 20,337. This UniProt90 training list, as well as the list and features for CASP datasets, can be found in Supplemental Data. Note that the determination of disordered residues in the UniProt90 dataset follows the same procedure as Cheng et al. [23] for Disorder723 dataset.
We also analyzed the overall properties of the non-terminal disordered regions with different lengths on the four datasets used in this work. As shown in Table 1, although the testing datasets (i.e., CASP9 and CASP10) used in this study are dominated by proteins with short disordered regions, the two training datasets (i.e., Disorder723 and UniProt90) are relatively equally distributed in disordered region length, especially the UniProt90 dataset which is used to train our model for release. We compared our method with the following programs: IUpred [6], SPINE-D [8], and DISOPRED3 [7]. IUpred has two different prediction modes: IUpred (long) and IUpred (short). The former is designed for long disordered regions while the latter for short ones, and both modes are only based on the amino acid related features; SPINE-D and DISOPRED3 rely not only on amino acid related features, but also on evolution related features generated by PSI-BLAST. SPINE-D relies further on structure related features generated by SPINE-X [26]. It should be noted that all these programs to be compared are ab initio methods and are downloadable. Since our method is also an ab initio method, it might be unfair to compare ours with those clustering, template-based, and meta or consensus methods as defined in Deng et al. [14]. All results from IUpred, SPINE-D, and DISOPRED3 on the corresponding datasets can be found in Supplemental Data.

Evaluation Measurement
For evaluation of disorder predictors as binary classifiers we used the precision defined as TP/(TP + FP), the balanced accuracy (bacc) defined as 0.5 × (TP/(TP + FN) + TN/(TN + FP)), and the Matthews correlation coefficient (MCC) defined as , where TP (True Positives) and TN (True Negatives) are the numbers of correctly predicted disordered and ordered residues, respectively, whereas FP (False Positives) and FN (False Negatives) are the numbers of misclassified disordered and ordered residues, respectively [15].
As the problem of order/disorder region prediction is strongly imbalanced (only ~7% of residues are disordered), using the conventional accuracy and precision measurement might inflate performance estimates and is therefore not appropriate [27]. Another paradoxical issue is the decision threshold for order/disorder classification. Depending on how users set the threshold, a bias might be introduced, which could lead to an unfair comparison between distinct studies. To solve this issue, we consider the Receiver Operating Characteristic (ROC) analysis for the assessment of protein disorder predictions [27]. An ROC curve represents a monotonic function describing the balance between the True Positive (TP) and False Positive (FP) rates of a predictor. Since all the programs in this study could output a probability value, for a set of probability thresholds (from 0 to 1), a residue is considered as a positive case (i.e., disordered) if its predicted probability is equal to or greater than the threshold value. The area under the curve (AUC) is used as an aggregate measure of the overall quality of a prediction method. An area of 1.0 corresponds to a perfect predictor while an area of 0.5 means purely random.

Determining the Best Model Parameters on Disorder723 Dataset
There are three components of tunable parameters in the weighted DeepCNF model, (a) number of hidden layers; (b) weight of labels; and (c) combination of input features. It should be noted that in order to simplify the following analysis, we fix some of the other tunable model parameters such as the window size (fixed to 11) and the number of neurons at each layer (fixed to 50). Thus, by default, our weighted DeepCNF model contains two hidden layers, applies 0.7:9.3 for the weight ratio between order and disorder labels, and includes all 129 features from the three classes.

Contributions of Number of Layers
To show the relationship between the performance and the number of layers, we trained three different weighted DeepCNF models with one, two, and three layers, respectively. As shown in Table 2, the best AUC value is obtained by using two layers. The reason for getting inferior results when using three layers compared to using two layers may be due to over-fitting because the regularization factor (fixed at 200) is the same for all trained models. If the regularization factor is fine-tuned specifically for the three layer model, the results could be comparable to the two layer model. However, since the three layer model contains much more parameters than the two layer model, while achieving a similar result in terms of AUC value with much lower efficiency, we decided to use the two layer model in the following analysis. Table 2. AUC values of different layer models on 10 cross validation batch datasets of Disorder723. Note that other model parameters are fixed as default. The best value is shown in bold (the same convention is used in Tables 3-7).

Contributions of Different Weight Ratios
It is well known that the label imbalance issue occurs in many real-world prediction tasks [27]. A simple solution to address this problem is to assign different weights to different labels [28]. Here we show that by assigning the reciprocal of the occurrence frequency as the weight for order/disorder label, the trained model reaches the best performance compared to an equally weighted label ratio. Specifically, as shown in Table 3, the 1:1 weight ratio obtains a mean AUC value of 0.884, while the reciprocal of the occurrence frequency of the order/disorder label (0.7:9.3 weight ratio) obtains 0.901. However, there are no significant differences between weight ratios 1:9, 0.5:9.5, and 0.7:9.3. These results may imply that by assigning the weight ratio around the closer range of the reciprocal ratio there will not be a large difference in the performance. Table 3. AUC values of different combinations of weight ratio between order and disorder states on 10 cross validation batch datasets of Disorder723.

Contributions of Different Combinations of Feature Classes
As mentioned in the previous section, the features used in the training process consist of three classes: evolution related, structure related, and amino acid related, respectively. In order to estimate the impact of each class on order/disorder prediction, we applied each one and several combinations of them to train the model and perform the prediction. Table 4 illustrates the AUC value of 10 cross validation batch datasets of Disorder723 with settings of different combinations of feature classes. Not surprisingly, the most contributing feature class is evolutionary information, which confirms that it is the conservation profile that makes the order/disorder regions different [18]. The second important feature class is the structural information class which is based on predicted secondary structure and solvent accessibility. It is interesting that, although structure related features are actually derived from evolutionary information, the combination of these two classes of features reaches 0.889 mean AUC. This phenomenon was reported in [9], with the possible interpretation that there is a strong tendency for disorder regions to exist in the coiled state [18] and to be exposed to solvent [2]. Finally, using amino acid related feature alone reaches 0.833 mean AUC. The combination of all three classes of features achieves the best AUC value for each cross validation batch dataset.   Table 5 shows the performance of our method DeepCNF-D and the other four methods in terms of AUC value on the 10 cross validation batch datasets of Disorder723. As listed in this table, for eight of the ten batches, our method outperforms all the other methods; for the remaining two batches, our method is comparable to SPINE-D and only inferior to DisoPred3. Note that DisoPred3 is the current state-of-the-art method and is trained on a much larger PDB90 dataset [29]. However, DeepCNF-D achieves the best performance in terms of the mean AUC value at 0.901, which is better than the state-of-the-art value achieved on this Disorder723 dataset by DNdisorder at AUC value 0.899 [9]. Just by using pure amino acid feature shown in Table 4, our DeepCNF model could reach the mean AUC value 0.833, which is higher than IUpred (0.810 and 0.721 for long and short prediction modes, respectively). Note that IUpred is the current best order/disorder predictor based on amino acid sequence only. Table 5. AUC value of several methods on 10 cross validation batch datasets of Disorder723.

Performance on CASP Dataset
Tables 6 and 7 summarize the performances of our method DeepCNF-D and the other four predictors on the publically available CASP9 and CASP10 datasets. In order to perform a fair comparison with DisoPred3 and SPINE-D, which are trained on PDB90 and PDB25 datasets respectively, we re-trained our weighted DeepCNF model using the UniProt90 dataset from MobiDB with removal of the overlap sequences with CASP datasets by a 25% sequence identity filter. It is observed that DeepCNF-D reaches 0.855 AUC value on CASP9 and 0.898 on CASP10, which are higher than DisoPred3, SPINE-D, and IUpred for both long and short prediction modes. Moreover, if only amino acid related features are used, DeepCNF-D (ami_only) achieves 0.7 AUC on CASP9 and 0.772 on CASP10, which are significantly higher than IUpred, though with extremely fast running speed.  We also compare our method with other predictors using precision, balanced accuracy (bacc), and Mattehews correlation coefficient (MCC), respectively. Since these measurements are based on a user-defined threshold for order/disorder classification, we use default value (i.e., 0.5) for the other predictors while choosing 0.2 for our method DeepCNF-D. The reason to choose 0.2 for our method is based on the optimal performance of MCC value on the training data. The results show that DeepCNF-D archives 0.486 MCC value on CASP9 and 0.474 on CASP10, which are higher than the other predictors. If only amino acid related features are used, DeepCNF-D (ami_only) reaches 0.4 MCC on CASP9 and 0.433 on CASP10, which are significantly higher than IUpred.

Conclusions and Discussions
A sequence labeling machine learning model, namely weighted DeepCNF (Deep Convolutional Neural Fields), for protein order/disorder prediction has been presented. This model not only captures long-range sequence information by a deep hierarchical architecture and exploits interdependency between adjacent order/disorder labels, but also assigns different weights for each label during training and prediction to solve the label imbalance issue that was known as a long-standing problem in order/disorder prediction. The source code of our method DeepCNF-D is available at http://ttic.uchicago.edu/~wangsheng/DeepCNF_D_package_v1.00.tar.gz [30]. The overall performance in terms of AUC value of DeepCNF-D reaches 0.855 on CASP9 and 0.898 on CASP10, which are better than the state-of-the-art methods.
It should also be noted that if amino acid related features are used, our method outperforms the best sequence-based method IUpred, while still keeps the same extremely fast running speed. This pure sequence version of our method (called DeepCNF-D ami_only) can be applied to large-scale proteome analysis for disordered regions [31]. From a computational perspective, our weighted DeepCNF model provides a new framework to incorporate long-range sequence information to predict the labels. Such a framework can be directly applied to many sequence labeling issues such as secondary structure [21], solvent accessibility [22], and structural alphabet [32][33][34][35][36].
The observation of the deterioration with three layers compared to two layers may due to over-fitting because the regularization factor is the same for all trained models. If we specifically fine-tune the regularization factor for the three layer model, then the results could be comparable to the two layer model. We also tried different numbers of neurons per hidden layer (from 10 to 100) for the one to three layers models. The improvements in terms of AUC saturated at 50 neurons and did not largely increase after 50. However, instead of using these ad-hoc fine-tune procedures, we could try to use the following general strategies to prevent over-fitting and to further improve the performance of the deep network architecture: (a) a dropout framework [32][33][34][35][36] which takes into consideration the hidden structures of the neurons; (b) a dimension reduction technique [37] which could be used to reduce the total number of neuron weights; (c) a hessian-free strategy [38] which could be applied to accelerate the calculation based on the reduced neuron space.
Further developments for single ab initio order/disorder prediction methods are still possible in the following three approaches: (a) Since the area under the curve (AUC) is a proper measurement for the performance of order/disorder prediction and was applied in the CASP assessment [39], a method directly optimizing the AUC value would have a high chance to increase the overall performance; (b) Besides the structure features used in our method, such as predicted secondary structure and solvent accessibility, further information could be incorporated into order/disorder prediction. As suggested in Ucon [32][33][34][35][36], the predicted residue-residue contact information could contribute to the prediction accuracy; (c) It is shown that the characteristics of terminal and non-terminal disorder regions differ a lot from each other [15], and it is the same case for long and short disorder regions [40]. In the future, maybe these different disorder regions could be assigned different labels and trained by weighted DeepCNF, since our model can handle the label imbalance issue.

Method
In this section we describe the definition of order/disorder states, our proposed prediction methods, as well as the related features for prediction. The first part is the definition for order/disorder regions on a given protein. The second part shows the major contribution of our work, a machine learning model called weighted DeepCNF (Deep Convolutional Neural Fields), for order/disorder region prediction. DeepCNF is a Deep Learning extension of a probabilistic graphical model Conditional Markov Neural Fields (CNF), which can capture medium-or even long-range information in a protein chain by a deep hierarchical architecture, and also model interdependency between adjacent order/disorder labels. In order to solve the label imbalance issue, we assign a relatively larger weight to disorder label, which is sparsely represented in the dataset. Then more cost would be given to errors in these disorder labels during training process to unbias the DeepCNF model. In the third part we introduce the related protein features that could be categorized into classes of amino acid, evolution, and structural properties.

Order/Disorder Definition
We use the same definition of order/disorder states as Cheng et al. [5] and the CASP organization [23]. In particular, segments longer than three residues but lacking atomic coordinates in the crystal structure are labeled "disordered" whereas all other residues are labeled "ordered". Note that besides order/disorder state, there also exist "N" for not available and "X" for atomic coordinates missing in the CASP dataset. We simply remove these "N" residues and mark "X" residues as disordered.

Model Architecture
As shown in Figure 1, DeepCNF consists of two modules: (a) the CRF module consisting of the top layer and the label layer; and (b) the deep convolutional neural network (DCNN) module covering the bottom layer to the top layer. When only one hidden layer is used, this DeepCNF becomes CNF, a probabilistic graphical model described by Peng et al. [15]. Given a protein sequence of length , let = ( , … , ) denote its order/disorder label where is the order/disorder state at residue . Let = ( , … , ) denote the input feature where is a column vector representing the input feature for residue . Using DeepCNF, we calculate the conditional probability of on the input as follows, where , representing order/disorder states, say 0 for order and 1 for disorder. δ() is an indicator function, ( , , ) is a neural network function for the -th neuron at position of the top layer, and , , and are the model parameters to be trained. Specifically, is the parameter for the neural network, is the parameter connecting the top layer to the label layer, and is for label correlation. The section below shows the details of the deep convolutional neural network corresponding to ( , , ). is actually the input feature . Otherwise, is a matrix of dimension × . Let 2 + 1 be the window size at the -th layer. Mathematically, ( ) is defined as follows
(− ≤ ≤ ) is a two-dimension weight matrix for the connections between the neurons of position at layer and the neurons of position + 1 at layer + 1.
is shared by all the positions in the same layer, so it is position-independent. Here and ' index two neurons at the -th and ( + 1)-th layers, respectively.

Weighted DeepCNFs
Traditional algorithms choose an approximation of the accuracy (such as log probability of the occurrence label [19], or empirical label-wise accuracy [19]) as loss function. However, for imbalanced label applications, this criterion is no longer suitable since the minority label would have less impact on accuracy than the majority label, which could result in biased predictions [41]. Unfortunately, the order/disorder prediction is such a task whereas the minority class (i.e., the disorder state) is of prime importance but this class might be completely ignored by traditional classifiers. For example, with an imbalance ratio of 0.93 to 0.07 for order/disorder states, a classifier that classifies everything to be ordered state will be 93% accurate, but it will be completely useless as a suitable classifier.
In order to solve this imbalanced label problem, one idea is to design cost-enabled classifiers that assign different weights for different labels [28]. In this way, more cost can be given to errors in the minority class to make the classifier unbiased. Formally, we redefine Ψ′( , , ) and Φ′( , , ) functions to consider the weight of different labels as follows, Φ′( , , ) = , ( , , ) δ( = ) where is the weight of label . In this work, we found out that the weight for order/disorder states could be set to the reciprocal of their occurrence frequency, i.e., 0.07 for order state and 0.93 for disorder state.

Training Method, Regularization, and Time Complexity
We use the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm [28] as the optimization routine to search for the optimal model parameter that maximizes the object function (for more details, please refer to Supplemental Document). To control model complexity to avoid over-fitting, we add a L2-norm penalty term as the regularization factor. Suppose the number of neurons and the window size at each layer are fixed for all layers as and , then the computational complexity of the gradient computation is ( + + ) for a single input-output pair where the protein length is and the label dimension is 2.

Protein Features
A variety of protein features have been studied by Becker et al. [42] to predict the order/disorder states. These features could be categorized into three classes: amino acid related, evolution related, and structure related features, for each position . Here we use 129 features, described below.

Amino Acid Related Features
Four types of amino acid related features are considered, including: (a) amino acid indicator vector; (b) physico-chemical property; (c) specific propensity of being endpoints of a secondary structure segment; and (d) correlated contact potential. Specifically, the amino acid indicator vector is a 20 dimensional vector with value 1 for the specific amino acid occurring at position while the other 19 values are 0; physico-chemical property has 7 values for each amino acid (shown in Table 1 from Meiler et al. [18]); specific propensity of being endpoints of an SS segment has 11 values for each amino acid (shown in Table 1 from Duan et al. [43]); correlated contact potential has 40 values for each amino acid (shown in Table 3 from Tan et al. [44]). All these features have been studied in Wang et al. [45] for secondary structure elements prediction, in Ma et al. [21] for solvent accessibility prediction, and in [22] for homology detection. Overall, at each position , there are 78 = 20 + 7 + 11 + 40 amino acid related features.

Evolution Related Features
The order/disorder state of a residue has a strong relationship with the residue's substitution and evolution [46][47][48]. Residues in ordered state and disordered state were shown to have different substitution patterns due to different selection pressures. Evolutionary information such as PSSM (position specific scoring matrix) generated by PSI-BLAST [2] have been widely accepted to efficiently enhance prediction performance [49]. Here we use an additional evolution information from the HHM file generated by HHpred [18]. Overall, for each residue, there are 40 = 20 + 20 evolution related features.

Structure Related Features
Local structural features, such as secondary structure and/or solvent accessibility, are also very useful for predicting order/disorder state, as indicated in [50]. Here we use the predicted eight-state secondary structure elements (SSE) [18] and three-state solvent accessibility (ACC) [21] probability as structure related features for each residue position. Overall, for each residue, there are 11 = 8 + 3 structure related features.