IDP–CRF: Intrinsically Disordered Protein/Region Identification Based on Conditional Random Fields

Accurate prediction of intrinsically disordered proteins/regions is one of the most important tasks in bioinformatics, and some computational predictors have been proposed to solve this problem. How to efficiently incorporate the sequence-order effect is critical for constructing an accurate predictor because disordered region distributions show global sequence patterns. In order to capture these sequence patterns, several sequence labelling models have been applied to this field, such as conditional random fields (CRFs). However, these methods suffer from certain disadvantages. In this study, we proposed a new computational predictor called IDP–CRF, which is trained on an updated benchmark dataset based on the MobiDB database and the DisProt database, and incorporates more comprehensive sequence-based features, including PSSMs (position-specific scoring matrices), kmer, predicted secondary structures, and relative solvent accessibilities. Experimental results on the benchmark dataset and two independent datasets show that IDP–CRF outperforms 25 existing state-of-the-art methods in this field, demonstrating that IDP–CRF is a very useful tool for identifying IDPs/IDRs (intrinsically disordered proteins/regions). We anticipate that IDP–CRF will facilitate the development of protein sequence analysis.


Introduction
Intrinsically disordered proteins/regions (IDPs/IDRs) refer to the proteins/regions without a stable three-dimensional structure in their native state [1]. IDPs/IDRs are widely distributed in nature, and are correlated with many biological functions [2,3] and a broad range of human diseases, such as genetic diseases [4], cancer [3] and neurodegenerative diseases [5,6]. Therefore, accurately identifying IDPs/IDRs is crucial for understanding the mechanism of biological functions and exploring the relationship between IDPs/IDRs and diseases.
There are several databases containing experimentally determined IDPs/IDRs. For example, PDB [7] contains a large number of IDPs/IDRs annotated by X-ray crystallography (X-ray), and these IDPs/IDRs are organized by the MobiDB database [8,9]. DisProt [2] archives experimentally certified IDPs/IDRs by different techniques, such as X-ray crystallography, nuclear magnetic resonance (NMR) and circular dichroism (CD) spectroscopy. However, identifying IDPs/IDRs by using experimental methods is time consuming and expensive. Therefore, fast and efficient computational methods are urgently needed.
Existing computational predictors can be divided into four categories according to different strategies [1]: (1) physicochemical-based methods that directly utilize the physical principles to discriminate IDPs/IDRs [10,11]; (2) machine learning-based methods that are constructed based on machine learning algorithms, including classification models [12] and sequence labeling models [13,14]; (3) template-based methods that search for homologous proteins with known structures; (4) meta-methods that integrate various predictors into one prediction model [15]. For more information of these methods, please refer to the recent review paper [1].
Among machine learning-based methods, classification models, unlike sequence labeling models, treat each amino acid residue as a separate sample and ignore the interdependency between labels of sequence-adjacent residues [16,17]. However, sequence-adjacent residues may have similar characteristics in forming IDPs/IDRs [18], and disordered residues tend to be neighbors in the sequence of a protein. In order to incorporate this information, several sequence labeling methods have been proposed. For example, OnD-CRF [14] is based on conditional random fields (CRFs) [19], and SPOT-disorder [13] is based on a bidirectional long short-term memory (BLSTM) model incorporating long-range interactions between amino acid residues. Both methods have made important contributions to the development of this field. However, they have several shortcomings: (1) inaccurate representation of proteins. OnD-CRF is based on only a few sequence-based features, which fail to capture the characteristics of disordered regions; (2) high computational cost. The computational cost of the SPOT-disorder model is high, which prevents its application to large-scale datasets; (3) outdated training data. Both methods were trained and tested on small benchmark datasets. As a result, their generalization ability and performance are limited, and an updated benchmark dataset is needed.
In order to overcome these disadvantages, in this study we propose a predictor called IDP-CRF, which combines CRFs with various sequence-based features [20,21], including PSSMs (position-specific scoring matrices), kmer, predicted secondary structure and relative solvent accessibility, to further improve the predictive performance. Furthermore, IDP-CRF is trained on a comprehensive and updated benchmark dataset constructed based on the MobiDB database [8,9,22] and DisProt_v7.0, the latest version of the DisProt database [2,23]. Tested on two widely used independent datasets, IDP-CRF achieves better or at least comparable predictive performance compared with 25 existing state-of-the-art methods in this field. IDP-CRF would be a useful tool for protein sequence analysis.

The Influence of Different Ratios of Positive and Negative Samples on the Performance of Various Predictors
In a training dataset, an imbalance between the numbers of ordered and disordered residues can affect the performance of computational predictors [24,25]. Therefore, we analyze the effect of different ratios of positive and negative samples on the performance of IDP-CRF. For comparison purposes, three classification-based predictors are constructed as well, which are based on support vector machine (SVM), artificial neural network (ANN) and random forest (RF) models. A series of training datasets are constructed by randomly removing different numbers of ordered residues. Figure 1 shows the Matthews correlation coefficient (MCC) curves, obtained by five-fold cross-validation, of IDP-CRF and these three classification-based predictors at different ratios of disordered residues to ordered residues in training. From Figure 1, we can see that the predictors achieve their best performance when the ratio of positive to negative samples is around 1:2, and that IDP-CRF outperforms the other predictors. The reason is that IDP-CRF can capture the interdependency between labels of sequence-adjacent residues, and therefore, the global sequence patterns of disordered regions can be incorporated into IDP-CRF.
Figure 1. The performance of IDP-CRF (intrinsically disordered protein-conditional random field) and three classification-based predictors trained with different ratios of disordered residues and ordered residues. These three classification-based predictors include an RF (random forest) predictor, an ANN (artificial neural network) predictor and an SVM (support vector machine) predictor. MCC represents the Matthews correlation coefficient.
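The construction of training sets with a chosen ratio of disordered to ordered residues can be illustrated with a minimal sketch. It assumes residue labels are given as a flat list of "D" (disordered, positive) and "O" (ordered, negative); the function name `undersample_ordered`, the `neg_per_pos` parameter and the fixed seed are hypothetical, and the text does not state whether residues were removed per protein or across the whole dataset.

```python
import random


def undersample_ordered(labels, neg_per_pos, seed=0):
    """Keep every disordered residue and a random subset of ordered ones.

    labels      : list of "D" / "O" labels, one per residue (assumed layout)
    neg_per_pos : desired number of ordered residues per disordered residue
    Returns the sorted indices of the residues that are kept.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == "D"]
    neg = [i for i, y in enumerate(labels) if y == "O"]
    # Never ask for more ordered residues than are available.
    n_keep = min(len(neg), int(round(neg_per_pos * len(pos))))
    kept_neg = rng.sample(neg, n_keep)
    return sorted(pos + kept_neg)


# Toy data: 10 disordered and 90 ordered residues, reduced to a 1:2 ratio.
labels = ["D"] * 10 + ["O"] * 90
kept = undersample_ordered(labels, neg_per_pos=2)
n_dis = sum(labels[i] == "D" for i in kept)
n_ord = len(kept) - n_dis
print(n_dis, n_ord)  # 10 20
```

Repeating this for several values of `neg_per_pos` gives a series of training sets like the one used to draw the curves in Figure 1.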

IDP-CRF (Intrinsically Disordered Protein-Conditional Random Field) Outperforms Classification-Based Predictors
Sequence-adjacent residues may have similar characteristics in the formation of IDPs/IDRs [18]. However, traditional classification-based predictors treat each target residue as an independent sample, ignoring the global sequence patterns of disordered regions. To address this problem, IDP-CRF, proposed in this study, takes the relationship between labels of sequence-adjacent residues into account. The performance of IDP-CRF and several classification-based predictors (cf. Section 3.1) is compared by using five-fold cross-validation, and is shown in Table 1. From Table 1, we can see that IDP-CRF obtains the highest accuracy (ACC). When the positive and negative samples are extremely unbalanced, although ACC favors "greedy" predictions (i.e., predicting more residues as disordered), IDP-CRF obtains the highest sensitivity (Sn) and specificity (Sp), indicating that IDP-CRF can achieve a better trade-off between Sn and Sp automatically. Besides, the highest MCC of IDP-CRF also illustrates that it is an efficient predictor for identifying IDPs/IDRs. This is because IDP-CRF can obtain more information about the global sequence patterns of disordered regions than classification-based predictors.

Several Examples Predicted by IDP-CRF and Three Classification-Based Predictors
In this section, three examples are used to visualize the prediction of the four predictors listed in Table 1, including IDP-CRF, RF, SVM and ANN. These proteins are 3H2YA, 2ODKA and 4AD4A, and their structure information is acquired from the PDB database [7]. To visualize the 3D structures of these proteins, PyMOL [26] software is adopted to generate 3D structures of ordered regions. For those disordered regions, their 3D structure is drawn manually.

The last example is 4AD4A, which contains two IDRs with a total of 31 disordered residues (Figure 4b). From these figures, we can see that within the scope of actual IDRs, disordered residues predicted by IDP-CRF are continuous, while those predicted by the classification-based predictors are discontinuous; and within the scope of ordered regions, the number of false positives (FPs) predicted by IDP-CRF is clearly lower than that predicted by the classification-based predictors.

Comparison with Other Related Predictors
Two widely used independent datasets (MxD494 and SL329) are used to further evaluate the performance of the proposed method and other related predictors. The performance of these predictors is shown in Tables 2 and 3, respectively. From these two tables, we can see that IDP-CRF shows better or at least comparable predictive performance compared with 25 existing state-of-the-art methods in this field. In particular, IDP-CRF outperforms the existing CRF-based predictor OnD-CRF [14] because IDP-CRF adopts more comprehensive sequence-based features to represent proteins. Besides, according to Table 2, IDP-CRF shows comparable performance with the state-of-the-art meta-predictor MFDp [15], and outperforms all the other related methods. According to Table 3, the performance of IDP-CRF is highly comparable with that of SPOT-disorder [13], and IDP-CRF outperforms all the other related methods. The predictive results show that IDP-CRF achieves state-of-the-art performance. The results of OnD-CRF are acquired from its web server.

Benchmark Dataset
As discussed in previous studies [45][46][47][48][49], a reliable benchmark dataset is crucial to the construction of an accurate predictor [50]. In this study, we construct a comprehensive and updated benchmark dataset S based on the MobiDB database [8,9,22] and the DisProt_v7.0 database [2,23]. S can be represented as
$$S = S_1 \cup S_2 \qquad (1)$$
where $S_1$ contains 4590 proteins from the MobiDB database, whose structures are solved by X-ray crystallography, and those proteins/regions with missing electron densities are IDPs/IDRs; and $S_2$ contains 683 proteins from the DisProt database. The proteins in $S_1$ are selected from 24,669 proteins by the following criteria: (a) resolution ≤ 2 Å, (b) length ≥ 30 residues, (c) contains at least one IDR. DisProt_v7.0 includes both confident and ambiguous annotations for IDPs/IDRs. In this study, all the proteins with confident annotations are selected, and then merged with the selected proteins from the MobiDB database. Furthermore, the redundant proteins in the merged dataset are removed by using the Blastclust algorithm [51] with a similarity cutoff of 25%. Finally, 5273 proteins are left and used for five-fold cross-validation. The detailed sequences in the benchmark dataset S are given in Supplementary Materials.
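The three selection criteria for the MobiDB-derived proteins can be written as a simple filter. The sketch below is illustrative only: the record layout (keys `resolution`, `sequence` and `labels`, with labels given as a string of "O"/"D" characters) and the function name `passes_s1_criteria` are assumptions, not the format used by MobiDB or by the authors.

```python
def passes_s1_criteria(entry):
    """Apply criteria (a)-(c) to one candidate protein.

    entry is a dict with hypothetical keys:
      'resolution' : X-ray resolution in angstroms
      'sequence'   : amino acid sequence
      'labels'     : per-residue string, 'D' = disordered, 'O' = ordered
    """
    return (
        entry["resolution"] <= 2.0          # (a) resolution <= 2 angstroms
        and len(entry["sequence"]) >= 30    # (b) length >= 30 residues
        and "D" in entry["labels"]          # (c) at least one IDR
    )


# Toy candidates: only the first one satisfies all three criteria.
candidates = [
    {"id": "toy1", "resolution": 1.8, "sequence": "A" * 40, "labels": "D" * 5 + "O" * 35},
    {"id": "toy2", "resolution": 2.5, "sequence": "A" * 40, "labels": "D" * 5 + "O" * 35},
    {"id": "toy3", "resolution": 1.5, "sequence": "A" * 20, "labels": "D" * 5 + "O" * 15},
    {"id": "toy4", "resolution": 1.9, "sequence": "A" * 50, "labels": "O" * 50},
]
selected = [e["id"] for e in candidates if passes_s1_criteria(e)]
print(selected)  # ['toy1']
```

Redundancy removal with Blastclust is a separate step and is not covered by this sketch.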

Benchmark Independent Datasets
To make a comprehensive comparison with more different methods, two benchmark independent datasets MxD494 [15,40] and SL329 [13,52] are selected as independent test datasets. In order to fairly test our method on these two independent datasets, two training datasets are constructed by removing the overlaps between our constructed benchmark dataset and these two independent test datasets by using the Blastclust algorithm [51] with 25% sequence identity cutoff.

Features
Feature extraction is a key step for constructing a predictor [53][54][55][56][57]. The construction of IDP-CRF is based on transition and state features. In this study, four different state features are used, including PSSMs, kmer, secondary structure and relative solvent accessibility. In addition, all the classification-based predictors shown in this article are based on these four features.

Transition Feature
The transition feature depends on the current position and the previous position of the label sequence. Suppose the label set for residues is $\Phi = \{O, D\}$, where O represents an ordered residue and D represents a disordered residue. The transition feature is defined as [19]:
$$t_{y',y}(y_{i-1}, y_i, x, i) = \begin{cases} 1 & \text{if } y_{i-1} = y' \text{ and } y_i = y \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$
where $y_{i-1}$ and $y_i$ ($y', y \in \Phi$) represent the labels of residues at positions $i - 1$ and $i$ in the protein sequence $x$, respectively.
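As a small illustration of Equation (2), the following sketch builds one indicator function per ordered pair of labels and counts how often each fires along a toy label sequence. The helper name `make_transition_feature` and the toy sequence are hypothetical.

```python
from itertools import product

LABELS = ("O", "D")  # O = ordered, D = disordered


def make_transition_feature(y_prev_target, y_target):
    """Return the indicator t_{y',y}: 1 if labels (i-1, i) equal (y', y), else 0."""
    def t(y_prev, y_cur, x, i):
        # x and i are unused here: this transition feature depends on labels only.
        return 1 if (y_prev == y_prev_target and y_cur == y_target) else 0
    return t


# One feature per label pair: (O,O), (O,D), (D,O), (D,D).
TRANSITIONS = {
    (a, b): make_transition_feature(a, b) for a, b in product(LABELS, repeat=2)
}

x = "MKTAYI"             # toy protein sequence
labels = list("OODDDO")  # toy label sequence of the same length
counts = {
    pair: sum(f(labels[i - 1], labels[i], x, i) for i in range(1, len(labels)))
    for pair, f in TRANSITIONS.items()
}
print(counts)  # {('O', 'O'): 1, ('O', 'D'): 1, ('D', 'O'): 1, ('D', 'D'): 2}
```

With two labels there are four such features, and exactly one of them is active at each position after the first.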

PSSMs (Position-Specific Scoring Matrices)
Due to their ability to capture important evolutionary information, PSSM features are considered among the most important and essential features in a number of previous bioinformatics studies [58][59][60][61][62][63][64][65]. In this study, the PSSMs are obtained by running three iterations of PSI-BLAST [51] against the nrdb90 database [66] with an E-value threshold of 0.001, and the other parameters of PSI-BLAST are set as default. Then, the PSSMs are normalized to [0, 1] following [67]. For each target residue, its PSSM feature is constructed from a window of 11 sequence-adjacent residues centered on the target residue. Therefore, for each residue, the dimension of the PSSM feature is 20 × 11 = 220.
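The windowing step can be sketched as follows. Two choices in this sketch are assumptions rather than details taken from the text: the logistic function $1/(1+e^{-s})$ is used as one common way to map raw PSSM scores into [0, 1], and window positions that fall outside the protein are padded with zeros. The function names are hypothetical.

```python
import math

WINDOW = 11
HALF = WINDOW // 2  # 5 residues on each side of the target


def logistic(score):
    """Map a raw PSSM score into (0, 1); assumed normalization, see lead-in."""
    return 1.0 / (1.0 + math.exp(-score))


def pssm_window_features(pssm, i):
    """Build the 20 x 11 = 220-dimensional PSSM feature for residue i.

    pssm : list of L rows, each row holding 20 raw scores for one residue
    """
    feats = []
    for j in range(i - HALF, i + HALF + 1):
        if 0 <= j < len(pssm):
            feats.extend(logistic(s) for s in pssm[j])
        else:
            feats.extend([0.0] * 20)  # zero padding beyond the termini (assumption)
    return feats


# Toy PSSM: 15 residues, all raw scores 0, so every normalized value is 0.5.
toy_pssm = [[0] * 20 for _ in range(15)]
first = pssm_window_features(toy_pssm, 0)   # 5 padded positions on the left
middle = pssm_window_features(toy_pssm, 7)  # window fully inside the protein
print(len(first), len(middle))  # 220 220
```

For the first residue, the first 100 values come from padding and the remaining 120 from real PSSM rows.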

Kmer
Kmer [68,69] is the most direct representation of a protein sequence, and is defined as the occurrence frequencies of k neighboring amino acids. In this study, for each target residue, the kmer feature (with k set to 1) is calculated within a window of 11 sequence-adjacent residues centered on the target residue. Therefore, for each residue, the dimension of the kmer feature is 20.
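With k = 1, the kmer feature reduces to the amino acid composition of the window. The sketch below assumes the 20 standard amino acids in alphabetical one-letter order and, near the termini, normalizes by the number of residues actually present in the truncated window; both choices, and the function name, are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (assumed ordering)


def kmer1_window_features(seq, i, window=11):
    """Return the 20-dimensional amino acid frequency vector around residue i."""
    half = window // 2
    lo, hi = max(0, i - half), min(len(seq), i + half + 1)
    segment = seq[lo:hi]  # truncated at the termini
    return [segment.count(a) / len(segment) for a in AMINO_ACIDS]


seq = "AAAAAGGGGGK"  # toy sequence of length 11
feats = kmer1_window_features(seq, 5)  # window covers the whole toy sequence
print(len(feats))  # 20
```

In this toy case, A and G each have frequency 5/11 and K has frequency 1/11; all other entries are zero.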

Secondary Structure
Secondary structure features are effective in protein structure prediction [70,71]. The PSIPRED version 4.01 package [72] includes two approaches to predict secondary structure of proteins; one is a profile-based method and the other is a sequence-based method. In this study, the profile-based PSIPRED is adopted to predict secondary structure for each target residue among three types of structures (i.e., helix, beta strand and coil). However, when a protein has no homologous sequences after searching against the nrdb90 database [66], the sequence-based PSIPRED is adopted. For each target residue, the dimension of secondary structure feature is one.

Relative Solvent Accessibility
Previous studies have indicated that incorporating the predicted solvent accessibility information is useful for improving the prediction of protein functional sites [73][74][75][76]. In this study, Sable version 2 package [77,78] is adopted to generate relative solvent accessibility information for each target residue, and the dimension of this feature is one for each target residue. The parameters of Sable are set as: SA_ACTION = SVR, SA_OUT = RELATIVE and other parameters are set as default.

Conditional Random Fields
Conditional random fields (CRFs) were proposed by Lafferty et al. [19], and compose a probabilistic model for labeling sequence data. Due to their advantages, CRFs have been widely applied to solve a number of prediction tasks in the field of bioinformatics and computational biology, including protein-protein interaction prediction [79,80], phosphorylation site prediction [81], transcription factor binding site prediction [82], and protein-RNA residue-based contact prediction [83].
In this study, the identification of IDPs/IDRs is treated as a sequence labeling task solved by CRFs, in which proteins are the observation sequences and each amino acid residue is annotated as disordered or ordered. Given protein sequences represented as X and their label sequences represented as Y, these data are used to train a conditional probability model P(Y|X), which is then used to label unlabeled protein sequences. In general, CRFs employ the simplest first-order chain structure. Therefore, given an unlabeled observation sequence x, the conditional probability of its label sequence y has the following form [19]:
$$P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{i} \sum_{k} \lambda_k \, t_k(y_{i-1}, y_i, x, i) + \sum_{i} \sum_{l} \mu_l \, s_l(y_i, x, i) \right)$$
where $Z(x)$ is a normalization factor, $t_k(y_{i-1}, y_i, x, i)$ is a transition feature function of the observation sequence x and the labels at positions $i - 1$ and $i$, and the transition feature is defined as Equation (2) in this study. $s_l(y_i, x, i)$ is a state feature function of the observation sequence x and the label at position i. In this study, state features include PSSMs, kmer, predicted secondary structures and relative solvent accessibility. The indices k and l run over the different transition and state features, respectively. $\lambda_k$ and $\mu_l$ are the weights of $t_k(y_{i-1}, y_i, x, i)$ and $s_l(y_i, x, i)$, respectively.
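To make the first-order chain model concrete, the following sketch computes the conditional probability of a label sequence for a very short toy chain by brute-force enumeration of all label sequences. It is not how FlexCRF or any practical CRF package computes the normalization factor (they use dynamic programming); the data layout, the toy weights and the function names are assumptions chosen for illustration.

```python
import math
from itertools import product

LABELS = ("O", "D")


def sequence_score(y, state_scores, trans_w):
    """Unnormalized log-score of a label sequence y.

    state_scores[i][label] : weighted sum of state features at position i
    trans_w[(y_prev, y)]   : weight of the transition indicator for that pair
    """
    score = sum(state_scores[i][y[i]] for i in range(len(y)))
    score += sum(trans_w[(y[i - 1], y[i])] for i in range(1, len(y)))
    return score


def conditional_probability(y, state_scores, trans_w):
    """P(y | x) with Z(x) obtained by enumerating all 2^n label sequences."""
    n = len(state_scores)
    z = sum(
        math.exp(sequence_score(cand, state_scores, trans_w))
        for cand in product(LABELS, repeat=n)
    )
    return math.exp(sequence_score(tuple(y), state_scores, trans_w)) / z


# Toy chain of 3 residues with uninformative state scores and transition
# weights that reward staying in the same label.
state_scores = [{"O": 0.0, "D": 0.0} for _ in range(3)]
trans_w = {("O", "O"): 1.0, ("D", "D"): 1.0, ("O", "D"): -1.0, ("D", "O"): -1.0}

p_ddd = conditional_probability("DDD", state_scores, trans_w)
p_dod = conditional_probability("DOD", state_scores, trans_w)
print(p_ddd > p_dod)  # True
```

Because the transition weights favor runs of identical labels, the contiguous prediction DDD receives a higher probability than the fragmented DOD, which mirrors the continuity of predicted disordered regions discussed in the text.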

Implementations
FlexCRF [84] is an implementation of CRF, which was modified to be able to handle real-valued features as described by Li et al. [85]. In this study, the modified FlexCRF is adopted, and the first-order Markov CRF is used. The parameter num_iterations is optimized from 30 to 60 with an increment of 10, and the optimal value is 50. The parameter init_lambda_val is optimized from 0.05 to 0.1 with an increment of 0.05, and the optimal value is 0.05. Scikit-learn [86] version 0.19.1 is used for the implementations of random forest (RF) and artificial neural network (ANN). For the RF predictor, the parameter n_estimators is optimized from 100 to 1000 with an increment of 100, and the optimal value is 500. For the ANN predictor, its structure includes an input layer, a hidden layer, and an output layer. The parameter hidden_layer_sizes is optimized from 20 to 80 with an increment of 10, and the optimal value is 40. In order to handle large-scale datasets, LIBLINEAR [87] is adopted for the implementation of support vector machine (SVM). For the SVM predictor, the parameter c is optimized in the range of $2^i$, where i is an integer and $i \in [-5, 5]$, and the optimal value is $2^{-4}$. The other parameters of each algorithm are set as default.
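The search ranges described above can be written out explicitly, which also makes it easy to check that each reported optimum lies on its grid. The dictionary keys below are hypothetical labels combining the predictor and the parameter name from the text; the sketch only enumerates candidate values and does not call FlexCRF, scikit-learn or LIBLINEAR.

```python
# Candidate values for each tuned parameter, as described in the text.
grids = {
    "crf_num_iterations": list(range(30, 61, 10)),      # 30, 40, 50, 60
    "crf_init_lambda_val": [0.05, 0.10],                 # step 0.05
    "rf_n_estimators": list(range(100, 1001, 100)),      # 100 ... 1000
    "ann_hidden_layer_sizes": list(range(20, 81, 10)),   # 20 ... 80
    "svm_c": [2.0 ** i for i in range(-5, 6)],           # 2^-5 ... 2^5
}

# Optimal values reported in the text.
reported_optimum = {
    "crf_num_iterations": 50,
    "crf_init_lambda_val": 0.05,
    "rf_n_estimators": 500,
    "ann_hidden_layer_sizes": 40,
    "svm_c": 2.0 ** -4,
}

for name, best in reported_optimum.items():
    print(name, len(grids[name]), best in grids[name])
```

Each parameter is tuned over a small one-dimensional grid, so the total number of configurations per predictor stays low.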

Criteria for Performance Evaluation
In this study, sensitivity (Sn) and specificity (Sp) are adopted, which measure the performance on each class in binary prediction. In the datasets of IDPs/IDRs, the positive and negative samples are unbalanced, and the number of ordered residues is far greater than that of disordered residues. Therefore, we choose two further metrics, balanced accuracy (ACC) and the Matthews correlation coefficient (MCC) [88,89], to measure the performance of different methods. These metrics are defined as follows:
$$Sn = \frac{TP}{TP + FN}, \qquad Sp = \frac{TN}{TN + FP}, \qquad ACC = \frac{Sn + Sp}{2}$$
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where TP (true positive) and FP (false positive) represent the numbers of residues correctly and incorrectly predicted as disordered, respectively; TN (true negative) and FN (false negative) represent the numbers of residues correctly and incorrectly predicted as ordered, respectively.
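The four metrics can be computed directly from the confusion-matrix counts. The sketch below uses the standard definitions of sensitivity, specificity, balanced accuracy and the Matthews correlation coefficient; the function name and the toy counts are hypothetical, and returning 0 for MCC when the denominator vanishes is a convention chosen here.

```python
import math


def disorder_metrics(tp, fp, tn, fn):
    """Return Sn, Sp, balanced ACC and MCC from residue-level counts.

    Disordered residues are the positive class, ordered residues the negative.
    Assumes at least one positive and one negative residue in the data.
    """
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (sn + sp) / 2  # balanced accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc}


# Toy counts for an unbalanced dataset: 100 disordered, 1000 ordered residues.
m = disorder_metrics(tp=80, fp=100, tn=900, fn=20)
print(round(m["Sn"], 3), round(m["Sp"], 3), round(m["ACC"], 3))  # 0.8 0.9 0.85
```

Balanced accuracy weights both classes equally, so a predictor that labels every residue as ordered scores only 0.5 on an unbalanced dataset instead of a misleadingly high plain accuracy.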

Conclusions
In this study, we propose a new computational method called IDP-CRF, which combines various sequence-based features with conditional random fields (CRFs) to predict IDPs/IDRs. Furthermore, this predictor is trained on an updated benchmark dataset. Experimental results show that IDP-CRF performs better than, or at least highly comparable to, 25 existing state-of-the-art methods in this field. The good performance of IDP-CRF can be attributed to the following three advantages: (1) IDP-CRF is trained on the comprehensive and updated benchmark dataset constructed in this paper, which is more reliable than the smaller datasets used previously; (2) the use of CRFs enables IDP-CRF to model the relationship between labels of sequence-adjacent residues, and therefore, the global sequence patterns of disordered region distributions are incorporated; (3) IDP-CRF improves on the previous CRF-based predictor by incorporating more comprehensive sequence-based features. In our future studies, we will focus on exploring new machine learning algorithms to further improve the accuracy of prediction of IDPs/IDRs [90][91][92][93].