The Prediction of Intrinsically Disordered Proteins Based on Feature Selection

Intrinsically disordered proteins perform a variety of important biological functions, which makes their accurate prediction useful for a wide range of applications. We develop a scheme for predicting intrinsically disordered proteins that employs 35 features: eight structural properties, seven physicochemical properties and 20 pieces of evolutionary information. In particular, the scheme includes a preprocessing procedure which greatly reduces the number of input features. Using two different windows, the preprocessed data, which contain not only the properties of the surroundings of the target residue but also the properties related to the specific target residue, are fed into a multi-layer perceptron neural network as its inputs. The Adam algorithm for the back propagation and the dropout algorithm for avoiding overfitting are used during the training process. Both the training and the testing of our procedure are performed on the dataset DIS803 from the DisProt database. The simulation results show that the performance of our scheme is competitive in comparison with ESpritz and IsUnstruct.


Introduction
Intrinsically disordered proteins (IDPs) have at least one region lacking a unique 3D structure [1]. They exist as conformational ensembles without equilibrium values for their atomic positions and bond angles [2]. Their flexibility and structural instability are encoded by their amino acid sequences [3]. They play a crucial role in a variety of important biological functions [4]. It has been confirmed that some of these IDPs are related to many important regulatory functions in the cell [5], which have a great impact on diseases such as Parkinson's disease, Alzheimer's disease and certain types of cancer [6]. Thus, accurately identifying IDPs is vital to obtaining more effective drug designs, better protein expression and functional annotations. However, it is often difficult to purify and crystallize disordered protein regions [7], which makes the identification of IDPs with experimental approaches both expensive and time-consuming [6]. Thus, it is essential to predict IDPs through computational approaches.
Over the past few decades, many computational schemes for predicting IDPs have been proposed. These schemes can be roughly classified into three categories. (1) Schemes that employ amino acid physicochemical properties and propensity scales to predict IDPs, such as FoldIndex [8], GlobPlot [9], IUPred [10], FoldUnfold [11] and IsUnstruct [12]. These schemes are variously based on the charge/hydropathy scale, the relative propensity of an amino acid residue, the inter-residue contacts, and statistical physics. In particular, IsUnstruct, which is based on an approximation of the Ising model that replaces the interaction term between neighbors with a penalty for changing between ordered and disordered states, performs well in prediction, so it is selected as one of the comparison schemes in this paper. (2) Schemes that exploit machine learning techniques for the prediction of IDPs. In this category, the prediction of IDPs is treated as a binary classification problem and solved with machine learning algorithms such as artificial neural networks and support vector machines. Examples of this category include PONDRs [13], RONN [14], DISOPRED3 [15], BVDEA [6], DisPSSMP [16], SPINE-D [17] and ESpritz [18]. In this paper, we choose ESpritz, which is based on a bidirectional recursive neural network and also performs well in prediction, as the other comparison scheme. (3) Meta-approaches, such as MFDp [19], MetaPrDOS [20] and the Meta-Disorder predictor [21], which combine several different predictors and trade off among them to yield an optimal prediction of IDPs.
In this paper, we develop a scheme for predicting intrinsically disordered proteins which requires computing three types of features: eight structural properties, seven physicochemical properties and 20 pieces of evolutionary information. Without sacrificing the prediction accuracy, we use a preprocessing procedure which greatly reduces the number of input features and hence simplifies as well as speeds up the identification of IDPs. Furthermore, we preprocess each residue with two different windows, which highlight the properties of the target residue and of its surroundings, respectively. Thus, there are just 70 features for each residue. This number is far smaller than that of most other methods [16,17]. For training our scheme, we introduce the Adam algorithm [22] for the back propagation together with the dropout algorithm [23] to avoid overfitting in the multi-layer perceptron (MLP) neural network. The dataset DIS803 from the latest DisProt [24] database is used for training as well as testing our procedure. The dataset R80 collected by Yang et al. in [14] is used as the blind test set. Using the same input features, we compare the prediction performance of the MLP with that of the support vector machine (SVM) [25] and the Fisher discriminant [26]. The results indicate that the MLP is more appropriate for training our scheme. Finally, based on the same test sets, we compare our scheme with two competitive prediction schemes, ESpritz and IsUnstruct. The simulation results show that our scheme is more accurate.

Materials and Methods
In this section, we propose a procedure that preprocesses three types of features with two different windows. The preprocessed features contain not only the properties of the surroundings of the target residue, captured by the long window, but also the properties related to the specific target residue, captured by the short window. Then, for training our scheme, we employ the Adam algorithm [22] for the back propagation together with the dropout algorithm [23] to avoid overfitting.

Feature Selection
We start with a procedure for preprocessing the features, which comprise structural and physicochemical properties as well as evolutionary information of the protein sequences. Through preprocessing, and without sacrificing the prediction accuracy, our scheme relies on only 70 features as its inputs for the identification of IDPs, which is far fewer than most other methods need [16,17].
The selection of the structural properties including the region complexity and residue location is due to the fact that the disordered region tends to have less complexity than the ordered region [14].
To further extract the complexity information of the regions, we also add the topological entropy of a protein sequence as a novel feature for identifying IDPs. Thus, the topological entropy and the Shannon entropy, together with three amino acid propensities from GlobPlot [9,27], are used to describe the complexity of a region around the target residue. The residue location information [28] can be obtained by computing the amino acid type, the ratio of its position in the protein sequence and the terminal position [17].
Since IDPs and folded proteins exhibit distinct physicochemical properties with high probability [1], we use seven physicochemical properties introduced by Jens et al. [29] for our identification. They are the steric parameter, polarizability, volume, hydrophobicity, isoelectric point, helix probability and sheet probability. Finally, we also compute the Position Specific Scoring Matrix (PSSM) [16,30] by executing three iterations of PSI-BLAST (Position-Specific Iterative Basic Local Alignment Search Tool) against the NCBI (National Center for Biotechnology Information) [31] non-redundant database with default parameters.
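As a rough illustration of the region-complexity idea, the sketch below computes a Shannon entropy from the residue composition of a window. The exact definition used in the scheme is the one given in [27]; the composition-based form, the base-2 logarithm and the function name `shannon_entropy` are assumptions made only for this example.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a window.

    Assumption: the entropy is taken over single-residue frequencies;
    the scheme itself follows the definition in its reference [27].
    """
    n = len(window)
    counts = Counter(window)
    # Sum of -p * log2(p) over the residue types present in the window.
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


# A repetitive (low-complexity) window scores lower than a varied one.
low = shannon_entropy("SSSSSSSSGSS")
high = shannon_entropy("MKVLAETGHRD")
```

Under this assumed definition, a window made of a single residue type has entropy 0, and a window of 11 distinct residues reaches the maximum of log2(11) bits.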
The preprocessing procedure proceeds as follows. Considering a protein sequence $w$ of length $L$, we choose a sliding window of odd length $N$ ($N < L$). For simplicity, we first transform $w$ into a sequence of length $L + N - 1$ by appending $N_0 = (N - 1)/2$ zeros to both ends of $w$. Using the sliding window of length $N$, we obtain $N$ consecutive residues $w_j = \{w(j), \ldots, w(j + N - 1)\}$ with $1 \le j \le L$. For each $w_j$ with $1 \le j \le L$, we obtain

$$v_j = [m_1(w_j), m_2(w_j), \ldots, m_{32}(w_j)]^T \qquad (1)$$

where $m_k(w_j)$ with $1 \le k \le 32$ denotes the value of the $k$-th calculated feature of $w_j$. Without loss of generality, we assume that the blocks $m_k(w_j)$ with $1 \le k \le 5$, $6 \le k \le 12$ and $13 \le k \le 32$ correspond to the complexity features, the physicochemical features and the evolutionary information, respectively. Thus, each residue in the window $w_j$ is associated with the vector $v_j$. Sliding the window one residue to the right, we can compute $v_{j+1}$ and associate it with every residue in the window $w_{j+1}$. This procedure is repeated for $1 \le j \le L$. Finally, every residue in the protein sequence is associated with multiple $v_j$. For each residue, we assign the average value of all the $v_j$ associated with it as the feature vector of this specific residue. Let $x_j^{pre}$ ($1 \le j \le L$) denote this feature vector, which can be computed as

$$x_j^{pre} = \frac{1}{b_j - a_j + 1} \sum_{i = a_j}^{b_j} v_i, \qquad a_j = \max(1, j - N_0), \quad b_j = \min(L, j + N_0) \qquad (2)$$

where $v_i$ for $1 \le i \le L$ is defined in Equation (1).
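The window-averaging step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `window_average_features` and the callback `featurize` (standing in for the 32 features of one window) are hypothetical, and indices are 0-based.

```python
from typing import Callable, List, Sequence


def window_average_features(
    seq: Sequence,
    window: int,
    featurize: Callable[[Sequence], List[float]],
    pad=0,
) -> List[List[float]]:
    """Average, for each residue, the feature vectors of all windows covering it.

    `featurize` is a hypothetical stand-in for the per-window features;
    `pad` is the zero symbol appended to both ends of the sequence.
    """
    length = len(seq)
    if window % 2 == 0 or window >= length:
        raise ValueError("window must be odd and shorter than the sequence")
    half = (window - 1) // 2
    padded = [pad] * half + list(seq) + [pad] * half
    # v[j]: feature vector of the window starting at padded index j.
    v = [featurize(padded[j:j + window]) for j in range(length)]
    dim = len(v[0])
    result = []
    for r in range(length):
        # Residue r sits at padded index r + half, so it is covered by the
        # windows r - half .. r + half, clipped to the valid range.
        lo, hi = max(0, r - half), min(length - 1, r + half)
        n = hi - lo + 1
        result.append([sum(v[i][k] for i in range(lo, hi + 1)) / n
                       for k in range(dim)])
    return result


# Toy usage: one numeric "property" per residue, one feature (the window mean).
toy = window_average_features([1.0, 2.0, 3.0, 4.0, 5.0], 3,
                              lambda w: [sum(w) / len(w)])
```

Running the same function once with a short window and once with a long window, and concatenating the two results per residue, gives the two-window input described in the text.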
In Equation (1), the computation of the Shannon entropy and of the topological entropy can be found in [27]. For an amino acid property, $m_k(w_j)$ represents the average value of the mapped $w_j$. In addition, the $L \times 20$ matrix $PSSM_w$ is used to calculate the evolutionary features $m_k(w_j)$ with $13 \le k \le 32$ (Equation (3)). Finally, the $35 \times 1$ feature vector $x_j$ ($1 \le j \le L$) for each residue with one sliding window is

$$x_j = \begin{bmatrix} x_j^{pre} \\ x_j^{loc} \end{bmatrix} \qquad (4)$$

with $x_j^{loc}$ being the three pieces of location information.

Designing and Training the MLP Neural Network
Our multi-layer perceptron (MLP) neural network, which simultaneously takes two $35 \times 1$ feature vectors $x_j$ as its inputs, is comprised of two hidden layers, each containing 50 perceptrons and one bias. Of these two $35 \times 1$ input vectors, one is obtained from a short window and the other from a long window. Define $X = [x_1\ x_2\ \ldots\ x_{N_s}]$, where each $70 \times 1$ vector $x_j$ ($1 \le j \le N_s$) stacks the two vectors computed from Equation (4) and $N_s$ is the total number of samples. Let $Y$ be the $2 \times N_s$ output matrix whose columns are equal to either $[1, 0]^T$ or $[0, 1]^T$, corresponding to a disordered or an ordered residue, respectively. The training process proceeds as follows:
1. Input $X$ and compute the forward propagation through Equation (5), where $g^{[l]}$ is the activation function in the $l$-th layer ($l = 1, 2, 3$). In our MLP neural network, both $g^{[1]}$ and $g^{[2]}$ are ReLU functions and $g^{[3]}$ is the softmax function. $R(p_d)$ represents the dropout vector, which obeys the Bernoulli distribution with probability $p_d$. The symbols $*$ and $\bullet$ represent the scalar and the matrix multiplication, respectively. Furthermore, $A^{[0]}$ is equal to $X$ while $A^{[3]}$ is the prediction result $\hat{Y}$. The structure of the MLP network is shown in Figure 1.
2. After obtaining $\hat{Y}$, compute the cost function, which is the cross entropy. If the cost function has converged or the maximum number of iterations has been reached, then stop and output $\hat{Y}$.
3. Employ the Adam algorithm [22] to perform the back propagation. Update $W$ and $b$ and repeat from step 1.
The detailed paradigm of our scheme is shown in Figure 2. Based on the feature matrix $X$, we use 10-fold cross validation to train MLP networks and select the best one according to their prediction results. Then, to reduce the impact of the initial values, five different random initializations are used to train five independent MLP networks. Finally, the prediction results are obtained by averaging the outputs of these five trained MLP networks.
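The forward pass and the final averaging over five networks can be sketched with NumPy as follows. This is only an illustration under assumptions that the text does not fix: dropout is applied to the two hidden activations with keep probability `p_keep` (playing the role of $p_d$) and rescaled by `1 / p_keep`, the weights use a He-style random initialization, and all function names are hypothetical. The Adam update of step 3 is omitted.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def softmax(z):
    # Column-wise softmax; subtracting the max keeps exp() numerically stable.
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)


def init_params(rng, sizes=(70, 50, 50, 2)):
    """Random (W, b) pairs for a 70-50-50-2 network (initialization assumed)."""
    return [(rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in)),
             np.zeros((n_out, 1)))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def forward(X, params, p_keep=0.5, rng=None):
    """Forward pass; dropout is active only when an rng is supplied (training)."""
    A = X
    last = len(params) - 1
    for l, (W, b) in enumerate(params):
        Z = W @ A + b
        if l == last:
            A = softmax(Z)
        else:
            A = relu(Z)
            if rng is not None:
                # Bernoulli mask: each unit is kept with probability p_keep.
                mask = rng.random(A.shape) < p_keep
                A = A * mask / p_keep  # rescaling is an assumption
    return A


def cross_entropy(Y_hat, Y):
    """Mean cross entropy between one-hot columns Y and predictions Y_hat."""
    return float(-np.mean(np.sum(Y * np.log(Y_hat + 1e-12), axis=0)))


def ensemble_predict(X, param_sets):
    """Average the outputs of independently initialized networks."""
    return np.mean([forward(X, p) for p in param_sets], axis=0)


# Toy usage: five untrained networks on eight random 70-dimensional samples.
X_demo = np.random.default_rng(0).normal(size=(70, 8))
nets = [init_params(np.random.default_rng(seed)) for seed in range(5)]
Y_demo = ensemble_predict(X_demo, nets)
```

With `p_keep = 1` the mask keeps every unit, which corresponds to the no-dropout setting $p_d = 1$ examined later in the paper.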

Datasets for Training and Testing
The dataset DIS803 from the latest version of DisProt [24] is used in our training and testing. DIS803 is comprised of 803 protein sequences with 1254 disordered regions and 1343 ordered regions, which correspond to 92,418 disordered and 315,856 ordered residues, respectively. We randomly split this dataset into two subsets with a ratio of 9:1. The larger one, containing 721 protein sequences, is used as the training set with 10-fold cross validation. The training set contains 85,184 disordered and 289,983 ordered residues. The test set has 82 protein sequences, which correspond to 7234 disordered and 25,873 ordered residues. In addition, we also use the dataset R80 as the blind test set. Dataset R80, collected by Yang et al. in [14], is comprised of 80 protein sequences with 183 disordered regions and 151 ordered regions.

Performance Evaluation
The performance of our scheme is evaluated by four metrics [32]: sensitivity (Sens), specificity (Spec), the weighted score (Sw) and the Matthews correlation coefficient (MCC). We use TP, TN, FN and FP to denote the numbers of true positives, true negatives, false negatives and false positives, where positive is disorder and negative is order. The MCC and Sw can be computed as follows:

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

$$Sw = Sens + Spec - 1$$

where Sens = TP/(TP + FN) and Spec = TN/(TN + FP).
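The four metrics can be computed directly from the confusion counts. The sketch below assumes the weighted score takes the form Sw = Sens + Spec - 1, which is the usual definition in disorder-prediction assessments, and uses the standard MCC formula; the function name `disorder_metrics` is hypothetical.

```python
import math


def disorder_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Sens, Spec, Sw and MCC from confusion counts (positive = disorder)."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    sw = sens + spec - 1.0  # assumed form of the weighted score
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report MCC as 0 when a row or column of the table is empty.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sens": sens, "Spec": spec, "Sw": sw, "MCC": mcc}


# Toy counts chosen only for illustration, not taken from the paper's tables.
scores = disorder_metrics(tp=50, tn=40, fp=10, fn=0)
```

Because ordered residues greatly outnumber disordered ones in DIS803, Sw and MCC are more informative than raw accuracy: a predictor that labels every residue as ordered reaches Spec = 1 but Sw = 0.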

Results and Discussion
The simulation results show that our scheme yields better weighted scores in comparison with ESpritz and IsUnstruct based on the test set from DisProt [24] and the blind test set R80.

Impact of the Sliding Window Sizes
Since the input of our scheme is comprised of the information obtained from a short window as well as a long window, we vary the window sizes to study their impact on the training set of our scheme. The MLP neural network contains two hidden layers, each containing 50 perceptrons and one bias. We set the dropout parameter $p_d$ to 0.5 and the learning rate to 0.0001.
Fixing the size of the short window at 11, Table 1 shows the predictive performance of various long windows based on the training set with 10-fold cross validation.
When the size of the long window is larger than 55, the curves of the two parameters Sw and MCC tend to be stable, as shown in Figure 3. Thus, we set the size to 55. Next, fixing the size of the long window at 55, we change the size of the short window. The predictive performance is shown in Table 2 as well as Figure 4. These results show that our choice of window size 11 is acceptable.

Impact of the MLP Parameters
We set the sizes of the short and long window respectively to be 11 and 55.We change the parameters of the MLP neural network to investigate the influence of these parameters on the training set with 10-fold cross validation.
Firstly, we change the number of perceptrons in the hidden layers. We use $N_{neur}$ to denote the number of perceptrons in each hidden layer. Table 3 shows the performance with different $N_{neur}$. Table 4 shows the results for $p_d$ = 0.5, 0.7 and 1. In Table 4, as the value of $p_d$ increases, which means that the degree of dropout decreases, Sw shows a downward trend. Taking the values of the four evaluation metrics into consideration, we select $p_d$ = 0.5.

Impact of the Preprocessing
After determining the parameters of the MLP, we analyze the impact of the preprocessing procedure. If we directly take the features in the sliding window as the input of the target residue without preprocessing, the input features will contain a lot of redundant information. Preprocessing is expected to improve the predictive performance by reducing the dimension of the features and improving their effectiveness.
Without preprocessing, the feature dimensions are 335 and 1655 for the windows of 11 and 55, respectively. Thus, the feature dimension of the input is 1990 for each residue. After preprocessing, the feature dimension of the input is greatly reduced to 70, but the input still contains the main information of both the target residue and its surroundings. Table 5 shows the performance comparison of the schemes with and without preprocessing based on the training set with 10-fold cross validation. These results indicate that the scheme with preprocessing outperforms the scheme without preprocessing.

Impact of the Evolutionary Information
In this paper, we extract evolutionary information from the Position Specific Scoring Matrix (PSSM), which can improve the predictive accuracy but also increases the execution time. Both the accuracy and the speed of the prediction are important. Table 6 displays the performance comparison of our scheme excluding or including the PSSM based on the training set with 10-fold cross validation. Moreover, in order to compare the execution time, we select a sequence (DP01011) and measure the execution time in each situation, as shown in Table 6. Although using the PSSM increases the execution time, it significantly improves the accuracy of prediction in our scheme. Thus, we still include the PSSM information in the input features.

Comparing with Other Classification Algorithms
Based on the same input, we compare the trained multi-layer perceptron (MLP) with two other classification algorithms, the support vector machine (SVM) and the Fisher discriminant. The principle of the SVM is to find a hyperplane that maximizes the margin between two categories of samples [25]. This can be transformed into a quadratic programming problem. We use an SVM with a Gaussian radial basis function (RBF) kernel for comparison. The Fisher discriminant is a linear classification algorithm. It maximizes the generalized Rayleigh quotient to find the optimal projection direction [26]. Table 7 shows the predictive results of the MLP, the SVM and the Fisher discriminant, based on the training set with 10-fold cross validation. In Table 7, the MLP obtains better Sw and MCC. Therefore, based on the same preprocessed features as input, the MLP is more appropriate for training our scheme.
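To make the Fisher criterion concrete, the sketch below computes the projection direction that maximizes the generalized Rayleigh quotient for two classes, using the standard closed form in which the direction is proportional to the inverse within-class scatter matrix applied to the difference of the class means. The function names and the synthetic data are illustrative assumptions, and the within-class scatter matrix is assumed to be invertible.

```python
import numpy as np


def scatter_within(X0, X1):
    """Within-class scatter matrix of two classes (rows are samples)."""
    d0 = X0 - X0.mean(axis=0)
    d1 = X1 - X1.mean(axis=0)
    return d0.T @ d0 + d1.T @ d1


def fisher_direction(X0, X1):
    """Unit vector maximizing the generalized Rayleigh quotient.

    Closed form: w is proportional to S_within^{-1} (mu1 - mu0).
    """
    diff = X1.mean(axis=0) - X0.mean(axis=0)
    w = np.linalg.solve(scatter_within(X0, X1), diff)
    return w / np.linalg.norm(w)


def rayleigh_quotient(w, X0, X1):
    """Between-class over within-class scatter of the data projected on w."""
    diff = X1.mean(axis=0) - X0.mean(axis=0)
    between = float(w @ diff) ** 2
    within = float(w @ scatter_within(X0, X1) @ w)
    return between / within


# Toy usage on synthetic 3-dimensional data (not the paper's features).
rng = np.random.default_rng(0)
ordered = rng.normal(size=(200, 3))
disordered = rng.normal(size=(200, 3)) + np.array([2.0, 0.5, 0.0])
w_star = fisher_direction(ordered, disordered)
threshold = 0.5 * float(w_star @ (ordered.mean(axis=0) + disordered.mean(axis=0)))
```

A residue with feature vector `x` would then be labeled disordered when `w_star @ x` exceeds `threshold`; the midpoint threshold is one simple choice among several.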

Testing Performances
In this section, using the same test sets, we compare our scheme, named DISpre, with two competitive schemes, ESpritz and IsUnstruct. Table 8 shows the predictive performance of these three schemes on the test set from the latest version of DisProt [24]. The results of ESpritz and IsUnstruct are obtained from their online predictors. In Table 8, our scheme DISpre obtains the best Sens and ESpritz obtains the best Spec. However, DISpre also yields better Sw and MCC than ESpritz and IsUnstruct on the test set. In addition, we compare these three schemes on the blind test set R80, as shown in Table 9. On this dataset, our scheme still obtains better Sens and Sw.

Conclusions
In this paper, we present a new scheme for predicting intrinsically disordered proteins. Our scheme uses three types of features: eight structural properties, seven physicochemical properties and 20 pieces of evolutionary information. Furthermore, we employ a preprocessing procedure which greatly reduces the number of input features. With two different windows, the preprocessed features contain not only the properties of the surroundings of the target residue, captured by the long window, but also the properties related to the specific target residue, captured by the short window. We also use the Adam algorithm for the back propagation together with the dropout algorithm to avoid overfitting in training our scheme. Compared with the SVM and the Fisher discriminant, the predictive results indicate that the MLP is more appropriate for training our scheme. Finally, based on the same test sets, the simulation results show that our scheme obtains better Sw than the two competitive schemes ESpritz and IsUnstruct.

Figure 1. The structure of the MLP network.

Figure 2. The detailed paradigm of our scheme.

Figure 3. Performance with different long windows.

Figure 4. Performance with different short windows.

Table 1. Performance with different long windows.

Table 2. Performance with different short windows.

Table 3. Performance with different $N_{neur}$.

Table 4. Performance with different $p_d$.

Table 5. Performance comparison of the schemes with and without preprocessing.

Table 6. Performance comparison of the schemes excluding and including PSSM.

Table 7. Performance comparison with SVM and Fisher discriminant.

Table 8. Performance comparison with existing schemes on the test set from DisProt.

Table 9. Performance comparison with existing schemes on the R80 set.