Identification of Intrinsically Disordered Protein Regions Based on Deep Neural Network-VGG16

Abstract: The accurate identification of intrinsically disordered proteins or protein regions is of great importance, as they are involved in critical biological processes and related to various human diseases. In this paper, we develop a deep neural network that is based on the well-known VGG16. Our deep neural network is trained using 1450 proteins from the dataset DIS1616, and the trained neural network is tested on the remaining 166 proteins. Our trained neural network is also tested on the blind test sets R80 and MXD494 to further demonstrate the performance of our model. The MCC value of our trained deep neural network is 0.5132 on the test set DIS166, 0.5270 on the blind test set R80, and 0.4577 on the blind test set MXD494. All of these MCC values exceed the corresponding values of existing prediction methods.


Introduction
Intrinsically disordered proteins (IDPs) have at least one region that does not have a stable three-dimensional structure [1][2][3], and they are widespread in nature [4]. IDPs are involved in key biological processes, such as the folding of nucleic acids, transcription and translation, protein-protein interactions, protein-RNA interactions, etc. [5][6][7]. Moreover, IDPs are related to many human diseases, such as cancer, Alzheimer's disease, cardiovascular disease, and various genetic diseases [8,9]. Therefore, the accurate identification of IDPs is crucial for the study of protein structure and function and for drug design. In the past few decades, various experimental methods have been proposed to identify IDPs, such as nuclear magnetic resonance (NMR), X-ray crystallography, and circular dichroism (CD) [10,11]. However, it is expensive and time-consuming to identify disordered proteins through experiments, so numerous computational methods have been proposed to predict disordered proteins.
The available computational methods can be broadly classified into three categories: (1) Scoring function-based methods. Methods in this category use a function to predict the propensity for disorder, and they commonly employ the physicochemical properties of the individual amino acids in the protein chain as input. Example methods include IsUnstruct [12], GlobPlot [13], FoldUnfold [14], and IUPred [15]. (2) Machine learning-based methods. In this category, predictive models generated by machine learning algorithms (such as neural networks and support vector machines) are used to calculate disorder tendencies. Typically, these predictive models use the physicochemical properties of amino acids, evolutionary information, and solvent accessibility as inputs. Examples include RFPR-IDP [16], IDP-Seq2Seq [17], SPOT-Disorder [18], SPOT-Disorder2 [19], DISOPRED3 [20], SPINE-D [21], and ESpritz [22]. (3) Meta methods. These include MFDp [23], MetaDisorder [24], and MetaPrDOS [25]. Meta methods combine the results of multiple predictors to obtain the optimal prediction of IDPs.
Because traditional machine learning-based methods do not efficiently capture local patterns between amino-acids, we chose a CNN as our network structure; CNNs are relatively lighter than other neural networks (e.g., recurrent neural networks) and are effective in capturing local patterns [26,27]. In this paper, to capture the patterns of features in more detail, we develop a deep neural network that is comprised of ten identical deep neural structures, each derived from the original VGG16 model [28], together with an MLP network, as shown in Figure 1. Each of these ten identical deep neural architectures is the VGG16 model without its fully-connected (FC) layers. The outputs of these structures are then used as the inputs of our MLP network for classification, where the MLP consists of two hidden layers with 512 perceptrons in each layer and uses cross entropy as its loss function. Subsequently, we randomly subdivide the dataset DIS1616 [29] into two disjoint subsets, with one (DIS1450) containing 1450 proteins for training and the other (DIS166) containing the remaining 166 proteins for testing. Ten-fold cross validation is performed on the training dataset DIS1450. On the test set DIS166 and the blind test sets R80 [30] and MXD494 [31], the performance of our trained deep neural network is compared to two recent prediction methods that claim to have the best performance, i.e., RFPR-IDP and SPOT-Disorder2. The MCC value of our trained deep neural network is 0.5132 on the test set DIS166, 0.5270 on the blind test set R80, and 0.4577 on the blind test set MXD494. All of these MCC values exceed the corresponding values of RFPR-IDP and SPOT-Disorder2.

Materials and Methods
In this section, we utilize the neural structure of VGG16 and develop a deep neural network for identifying IDPs. Our deep neural structure comprises ten VGG16 variants and an MLP network.

Datasets
The first dataset we use throughout this paper is DIS1616 from the DisProt database [29], which contains 182,316 disordered amino-acids and 706,362 ordered amino-acids. We randomly split the set DIS1616 of 1616 proteins into two disjoint subsets, with one (referred to as DIS1450) containing 1450 proteins for training and the other (referred to as DIS166) containing the remaining 166 proteins for testing. Ten-fold cross validation is then performed on the training dataset DIS1450, as sketched below. To further demonstrate the performance of our trained deep neural network, we use the datasets R80 and MXD494 as blind test sets, where R80 consists of 3566 disordered amino-acids and 29,243 ordered amino-acids, and MXD494 consists of 44,087 disordered amino-acids and 152,414 ordered amino-acids.
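For concreteness, the following minimal sketch shows how such a protein-level 1450/166 split and ten-fold cross validation could be set up. The scikit-learn-based implementation and the fixed random seed are our own illustrative assumptions, not details taken from the paper.

import random
from sklearn.model_selection import KFold

# Placeholder protein IDs standing in for the 1616 proteins of DIS1616.
proteins = [f"protein_{i}" for i in range(1616)]

random.seed(42)                                     # assumed seed, for reproducibility
random.shuffle(proteins)
dis1450, dis166 = proteins[:1450], proteins[1450:]  # training / test split

# Ten-fold cross validation over the 1450 training proteins.
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kfold.split(dis1450)):
    train_fold = [dis1450[i] for i in train_idx]
    val_fold = [dis1450[i] for i in val_idx]
    # ... train on train_fold, validate on val_fold ...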

Pre-Processing the Features of Proteins for the Input of Our Model
The input features of our deep neural network can be classified into three types: physicochemical properties, evolutionary information, and amino-acid propensities. The physicochemical parameters [32] we use are the steric parameter, polarizability, volume, hydrophobicity, isoelectric point, and helix and sheet probabilities. The evolutionary information is a 20-dimensional position-specific scoring matrix (PSSM) [33] derived from the Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST) [34]. The three amino-acid propensities are taken from the GlobPlot paper [13].
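As an illustration, a PSSM of this kind is typically produced by running PSI-BLAST against a large sequence database. The sketch below shows one common invocation via Python; the database name, iteration count, and E-value threshold are typical choices we assume here, since the paper does not state them.

# Sketch: generating the ASCII PSSM for one protein with PSI-BLAST (BLAST+).
# Database, iteration count, and E-value are assumed typical settings.
import subprocess

def run_psiblast(fasta_path: str, pssm_path: str, db: str = "nr") -> None:
    subprocess.run(
        ["psiblast",
         "-query", fasta_path,           # single-sequence FASTA file
         "-db", db,                      # protein database, e.g., nr or UniRef90
         "-num_iterations", "3",         # three PSI-BLAST iterations
         "-evalue", "0.001",
         "-out_ascii_pssm", pssm_path],  # 20-column PSSM used as features
        check=True,
    )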
The pre-processing procedure is as follows (see the sketch after this list):

1. Given the amino-acid sequence of a protein of length L, we use a sliding window of size 21 and, for each amino-acid in the sliding window, compute the physicochemical properties, evolutionary information, and amino-acid propensities defined in the above paragraph. These computed feature values are then averaged over the sliding window and used to represent the feature values of the amino-acid at the center of the window. To handle the terminal amino-acids of a protein, we append 10 zeros on the left and right ends of the amino-acid sequence, respectively. Thus, with each amino-acid sequence of a protein we can associate a 30 × L feature matrix whose j-th column is a 30 × 1 vector x_j (1 ≤ j ≤ L); its entry m_{k,j} is the k-th feature value (1 ≤ k ≤ 30: seven physicochemical properties, the twenty entries of a column of the PSSM, and three amino-acid propensities) associated with the j-th amino-acid, averaged over the sliding window of size 21.

2. Because the minimum input size of VGG16 is 32 × 32 × 3, we transform the 30 × L feature matrix into matrices of size 32 × 32, as Figure 2 depicts. That is, for each feature index k (1 ≤ k ≤ 30) and position j (1 ≤ j ≤ L), with m_{k,j} defined in step 1, we introduce an associated 32 × 32 matrix w_{k,j} whose entries are the window of feature values m_{k,j'} around position j, where m_{k,j'} = 0 for j' < 1 or j' > L (zero padding beyond the termini).

3. Thus, for the j-th amino-acid, its thirty input features yield a sequence of 32 × 32 matrices w_{1,j}, w_{2,j}, ..., w_{30,j}. At each step, we choose three consecutive matrices from this sequence as the input of one VGG16 network, so the thirty matrices form ten 32 × 32 × 3 inputs, one for each of the ten VGG16 structures.
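The following minimal sketch illustrates steps 1-3 under stated assumptions: the window averaging and triple grouping follow the text directly, while the exact arrangement of window values inside each 32 × 32 matrix w_{k,j} was not recoverable from the source, so the row-major tiling used here (in the hypothetical helper `make_w`) is only one plausible reading.

# Sketch of the pre-processing pipeline (steps 1-3). `make_w`'s row-major
# tiling of window values into 32x32 is an assumption, not the paper's
# verified layout.
import numpy as np

def window_average(raw: np.ndarray, win: int = 21) -> np.ndarray:
    """raw: (30, L) per-residue features -> (30, L) window-averaged m[k, j]."""
    pad = win // 2                                   # 10 zeros on each side
    padded = np.pad(raw, ((0, 0), (pad, pad)))       # zero padding at termini
    return np.stack([padded[:, j:j + win].mean(axis=1)
                     for j in range(raw.shape[1])], axis=1)

def make_w(m: np.ndarray, k: int, j: int, size: int = 32) -> np.ndarray:
    """Assumed row-major tiling of the values m[k, j'] around position j."""
    L = m.shape[1]
    half = size * size // 2
    idx = np.arange(j - half, j + half)              # 1024 window positions
    vals = np.where((idx >= 0) & (idx < L), m[k, np.clip(idx, 0, L - 1)], 0.0)
    return vals.reshape(size, size)

def vgg_inputs(m: np.ndarray, j: int) -> list:
    """Ten 32x32x3 inputs for residue j: triples of consecutive w_{k,j}."""
    w = [make_w(m, k, j) for k in range(30)]
    return [np.stack(w[3 * b:3 * b + 3], axis=-1) for b in range(10)]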

Designing and Training Our Deep Neural Network
The overall architecture of our deep neural network is comprised of ten identical deep neural structures, each derived from the original VGG16 model, together with an MLP network. Each of these ten identical deep neural architectures, shown in Figure 3, only utilizes the convolutional blocks of the original VGG16 model; that is, we erase the fully-connected (FC) layers of VGG16 but preserve the remaining VGG16 model as well as its associated weights and biases. The outputs of these ten identical deep neural structures are then used as the inputs of our MLP network for classification. The MLP network we construct is a fully-connected network with two hidden layers, and it employs the weighted cross-entropy loss function

L = -\frac{1}{m} \sum_{i=1}^{m} \left[ \alpha_1 y^{(i)} \log \hat{y}^{(i)} + \alpha_2 \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right],    (4)

where m represents the batch size, i.e., the number of amino-acids used in each iteration during training; y^{(i)} is a binary indicator, with 1 suggesting that the i-th amino-acid is disordered and 0 indicating that it is ordered; \hat{y}^{(i)} is the predicted probability that y^{(i)} = 1; and, similarly, 1 - \hat{y}^{(i)} denotes the predicted probability that y^{(i)} = 0. The parameters \alpha_1 and \alpha_2 are class weights computed through

\alpha_1 = \frac{N_{ord}}{N_{ord} + N_{dis}},    (5)

\alpha_2 = \frac{N_{dis}}{N_{ord} + N_{dis}},    (6)

where N_{ord} and N_{dis} correspond to the numbers of ordered and disordered amino-acids in the training dataset, respectively.

Because we obtain a 1 × 1 × 512 matrix from the output of each VGG16 structure, ten 1 × 1 × 512 matrices are obtained from our deep neural network. Each of these 1 × 1 × 512 matrices is first flattened into a 1 × 512 vector using a flattening layer. Subsequently, the concatenation function from the Keras framework [35] is utilized to concatenate these ten 1 × 512 vectors into a 1 × 5120 vector, which serves as the input to our MLP network. Each of the two hidden layers of our MLP network contains 512 perceptrons, and the activation function for both hidden layers is the rectified linear unit (ReLU). Moreover, we employ the dropout algorithm [36], with a dropout percentage of 50% in each hidden layer of our MLP, to reduce overfitting. The output layer of our MLP network contains two perceptrons and employs softmax as the activation function. The softmax activation function is defined as

\mathrm{softmax}\left(s^{(i)}_p\right) = \frac{e^{s^{(i)}_p}}{\sum_{q=1}^{2} e^{s^{(i)}_q}},    (7)

where s^{(i)}_p represents the p-th perceptron output value for the i-th amino-acid. The predicted probability that y^{(i)} = 1 is then

\hat{y}^{(i)} = \frac{e^{s^{(i)}_1}}{e^{s^{(i)}_1} + e^{s^{(i)}_2}},    (8)

and, similarly, the predicted probability that y^{(i)} = 0 is

1 - \hat{y}^{(i)} = \frac{e^{s^{(i)}_2}}{e^{s^{(i)}_1} + e^{s^{(i)}_2}}.    (9)

We randomly subdivide the amino-acids of our training dataset DIS1450 into disjoint batches, each containing 128 amino-acids. The training process proceeds as follows: for each amino-acid in a given batch, we use our deep neural network to compute the predicted probabilities defined in (8) and (9). Once all the predicted probabilities in the batch have been computed, we utilize Equation (4) to estimate the average loss of the batch. This average loss is then used to update the weights and biases of our MLP network through the Adam algorithm [37], with an initial learning rate of 0.0005, to perform the back propagation and updating. The above process is repeated for each batch until all batches have been exhausted, which is referred to as an epoch. Subsequently, we randomly subdivide the amino-acids of DIS1450 again into disjoint batches of 128 amino-acids each and repeat the above procedure until the loss stops decreasing or the maximum number of epochs is reached. Figure 4 shows the learning curve of the model. A minimal Keras sketch of this architecture and training setup is given below.
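The following Keras sketch reflects this architecture under stated assumptions: the helper names (`build_disvgg`, `weighted_cross_entropy`), the ImageNet initialization, and the class ordering of the outputs are illustrative, and the loss implements our reconstruction of Equations (4)-(6) rather than a verified reference implementation.

# Minimal Keras sketch of the DISvgg architecture and training setup.
# Helper names and the ImageNet initialization are assumptions.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def weighted_cross_entropy(n_ord: int, n_dis: int):
    """Weighted cross entropy as reconstructed in Equations (4)-(6)."""
    a1 = n_ord / (n_ord + n_dis)   # weight on the disordered (minority) term
    a2 = n_dis / (n_ord + n_dis)   # weight on the ordered (majority) term
    def loss(y_true, y_pred):
        # Assumes one-hot labels with the disordered class in column 0.
        y_hat = tf.clip_by_value(y_pred[:, 0], 1e-7, 1 - 1e-7)  # P(disordered)
        y = y_true[:, 0]
        return -tf.reduce_mean(a1 * y * tf.math.log(y_hat)
                               + a2 * (1 - y) * tf.math.log(1 - y_hat))
    return loss

def build_disvgg(num_branches: int = 10) -> keras.Model:
    inputs, features = [], []
    for b in range(num_branches):
        inp = layers.Input(shape=(32, 32, 3), name=f"branch_{b}")
        # Convolutional blocks of VGG16 only (FC layers removed), with the
        # pretrained weights preserved; 32x32 is VGG16's minimum input size.
        backbone = keras.applications.VGG16(include_top=False,
                                            weights="imagenet",
                                            input_shape=(32, 32, 3))
        backbone._name = f"vgg16_{b}"           # avoid duplicate model names
        features.append(layers.Flatten()(backbone(inp)))  # 1x1x512 -> 512
        inputs.append(inp)
    x = layers.Concatenate()(features)          # ten 512-d vectors -> 5120-d
    for _ in range(2):                          # two hidden layers of the MLP
        x = layers.Dense(512, activation="relu")(x)
        x = layers.Dropout(0.5)(x)              # 50% dropout per hidden layer
    out = layers.Dense(2, activation="softmax")(x)
    return keras.Model(inputs, out)

model = build_disvgg()
# Class counts below are the DIS1616 totals quoted earlier; the paper's
# actual weights would come from the DIS1450 training set.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=5e-4),
              loss=weighted_cross_entropy(n_ord=706_362, n_dis=182_316))
# model.fit(train_inputs, train_labels, batch_size=128, epochs=...)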
After training is completed, the predicted probabilities defined in (8) and (9) are used to determine whether the i-th amino-acid is ordered or disordered, where the majority-voting principle is employed.

Performance Evaluation
The performance of IDP prediction is evaluated through sensitivity (Sens), specificity (Spec), balanced accuracy (BACC), and the Matthews correlation coefficient (MCC). The formulae for computing Sens, Spec, BACC, and MCC are

\mathrm{Sens} = \frac{TP}{TP + FN},

\mathrm{Spec} = \frac{TN}{TN + FP},

\mathrm{BACC} = \frac{1}{2}\left(\mathrm{Sens} + \mathrm{Spec}\right),

\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.

In these equations, TP, FP, TN, and FN represent true positives (correctly predicted disordered residues), false positives (ordered residues incorrectly identified as disordered), true negatives (correctly predicted ordered residues), and false negatives (disordered residues incorrectly identified as ordered), respectively. The values of MCC fall between −1 and +1, where 0 indicates a lack of correlation; the higher the MCC value, the better the prediction.
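A direct implementation of these four metrics is straightforward; the sketch below, with an illustrative function name, follows the formulae above.

# Straightforward implementation of Sens, Spec, BACC, and MCC from counts.
import math

def evaluate(tp: int, fp: int, tn: int, fn: int) -> dict:
    sens = tp / (tp + fn)                  # sensitivity (recall on disorder)
    spec = tn / (tn + fp)                  # specificity
    bacc = (sens + spec) / 2               # balanced accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sens": sens, "Spec": spec, "BACC": bacc, "MCC": mcc}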

Results and Discussion
Throughout the rest of this paper, DISvgg is used as the abbreviation for our deep neural network. DISvgg is compared with two recent prediction methods that claim to have the best performance, i.e., RFPR-IDP (available at http://bliulab.net/RFPR-IDP/server (accessed on 26 March 2021)) and SPOT-Disorder2 (available at https://sparks-lab.org/server/spot-disorder2/ (accessed on 26 March 2021)). Ten-fold cross validation is performed on the training dataset DIS1450. On the test set DIS166 and the blind test sets R80 and MXD494, the MCC values of our trained deep neural network are 0.5132, 0.5270, and 0.4577, respectively, and all of them exceed the corresponding values of RFPR-IDP and SPOT-Disorder2. Tables 1-3 list all of the related values for evaluating the performance on the test set DIS166 and the blind test sets R80 and MXD494.

The Performance Comparison on the Test Dataset DIS166 and Blind Test Datasets
In this section, we present the results obtained from applying our trained DISvgg, as well as the existing prediction methods RFPR-IDP and SPOT-Disorder2, to the independent test dataset DIS166 and the blind test datasets R80 and MXD494. Tables 1-3 summarize all of the values used for evaluating the performance of these prediction methods.

Conclusions
The accurate identification of intrinsically disordered proteins or protein regions is of great importance, as they are involved in critical biological processes and related to various human diseases. In this paper, we develop a deep neural network that is based on the well-known VGG16. Our deep neural network is trained using 1450 proteins from the dataset DIS1616, and the trained neural network is tested on the remaining 166 proteins. Our trained neural network is also tested on the blind test sets R80 and MXD494 to further demonstrate the performance of our model. The MCC value of our trained deep neural network is 0.5132 on the test set DIS166, 0.5270 on the blind test set R80, and 0.4577 on the blind test set MXD494. All of these MCC values exceed the corresponding values of existing prediction methods.

Data Availability Statement: Publicly available datasets were analyzed in this study. The dataset DIS1616 can be found on the DisProt website (http://www.disprot.org/ (accessed on 26 March 2021)), and the dataset R80 is provided in the published work [30].