Identifying Intrinsically Disordered Protein Regions through a Deep Neural Network with Three Novel Sequence Features

The fast, reliable, and accurate identification of intrinsically disordered protein regions (IDPRs) is essential, as it has become increasingly clear in recent years that IDPRs influence many important physiological processes, such as molecular recognition and molecular assembly, the regulation of transcription and translation, protein phosphorylation, and cellular signal transduction. For the sake of cost-effectiveness, it is imperative to develop computational approaches for identifying IDPRs. In this study, a deep neural structure in which a variant VGG19 is situated between two MLP networks is developed for identifying IDPRs. Furthermore, three novel sequence features, namely the persistent entropy and the probabilities associated with two and three consecutive amino acids of the protein sequence, are introduced for the first time for identifying IDPRs. The simulation results show that our neural structure either performs considerably better than other known methods or, when relying on a much smaller training set, attains a similar performance. Our deep neural structure, which exploits the VGG19 structure, is effective for identifying IDPRs, and the three novel sequence features could serve as valuable inputs in the further development of IDPR identification.


Introduction
Protein regions which lack stable three-dimensional structures are referred to as intrinsically disordered protein regions (IDPRs) [1]. In recent years, it has become increasingly clear that IDPRs have a profound impact on many important physiological processes [2,3], such as molecular recognition and molecular assembly, the regulation of transcription and translation, protein phosphorylation, and cellular signal transduction [4][5][6]. Furthermore, some human diseases, such as certain types of cancer, Parkinson's disease, and cardiovascular disease [7][8][9], have been found to be linked with IDPRs. However, the experimental methods used to identify IDPRs are usually expensive and time-consuming [10]. Thus, the fast, reliable, and accurate identification of IDPRs by computational methods is a valuable complement to experimental studies.
In this paper, we develop a deep neural structure for identifying IDPRs, composed of a variant VGG19 [30] situated between two multilayer perceptron (MLP) networks. In the variant VGG19, we erase the fully connected (FC) layers of VGG19 but preserve the rest of the VGG19 structure and its associated parameters. In comparison with ResNet, the parameters of VGGNet can be easily manipulated. An MLP network consists of an input layer, hidden layers, and an output layer; the two MLPs transform the features into formats suitable as inputs to the variant VGG19 and to the classification network, respectively. Compared with our previous DISpre algorithm [31] and DISvgg algorithm [16], we introduce VGG19 as part of the network instead of using a single MLP network, and we use one VGG19 instead of ten VGG16 networks. Moreover, to further improve the prediction performance, we introduce new features. For the first time, three sequence features, namely the persistent entropy based on persistent homology and the probabilities associated with two and three consecutive amino acids of the protein sequence (PCAA2, PCAA3), are introduced for identifying IDPRs. These three novel sequence features, together with those used in [32] (two sequence features, seven physicochemical propensities, three propensities of amino acids, and twenty evolutionary features), are used as the inputs for our neural structure. The simulation results obtained for two blind testing sets, R80 [25] and MXD494 [33], show that our neural structure either performs considerably better than other well-known methods [17,20] or, when relying on a much smaller training set (DIS1616) than the one used in [18], attains a similar performance.

Datasets and Input Features
In this section, we present the datasets used in this paper for training and blind testing and describe the features extracted from them. In particular, we introduce three novel features, used here for the first time for identifying IDPRs: the persistent entropy based on persistent homology, PCAA2, and PCAA3.

Datasets
The dataset DIS1616 from DisProt [34] (accessed in June 2020) is employed for training and cross-validation, while the datasets R80 [25] and MXD494 [33] are used for blind testing. The training dataset DIS1616 consists of 1616 protein sequences, which contain 182,316 disordered and 706,362 ordered amino acids. The dataset DIS1616 is randomly split into two subsets, DIS1450 and DIS166, which contain 1450 and 166 protein sequences and are used for training and testing, respectively. The blind testing dataset R80 has 78 protein sequences, with 3566 disordered and 29,243 ordered amino acids. The blind testing dataset MXD494 contains 494 protein sequences, with 44,087 disordered and 152,414 ordered amino acids.

Input Features Used for the Identification of IDPRs
The features fed to our neural structure for identifying IDPRs can be summarized as five sequence features, seven physicochemical propensities, and three propensities of amino acids, as well as twenty evolutionary features of the given protein sequence. Of these five sequence features, the persistent entropy based on persistent homology, PCAA2, and PCAA3 are introduced for the first time for identifying IDPRs. The remaining two sequence features are the Shannon entropy and the topological entropy [32]; the topological entropy is used to describe the complexity of the protein sequence. The seven physicochemical properties of the amino acids are the steric parameter, polarizability, volume, hydrophobicity, isoelectric point, helix probability, and sheet probability, as described in [35]. The three propensities of the amino acids are Remark 465, Deleage/Roux, and Bfactor (2STD), which are derived from the GlobPlot NAR paper [12]. The twenty evolutionary features can be determined through the Position-Specific Scoring Matrix (PSSM) [36], which is computed using the Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST) [37].

The Computation of Persistent Entropy
In this section, we briefly illustrate the procedure used for computing the persistent homology, as well as its persistent entropy, from a given protein sequence. More information related to the computation of the persistent homology and its persistent entropy can be found in [38,39].
Given a protein sequence ŵ = w_1 ⋯ w_L of length L, we choose a sliding window of odd length N (N < L) to extract N consecutive amino acids from ŵ. For simplicity, we first transform ŵ into a sequence of length L + N − 1 by appending (N − 1)/2 amino acids to each end of ŵ; the amino acids appended at the two ends are copies of the first and last amino acids of ŵ, respectively. Sliding a window of size N over the transformed sequence then yields L amino acid subsequences β_j = w_j ⋯ w_{j+(N−1)} with 1 ≤ j ≤ L.

To compute the persistent entropy of β_j, we need to map each amino acid w_m in β_j to a point, which leads us to define

x_m(k) = δ(w_m, a_k), 1 ≤ k ≤ 20, (1)

where δ(·) is the delta function and a_1, ..., a_20 represent the set of twenty amino acid symbols through a one-to-one correspondence. Thus, each amino acid symbol w_m corresponds to a point

x_m = (x_m(1), ..., x_m(20)) ∈ R^20, (2)

and the subsequence β_j is represented by the point cloud

X(β_j) = {x_j, ..., x_{j+(N−1)}}. (3)

For a filtration V with persistence barcode B(V) = {[ε_s^i, ε_e^i)}_{i=1}^n (defined below), the persistent entropy of V is

E(V) = −Σ_{i=1}^n (ℓ_i / S_V) log(ℓ_i / S_V), where ℓ_i = ε_e^i − ε_s^i and S_V = Σ_{i=1}^n ℓ_i. (4)

In our setting, V denotes the filtration

V : VR(β_j, ε_1) ⊆ VR(β_j, ε_2) ⊆ ⋯ ⊆ VR(β_j, ε_l), 0 < ε_1 < ⋯ < ε_l = K, (5)

with its associated persistence diagram dgm(V). In Equation (5), the simplicial complex VR(β_j, ε) (0 < ε ≤ K) is chosen to be the Vietoris-Rips complex of β_j, which is defined as

VR(β_j, ε) = {σ ⊆ X(β_j) : ‖x_u − x_v‖ ≤ ε for all x_u, x_v ∈ σ}. (6)

Given a filtration V defined by (5), a barcode in the k-dimensional persistence with endpoints [ε_s, ε_e) corresponds to a k-dimensional hole that appears at filtration time ε_s and remains until filtration time ε_e. The set of bars [ε_s, ε_e), representing the birth and death times of homology classes, is called the persistence barcode B(V) of the filtration V in (5). Analogously, the set of points (ε_s, ε_e) ∈ R² is called the persistence diagram dgm(V) of the filtration V in (5). The persistent entropy of each amino acid w_j (1 ≤ j ≤ L) is then defined as E(V) in (4), computed for the filtration (5) built on the window β_j centered at w_j.
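As a concrete illustration of this pipeline, the following Python sketch maps a window of amino acids to one-hot points via the delta function of (1), builds a Vietoris-Rips filtration, and computes the persistent entropy of (4) from the resulting finite bars. It assumes the GUDHI library for the filtration; the choices of max_edge_length and the homology dimension used here are illustrative, not taken from the paper.

```python
# A minimal sketch of the persistent-entropy computation described above.
# Assumptions: the GUDHI library builds the Vietoris-Rips filtration, and
# each amino acid is mapped to a 20-dimensional one-hot point via the
# delta-function correspondence of Equation (1); max_edge_length and the
# homology dimension are illustrative choices, not the paper's.
import numpy as np
import gudhi  # pip install gudhi

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-to-one correspondence a_1..a_20

def one_hot_points(window: str) -> np.ndarray:
    """Map each amino acid w_m of the window to x_m in R^20,
    with x_m(k) = delta(w_m, a_k)."""
    pts = np.zeros((len(window), 20))
    for m, w in enumerate(window):
        pts[m, AMINO_ACIDS.index(w)] = 1.0
    return pts

def persistent_entropy(window: str, max_edge_length: float = 2.0) -> float:
    pts = one_hot_points(window)
    rips = gudhi.RipsComplex(points=pts, max_edge_length=max_edge_length)
    st = rips.create_simplex_tree(max_dimension=1)
    st.compute_persistence()
    # Bars [eps_s, eps_e); keep finite bars and form lengths l_i = eps_e - eps_s.
    bars = [(b, d) for b, d in st.persistence_intervals_in_dimension(0)
            if np.isfinite(d)]
    lengths = np.array([d - b for b, d in bars])
    if lengths.sum() == 0:
        return 0.0
    p = lengths / lengths.sum()          # p_i = l_i / S_V, as in Equation (4)
    return float(-(p * np.log(p)).sum())

# Example: entropy of one sliding window of odd length N = 5.
print(persistent_entropy("MKTAY"))
```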

The Computation of the Features Using the Probabilities Associated with the Protein Sequence
The probability associated with two or three consecutive amino acids of a protein sequence depends on the probability of each amino acid occurring in the observed protein, which in turn depends on the protein sequence length and the number of occurrences of each individual amino acid in the sequence. We pool all amino acids from all proteins in DIS1616 and, based on this set, calculate the probabilities associated with two and three consecutive amino acids of the protein sequence. Consider the given protein sequence ŵ = w_1 ⋯ w_L. For convenience, we define two sets

Φ_2 = {a_u a_v : 1 ≤ u, v ≤ 20}, (7)

Φ_3 = {a_u a_v a_t : 1 ≤ u, v, t ≤ 20}, (8)

which represent all possible combinations of two or three consecutive amino acids in this protein sequence, with |Φ_2| = 400 and |Φ_3| = 8000. The two novel features introduced in this paper are

H_2(j), 1 ≤ j ≤ L, (9)

H_3(j), 1 ≤ j ≤ L, (10)

which can be derived from the probability features P_2 and P_3, respectively, associated with two or three consecutive amino acids of the protein sequence ŵ = w_1 ⋯ w_L. Using the delta function δ(·), H_2(j) and H_3(j) in Equations (9) and (10) can, respectively, be computed as

H_2(j) = Σ_{k=1}^{400} N_2(k) δ(I_2(j), k), (11)

H_3(j) = Σ_{k=1}^{8000} N_3(k) δ(I_3(j), k). (12)

In view of (7) and (8), the index functions I_2(j) and I_3(j) in (11) and (12) are defined as

I_2(j) = the index of w̄_j in Φ_2, (13)

I_3(j) = the index of w̃_j in Φ_3, (14)

where w̄_j and w̃_j represent w_j w_{j+1} (1 ≤ j ≤ L − 1) and w_j w_{j+1} w_{j+2} (1 ≤ j ≤ L − 2), respectively. It is easy to verify that 1 ≤ I_2(j) ≤ 400 for 1 ≤ j ≤ L − 1 and 1 ≤ I_3(l) ≤ 8000 for 1 ≤ l ≤ L − 2. The functions N_2(·) and N_3(·) in (11) and (12), defined over the sets {1, ..., 400} and {1, ..., 8000}, respectively, are scaled versions of the probability features P_2 and P_3. The probability features P_2 and P_3 associated with two and three consecutive amino acids of the protein sequence, respectively, are equal to

P_2(k) = Σ_{ŵ∈Ω} M^ŵ_{Φ_2}(k) / Σ_{l=1}^{400} Σ_{ŵ∈Ω} M^ŵ_{Φ_2}(l), (17)

P_3(k) = Σ_{ŵ∈Ω} M^ŵ_{Φ_3}(k) / Σ_{l=1}^{8000} Σ_{ŵ∈Ω} M^ŵ_{Φ_3}(l), (18)

where the set of protein sequences is denoted by Ω (in this paper, Ω = DIS1616). For a given protein sequence ŵ = w_1 ⋯ w_L, the functions M^ŵ_{Φ_2}(k) and M^ŵ_{Φ_3}(k), which count the total number of occurrences in ŵ of the k-th combination of two or three consecutive amino acids, respectively, are equal to

M^ŵ_{Φ_2}(k) = Σ_{j=1}^{L−1} δ(I_2(j), k), 1 ≤ k ≤ 400, (19)

M^ŵ_{Φ_3}(k) = Σ_{j=1}^{L−2} δ(I_3(j), k), 1 ≤ k ≤ 8000. (20)
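The following Python sketch mirrors this construction: it estimates the k-mer probabilities P_2 and P_3 of (17) and (18) over a small corpus standing in for Ω = DIS1616, and then looks up the per-position features H_2(j) and H_3(j) of (11) and (12). Since the paper's exact scaling from P to N is not reproduced here, the `scaled` helper uses an assumed division by the maximum value.

```python
# A minimal sketch of the PCAA2/PCAA3 features: relative frequencies of
# 2- and 3-mers estimated over a training corpus (a stand-in for DIS1616),
# then looked up per position of a query sequence. The min-max scaling in
# `scaled` is an assumption; the paper only states that N_2 and N_3 are
# scaled versions of P_2 and P_3.
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kmer_probabilities(corpus, k):
    """P_k: relative frequency of each k-mer over all sequences in Omega."""
    counts = Counter()
    for seq in corpus:
        for j in range(len(seq) - k + 1):
            counts[seq[j:j + k]] += 1
    total = sum(counts.values())
    return {"".join(c): counts["".join(c)] / total
            for c in product(AMINO_ACIDS, repeat=k)}

def scaled(probs):
    """Assumed scaling: divide by the maximum so values lie in [0, 1]."""
    m = max(probs.values()) or 1.0
    return {kmer: p / m for kmer, p in probs.items()}

def pcaa_features(seq, n2, n3):
    """H_2(j) and H_3(j) for each position j of the sequence."""
    h2 = [n2[seq[j:j + 2]] for j in range(len(seq) - 1)]
    h3 = [n3[seq[j:j + 3]] for j in range(len(seq) - 2)]
    return h2, h3

corpus = ["MKTAYIAKQR", "GAVLIPFMW"]        # stand-in for DIS1616
n2 = scaled(kmer_probabilities(corpus, 2))  # N_2 over the 400 pairs
n3 = scaled(kmer_probabilities(corpus, 3))  # N_3 over the 8000 triples
print(pcaa_features("MKTA", n2, n3))
```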

Pre-Processing the Data Extracted from the Protein Sequences
In this section, we illustrate how to compute the input of our deep neural network, which is composed of 35 features derived from a protein sequence. Of these 35 features, twenty evolutionary features are determined through the PSSM [36], computed using PSI-BLAST [37]. The seven physicochemical properties of the amino acids are the steric parameter, polarizability, volume, hydrophobicity, isoelectric point, helix probability, and sheet probability, which can be obtained from [35]. The three propensities of the amino acids are Remark 465, Deleage/Roux, and Bfactor (2STD), as detailed in the GlobPlot NAR paper [12]. The remaining two features, used to measure the complexity of the protein sequence, are the Shannon entropy and the topological entropy [32].
Given a protein sequence ŵ = w_1 ⋯ w_L of length L, we choose a sliding window of odd size N (N < L) to extract N consecutive amino acids. For the amino acids in the sliding window, we compute the evolutionary features, physicochemical properties, and propensities defined in the previous paragraph. These thirty computed feature values are averaged over the window, and the averaged results are used to represent the feature values of the amino acid at the center of the window. For simplicity, we first transform ŵ into a sequence of size L + N − 1 by appending (N − 1)/2 zeros to both ends of the protein sequence. With this sliding window of size N, we also compute the Shannon and topological entropy through the procedure of Equations (1)-(14) of the paper [32], as well as the persistent entropy defined in (4). Thus, each w_j with 1 ≤ j ≤ L in the protein sequence ŵ = w_1 ⋯ w_L can be associated with a 33-dimensional feature vector

v_j = [m_{1,j} m_{2,j} ... m_{33,j}]^T, (23)

where the entries m_{k,j} for 1 ≤ k ≤ 20, 21 ≤ k ≤ 27, 28 ≤ k ≤ 30, and 31 ≤ k ≤ 33, respectively, correspond to the 20-dimensional PSSM of the evolutionary information [36,37], the seven physicochemical properties [35], the three propensities of amino acids from the paper [12], and the three entropies (Shannon, topological [32], and persistent). We also use Equations (11) and (12) to compute the two novel features H_2(j) and H_3(j) (1 ≤ j ≤ L) associated with two and three consecutive amino acids of the protein sequence, and set m_{34,j} = H_2(j) and m_{35,j} = H_3(j). Finally, we extend the 33 × L feature matrix [v_1 ... v_L] built from (23) to a 35 × L feature matrix

X = [x_1 x_2 ... x_L], (24)

with

x_j = [v_j^T m_{34,j} m_{35,j}]^T, 1 ≤ j ≤ L, (25)

where v_j (1 ≤ j ≤ L) is defined in (23). The input to our deep neural network is x_j (1 ≤ j ≤ L), as defined in (25).
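A minimal sketch of this feature assembly is given below, assuming stand-in arrays for the thirty per-residue base features and the three entropies; it performs the zero-padded window averaging and stacks the rows into the 35 × L input of (24) and (25).

```python
# A minimal sketch of assembling the 35 x L input: per-residue base
# features (PSSM, physicochemical properties, propensities) are averaged
# over a sliding window of odd size N with zero padding, then stacked with
# the three entropies and H_2/H_3. The feature sources themselves
# (PSI-BLAST output, property tables) are random stand-ins here.
import numpy as np

def window_average(base, N):
    """base: 30 x L matrix of per-residue features; returns the 30 x L
    matrix of window-averaged values, padding (N-1)/2 zeros at both ends."""
    half = (N - 1) // 2
    padded = np.pad(base, ((0, 0), (half, half)))  # zeros at both ends
    return np.stack([padded[:, j:j + N].mean(axis=1)
                     for j in range(base.shape[1])], axis=1)

def build_input(base30, entropies3, h2, h3, N=33):
    """Rows 1-30: window-averaged PSSM/properties/propensities;
    rows 31-33: Shannon, topological, persistent entropy;
    rows 34-35: H_2(j) and H_3(j), as in Equations (24) and (25)."""
    averaged = window_average(base30, N)
    return np.vstack([averaged, entropies3,
                      np.asarray(h2)[None, :], np.asarray(h3)[None, :]])

L = 50
x = build_input(np.random.rand(30, L), np.random.rand(3, L),
                np.random.rand(L), np.random.rand(L))
print(x.shape)  # (35, 50)
```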

The Structure of Our Neural Network and Training Procedure
In this section, we develop a deep neural structure for identifying IDPRs, in which a variant VGG19 is situated between two MLP networks. We then describe the process of training the deep neural network.

The Structure of Our Deep Neural Network
The overall architecture of our model, shown in Figure 1, consists of a variant VGG19 in cascade with two MLP networks, with the variant VGG19 situated between the two MLPs. In the variant VGG19, we erase the fully connected (FC) layers of VGG19 but preserve the remaining VGG19 structure and its associated weights and biases. The output of the variant VGG19 is a 1 × 3675 vector. As shown in Figure 2b, a skip connection is employed: the sum of the output of the variant VGG19 and the output of the MLP network fed with the features defined in (25) is used as the input to the second MLP network. This MLP network contains one hidden layer with 3675 neurons, whose activation functions are chosen to be the ReLU. The output layer has only one neuron with the sigmoid function as its activation function, i.e.,

z_i = 1 / (1 + e^{−s_i}), 1 ≤ i ≤ L, (26)

where s_i is the input to the output neuron, z_i is the output of the sigmoid function, and the index i refers to the i-th amino acid in the protein sequence ŵ = w_1 ⋯ w_L. The dropout algorithm [40] with a dropout percentage of 50% is employed for this MLP network. The total loss function of our model for a package of size m (i.e., the number of amino acids used in each iteration during the training) is the averaged cross-entropy

Loss = −(1/m) Σ_{i=1}^m [ y_i ln a_i + (1 − y_i) ln(1 − a_i) ]. (27)

In Equation (27), the predicted probability a_i of the output y_i = 1 is equal to

a_i = z_i, (28)

where y_i is equal to either 1, indicating that the i-th amino acid is disordered, or 0, indicating that it is ordered.
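The following PyTorch sketch follows the description above: the FC layers of VGG19 are dropped, the variant VGG19 sits between two MLPs, a skip connection sums its output with the first MLP's output, and the final MLP has one 3675-neuron ReLU hidden layer with 50% dropout and a single sigmoid output. How the 35-feature input is reshaped for the convolutional stack (here 3 × 35 × 35 = 3675 values) and the linear projection that restores the convolutional output to 3675 dimensions are assumptions, since the paper does not spell out these shapes.

```python
# A schematic sketch of MLP-VGG19-MLP, grounded in the text: VGG19 without
# its FC layers between two MLPs, a skip connection summing the variant
# VGG19 output with the first MLP output, and a final MLP with one
# 3675-neuron ReLU hidden layer, 50% dropout, and one sigmoid output.
# Assumptions: the 3 x 35 x 35 reshape and the 512 -> 3675 projection.
import torch
import torch.nn as nn
from torchvision.models import vgg19

class MLPVGG19MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.mlp_in = nn.Sequential(nn.Linear(35, 3675), nn.ReLU())
        self.conv = vgg19(weights="IMAGENET1K_V1").features  # FC layers erased
        self.project = nn.Linear(512, 3675)  # assumed projection to 1 x 3675
        self.mlp_out = nn.Sequential(
            nn.Linear(3675, 3675), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(3675, 1), nn.Sigmoid())

    def forward(self, x):            # x: (batch, 35) per-residue features
        h = self.mlp_in(x)           # (batch, 3675)
        img = h.view(-1, 3, 35, 35)  # assumed reshape for the conv stack
        f = self.conv(img).flatten(1)          # (batch, 512) after pooling
        z = self.mlp_out(self.project(f) + h)  # skip connection, then MLP
        return z.squeeze(1)          # z_i in (0, 1), as in Equation (26)

model = MLPVGG19MLP()
print(model(torch.rand(4, 35)).shape)  # torch.Size([4])
```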

Training Procedure
In this section, we present the process of training the deep neural network developed in the previous section. The training dataset used in this paper is DIS1450 from DisProt [34]. We pool all amino acids from all proteins in DIS1450 and randomly divide them into packages of 128 amino acids. The training procedure is as follows. For each amino acid in a given package, we use the deep neural network constructed above to calculate the predicted probability defined in Equation (28). Once all predicted probabilities for the package have been calculated, we use Equation (27) to compute the average loss for the package. This averaged loss is used to update the weights and biases of our network via the stochastic gradient descent (SGD) algorithm [41] with learning rate η = 0.0001. We repeat this process until all packages have been processed; one full pass over the packages constitutes an epoch. Epochs are repeated until the loss function converges or the maximum number of epochs is reached.
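A minimal sketch of one training run under these settings is shown below, reusing the MLPVGG19MLP model sketched in the previous section; the `features` and `labels` tensors are stand-ins for the pre-computed per-residue inputs and order/disorder labels, and the stopping test is simplified to a fixed epoch count.

```python
# A minimal sketch of the training procedure: residues pooled from all
# DIS1450 proteins are shuffled into packages of 128, and each package's
# average cross-entropy (Equation (27)) drives one SGD step with learning
# rate 0.0001. `features` and `labels` are random stand-ins.
import torch
from torch.utils.data import DataLoader, TensorDataset

model = MLPVGG19MLP()  # the architecture sketched in the previous section

features = torch.rand(1024, 35)                # stand-in per-residue inputs
labels = torch.randint(0, 2, (1024,)).float()  # 1 = disordered, 0 = ordered

loader = DataLoader(TensorDataset(features, labels),
                    batch_size=128, shuffle=True)  # packages of 128 residues
optimizer = torch.optim.SGD(model.parameters(), lr=0.0001)
loss_fn = torch.nn.BCELoss()          # average loss over the package

for epoch in range(10):               # stand-in for convergence / max epochs
    for x, y in loader:
        optimizer.zero_grad()
        a = model(x)                  # predicted probabilities a_i
        loss = loss_fn(a, y)          # Equation (27) for this package
        loss.backward()
        optimizer.step()
```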

Performance Evaluation
Four metrics were used to evaluate the performance of IDPR prediction [42]: sensitivity (Sens), specificity (Spec), balanced accuracy (BACC), and the Matthews correlation coefficient (MCC). The related formulas are as follows:

Sens = TP / (TP + FN),

Spec = TN / (TN + FP),

BACC = (Sens + Spec) / 2,

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)),

where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively. The MCC takes values between −1 and 1, and the prediction accuracy for both ordered and disordered residues increases as the MCC approaches 1.
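For completeness, a small Python helper that computes the four metrics exactly as defined above from raw counts:

```python
# Compute Sens, Spec, BACC, and MCC from raw TP/FP/TN/FN counts,
# following the formulas given above.
import math

def metrics(tp, fp, tn, fn):
    sens = tp / (tp + fn)                    # sensitivity
    spec = tn / (tn + fp)                    # specificity
    bacc = (sens + spec) / 2                 # balanced accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sens, spec, bacc, mcc

print(metrics(tp=80, fp=20, tn=150, fn=10))  # example counts
```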

Experimental Results
In this section, we demonstrate the performance of our deep neural network on three test sets: DIS166 [34], R80 [25], and MXD494 [33]. For comparison, we also present the simulation results of well-known predictors on these datasets: RFPR-IDP (available at http://bliulab.net/RFPR-IDP/server (accessed on 26 March 2021)), SPOT-Disorder2 (available at https://sparks-lab.org/server/spot-disorder2/ (accessed on 26 March 2021)), DISvgg [16], and IDP-Seq2Seq [18]. For convenience, we refer to our method as MLP-VGG19-MLP. Ten-fold cross-validation was performed on the training dataset DIS1450. The results of MLP-VGG19-MLP with different window sizes are shown in Table 1, and the BACC and MCC values achieved with different sliding window sizes are shown in Figure 3. When the sliding window size exceeded 33, the values leveled off; we therefore used a sliding window size of N = 33 in the subsequent simulations. On the test sets DIS166, R80, and MXD494, the performance of MLP-VGG19-MLP was superior to that of RFPR-IDP, SPOT-Disorder2, and DISvgg. The MCC value of MLP-VGG19-MLP was 0.5674 on the test set DIS166, 0.5775 on the blind test set R80, and 0.4737 on the blind test set MXD494. The simulation results show that MLP-VGG19-MLP either considerably outperforms these methods or, when relying on a much smaller training dataset than the one used in [18], attains a performance similar to that of IDP-Seq2Seq [18]. Tables 2-4, respectively, present the performances of all these methods on the test sets DIS166, R80, and MXD494.

Conclusions
In this study, a deep neural structure in which a variant VGG19 is situated between two MLP networks is developed for identifying IDPRs. Furthermore, three novel sequence features (persistent entropy, PCAA2, and PCAA3) are introduced for the first time for identifying IDPRs. MLP-VGG19-MLP outperformed our previous DISvgg algorithm while using only one VGG19, whereas ten VGG16 networks were employed in the previous paper. In comparison with RFPR-IDP, SPOT-Disorder2, and IDP-Seq2Seq, MLP-VGG19-MLP relies on a much smaller training set to achieve a performance that is better than or similar to theirs. The three novel sequence features could serve as valuable sequence features in the further development of IDPR identification.
Author Contributions: Z.W. designed the study, carried out the data analysis, and drafted the manuscript. J.Z. participated in the design of the study and revised the paper. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.