Protein Solvent-Accessibility Prediction by a Stacked Deep Bidirectional Recurrent Neural Network

Residue solvent accessibility is closely related to the spatial arrangement and packing of residues. Predicting the solvent accessibility of a protein is an important step to understand its structure and function. In this work, we present a deep learning method to predict residue solvent accessibility, which is based on a stacked deep bidirectional recurrent neural network applied to sequence profiles. To capture more long-range sequence information, a merging operator was proposed when bidirectional information from hidden nodes was merged for outputs. Three types of merging operators were used in our improved model, with a long short-term memory network performing as a hidden computing node. The trained database was constructed from 7361 proteins extracted from the PISCES server using a cut-off of 25% sequence identity. Sequence-derived features including position-specific scoring matrix, physical properties, physicochemical characteristics, conservation score and protein coding were used to represent a residue. Using this method, predictive values of continuous relative solvent-accessible area were obtained, and then, these values were transformed into binary states with predefined thresholds. Our experimental results showed that our deep learning method improved prediction quality relative to current methods, with mean absolute error and Pearson’s correlation coefficient values of 8.8% and 74.8%, respectively, on the CB502 dataset and 8.2% and 78%, respectively, on the Manesh215 dataset.


Introduction
Residue solvent accessibility (RSA) [1] in a protein is defined as the extent of accessible surface area of a given residue and is related to the residue spatial arrangement and packing. It reveals the folding state of proteins and has been considered as a significant quantitative measurement for three-dimensional structures of proteins [2]. Solvent accessibility is closely involved in structural domains' identification [3], fold recognition [4], binding region identification [5], protein-protein interactions [6] and protein-ligand interactions [7]. Therefore, predicting the RSA of a protein represents an important step in determining its structure and function. Traditionally, RSA prediction is performed in two forms: (1) as a binary or multi-class classification problem with varying thresholds (two-state (exposed or buried) [8] or three-state (exposed, intermediate, or buried)) [9]; and (2) based on the relative accessible solvent area (rASA) prediction [10]. For example, if the surface area of a residue exceeds a threshold of 25%, the residue is classified as exposed; however, there is no standard definition of the thresholds for solvent-accessible area. Generally, the later approach is preferred over the former since rASA provides more information compared to binary or multi-class classification. For instance, it provides numerical values, which is required to apply this characteristic in protein structure and function prediction. In view of this, it is necessary to predict rASA. Recently, more and more studies have focused on rASA prediction.
In 2003, Ahmad et al. [10] firstly detailed rASA prediction, and more attention has been paid to this research since that time. Ahmad et al. proposed a neural network method with only single sequence information as the input features, and the result of 18.8% mean absolute error (MAE) was achieved on the CB502 dataset. Wang et al. [18] applied the multiple linear regression method to predict rASA from the sequence information and position-specific scoring matrix (PSSM). This method achieved a result of 16.2% MAE on the CB502 dataset. Wang et al. [17] improved the result to 15.1% on the same dataset by accumulation cutoff set and support vector machine. Using a weighted sliding window scheme, Zhang et al. [24] obtained the result of 14% MAE on the CB502 dataset. Fan et al. [22] used gradient boosted regression trees to predict rASA and achieved a state-of-the-art performance, which is 9.4% MAE and 0.73 Pearson's correlation coefficient (PCC) on the CB502 dataset. Another benchmark dataset Manesh215 [25] is also widely used by researchers [10,13,14,22,24,26] to validate prediction methods. Table 1 summarizes the recent developments in predicting the values of rASA.  [20] k-nearest neighbor PSSM 14. 8 Kashefi, 2013 [30] SVR and scatter search methods PSSM, qualitative physicochemical features 12. 31 Zhang, 2015 [24] Weighted sliding window PSSM, secondary structure, native disorder, physicochemical propensities, sequence-based features 14 Fan, 2016 [22] Gradient boosted regression trees PSSM, secondary structure, native disorder, conservation score, side-chain environment 9.4 The MAEs reported in this table were evaluated on a different dataset.
Although these methods for RSA prediction have progressed, RSA prediction remains challenging. Improved performance in these areas will enable more precision in related protein studies.
Recently, deep learning methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains [45]. The Recurrent Neural Network (RNN), a type of deep neural network, processes an input sequence one element at a time, maintaining a hidden vector that implicitly contains the history information about the past elements of the sequence. RNN is an extremely powerful sequence model for sequence modeling tasks [46], and many RNN methods have been applied to protein structure and function prediction [47][48][49]. We focused on a RNN model specific to protein sequences and applied a bidirectional recurrent neural network (BRNN) to predict RSA. To capture more local and long-range information, a merging operator was used to merge bidirectional information, given that this has rarely been addressed previously. Our deep learning method, a stacked deep bidirectional recurrent neural network (SDBRNN), is proposed, with three-layer bidirectional long-short memory (BLSTM) network with different merging operators used in each hidden layer. The public benchmark datasets CB502, Manesh215 and CASP10 were used for testing, with our experimental results showing that SDBRNN outperformed current state-of-the-art methods. Our method represents a more general approach and can be applied to a broad range of problems within and outside of computational biology. The rASA prediction tool of SDBRNN, training and testing datasets can be download from http://210.45.175.81:8080/rsa/sdbrnn.html.

Measurement Evaluation
To evaluate model performance for RSA prediction, four widely-used measures were used: MAE, PCC, accuracy (ACC) and Matthews' correlation coefficient (MCC). The formulas are defined in Equations (1)-(4).
In Equations (1)-(2), RSAr i and RSAp i are the real and predicted rASA values of the i-th residue in the sequence, respectively, while RSAr and RSAp are the corresponding mean values. N is the length of the protein sequence to predict. MAE is used to evaluate the deviation between the predicted and real relative values of RSA, and PCC is employed to quantify the relationship between predicted and real values.
MAE and PCC were commonly used to measure continuous values, whereas ACC and MCC were often used to measure binary classification. Prediction accuracy is the true predicted residues divided by the length of predicted sequence. The MCC is a correlation coefficient between the observed and predicted binary classifications. MCC is able to recover the drawback of accuracy regarding unbalance data. In Equations (3)-(4), TP, FP, TN and FN represent the number of true positives (exposed residues), true negatives (buried residues), false positives and false negatives, respectively.

Performance on Relative Solvent-Accessible Area Prediction
RSA predictors that use machine learning methods are converted into regressive problems. Results from the SDBRNN on the CB502 and Manesh215 datasets are shown in Table 2. Five predictors, including SARpred [13], SVR [13], Real-SPINE [15], NetSurfP [50] and PredRSA [22], the results of which are state-of-the-art, were used for comparison. Our method achieved an MAE of 8.8% and a PCC of 75% on CB502, an MAE of 8.2% and a PCC of 78% on Manesh215 respectively, both of which are better than the state-of-the-art. Additionally, PCC values for the CB502 and Manesh215 datasets were 2% and 3% higher than that of PredRSA. To evaluate the generalization on individual sequences, the predicted MAEs of individual sequences from the Manesh215 dataset have been further analyzed in Figure 1. The x-axis represents the sequence length. The MAE values were mostly from 0.065-0.1. In general, prediction performances on long sequences are higher than those of short sequences.  Two publicly-available datasets (CASP10 and CASP11) were also used to validate SDBRNN. CASP10 contains 90 proteins, and CASP11 contains 72 proteins, with the lengths of all sequences between 50 and 600 residues. The PCC value for the CASP10 dataset was 74.2%, which was 3.2% better than that for PredRSA, and that for CASP11 was 74.3%, which was slightly better than SPIDER2 [23].
Another representative method (PSO-SVR [24]) was also compared. The maximum solventaccessible area standard used in that study was according to Adhamd's work [10], which used extended tripeptides (Ala-X-Ala). We re-prepared data using the standard and re-trained model. Experiments using our method showed MAE and PCC values of 12.0%, and 0.765, respectively, on the Manesh215 dataset, which were better than 13.2% and 0.74 achieved by PSO-SVR. Additionally, on the CB502 dataset, our method showed MAE and PCC values of 13.3% and 0.739, which were better than 14% and 0.73 obtained by PSO-SVR.

Comparison of Different Classification Predictors
There are many methods proposed for predicting binary classification (exposed or buried) of residues. Thus, we have also examined the performance of our method in terms of two-state predictions. Similar to the two-layer predictor strategy [37,38], firstly the predicted rASA values were generated by the SDBRNN model. Then, the predicted relative values were transformed into binary states with predefined thresholds for comparison. For instance, the residue with a rASA value that is greater than or equal to the threshold, it can be considered as exposed, otherwise, it is considered as buried. The predictive ability between methods was compared against SARpred [13], PR [29], SVR [28], two-stage SVR [27], SS-SVM [30] and PredRSA using the Manesh215 dataset. Table 3 shows the performances of different methods. Results indicated that the SDBRNN model performed better at RSA prediction and generalization, except for the threshold of 50%. Predicted accuracy on individual sequence from the Manesh215 dataset is analyzed in Figure 2 for evaluating the model generalization. The prediction accuracy of most proteins is between 75% and 90%. Only 14 sequences out of the 215 proteins have a prediction accuracy of less than 75%.
We also tested our model using three publicly-available datasets at different thresholds: CB502, Manesh215 and CASP10. ACC and MCC results are listed in Tables 4 and 5. To compare with PredRSA [22] in detail, its results are also listed in the tables. Our method showed mostly better accuracy on CB502 and Manesh215, except the threshold of 50%. At a threshold of 50%, the ACC associated with the PredRSA on the CB502 and Manesh215 datasets was 0.6% and 0.2% better than our method, respectively. The MCC results of our method are still better than PredRSA.

Residue-Specific Variation in Predictive Error
To evaluate the predictive performance for various types of residues, we calculated the average rASA values using the Manesh215 and CB502 datasets for all 20 types of amino acids. SDBRNN accurately predicted RSA using the CB502 dataset with an MAE <1%, except that phenylalanine (F) and tryptophan (W) were predicted with an MAE <1.3% (Figure 3). Most amino acids in the Manesh215 dataset were predicted with an MAE <1%, with tyrosine, histidine, tryptophan and phenylalanine predicted with <2% MAE.

Discussion
To improve the performance on different combinations of sequence-derived features, we validated five types of input variables using the TR7000 dataset for training and an independent test set (TS261). The performance of the method on different feature combinations was compared, with the results ( Table 6) indicating that all five features contributed to RSA prediction. When bidirectional information from a hidden node is merged in the BRNN , concatenation (concat) [51] and sum [47] are commonly used. In the SDBRNN, three merging operators ("concat", "sum" and "weighting sum") were used in different hidden layers. To assess the generalization of this BRNN model using hybrid merging operators, we compared the SDBRNN with a BLSTM using the "concat" operator (BLSTM_C), with an BLSTM using the "sum" operator (BLSTM_S) and with a unidirectional LSTM with 800 hidden nodes. Hyperparameters for the BLSTM_C and BLSTM_S were the same as those used for the SDBRNN, and rASA prediction was used to compare the different models. The results listed in Table 7 show that hybrid merging operators effectively promoted better predictive performance.

Datasets and Input Features
A large, non-homologous sequence dataset, produced using the PISCES server [52], was obtained. Structures exhibited <25% similarity and showed a resolution >3.0 Å, with R factors of 1.0. Three-dimensional structure files were downloaded from the RCSB Protein Data Bank. After removing redundancy in the test datasets using cd-hit [53], 7361 proteins (CullPDB7361) comprising 1,596,728 residues in proteins with sequence lengths between 50 and 600 residues were retained. Among these, 7000 proteins (TR7000) were used for training, and 261 proteins (TS261) and 100 proteins (VD100) were randomly selected for testing and validating the model. In addition to CullPDB7361, three public testing datasets, CB502, Manesh215 and CASP10, were used to evaluate the performance of our model. CB502 (82,420 residues) was selected from CB513 [54] and ordered by sequence length from long to short. The Manesh215 dataset [25] contains 47,243 residues. CASP10 from the Protein Sequence Prediction Center (http://predictioncenter.org/) has 123 domain fragments and extracts from 103 chains. The CASP10 dataset was selected according to protein identity and contained 20,778 residues from 90 proteins. The sequence lengths of the proteins in test datasets were all ≤600.
Five types of features, PSSM, protein sequence coding, conservation score, physical properties and physicochemical characteristics, were used as input features. PSSM-derived features have been widely used to perform protein-related predictions. PSSMs provide the effective frequency of occurrence of all 20 amino acid residues at each position of the sequence. PSSMs can be obtained by performing multiple sequence alignments on a large protein database (NCBI NR database) using PSI-BLAST [55]. PSI-BLAST involves an iterative search process in searching the profile of a query protein. The expectation value (E-value) and the number of iterations for PSI-BLAST are set to 0.001 and three, respectively. The PSSM profile is in the form of a 20 × L matrix where L is the length and each amino acid in the sequence is described by 20 features.
Physical properties [40] include: a steric parameter (graph-shape index), polarizability, volume (normalized van der Waals volume), hydrophobicity, isoelectric point, helix probability and sheet probability. Specific values were derived from the study [40]. Protein physicochemical characteristics [56] include the number of atoms, electrostatic charges and potential hydrogen bonds.
To ensure smooth transitions in changes to the network gradient, the above features were normalized using a logistic regression function y = 1/(1 + e −x ).
Residue conservation was derived from the amino acid frequency distribution in the corresponding column of a multiple-sequence alignment of homologous proteins. A one-dimensional conservation score was computed according to Quan's [41] previously described Equation (5): Commonly-used protein coding involves an orthogonal code. In addition to the 20 known residues, "X" represents unknown residues, and "NoSeq" represents non-protein sequences in our coding scheme. A 22 × L matrix represents a sequence, where L is the sequence length and a 22-dimensional (dim) vector represents a residue in the sequence. However, the 22-dimensional coding vector represents a parsed, one-hot vector, where only one of 22 elements is a non-zero value. For instance, for the residue "A", the coding is: "1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0". We adopted an embedding operation from natural-language processing to transform sparse sequence features into denser representations. This embedding operation was implemented as a simple auto-encoder (a feed-forward neural-network layer) along with an embedding matrix mapping a sparse 22-dimensional vector into a denser 22-dimensional vector. This transformation was just converted into a one-hot vector, which coded one residue into a dense vector.
In our scheme, an input residue was represented by 53-dimensional features: 20-dim PSSM, 7-dimensional physical properties, 3-dimensional physicochemical characteristics, 1-dimensional conservation score and 22-dimensional protein codings. These features are all derived from the protein sequence.
The rASA of a residue in a protein chain is calculated by dividing the accessible surface area derived from DSSP [44] by the maximum solvent accessibility. However, there is no standard for maximum solvent accessibility. According to previous results [22,25], Gly-X-Gly extended tripeptides were used.

BRNN and Merging Operator
For sequence data X = (x 1 , x 2 , x 3 ...x t−1 , x t , x t+1 ...x n ), where x i is context dependent and strongly reliant on forward and backward information. The label vector Y = (y 1 , y 2 , y 3 ...y t−1 , y t , y t+1 ...y n ) is the target output space. Compared to a forward neural network, the current output from the recurrent neural network will be reverted backward for subsequent time-specific inputs. The recurrent neural network structure can be described as Equation (6): At the time T = t, the recurrent network can remember the information from previous x 1 , x 2 , x 3 ...x t−1 and the present input x t . However, in many applications, the output y t might be dependent on the entire input sequence, as in protein sequences and handwriting-recognition examples. BRNN [57] combines an RNN that moves forward through time beginning from the start of the sequence along with another RNN that moves backward through time beginning from the end of the sequence. In this BRNN, time-increasing input is represented by − → f (x 1 , x 2 , x 3 , ..., x t ), and time-decreasing input is represented by ← − f (x t , x t+1 , ..., x n ). Bidirectional information is merged as the current node's output. Therefore, BRNN is more suitable for context-related applications, and it is capable of outperforming unidirectional recurrent neural network. The BRNN can be described as Equation (7): The BRNN structure is capable of efficiently learning sequence information [45]. Previous studies focused primarily on the BRNN methodology with little research on incorporating bidirectional-information merging. In a conventional neural network, pooling operations are computationally important, with the pooling layer responsible for accumulating convolutional results. The merging operator in a BRNN acts similarly to pooling operations. At time T = t, input forwarded to the current node is represented by − → f (x 1 , x 2 , x 3 , ..., x t ), and the backward input is represented by ← − f (x t , x t+1 , ..., x n ). Therefore, a common formula for the output of the current node is presented in Figure 4 and as follows: where Θ is the merging operator, Θ ∈ R, R = { , +, , ∞, max, min, avg, ...}. represents element-wise multiplication; + is the sum; is the element-wise weighting sum; ∞ is concatenation; and is the reshape operator. The merging operator can perform more computations than necessary when the information is merged via the aggregation operation. Figure 4. In order to remember more long-range information in the sequence study, when the past computing information and the future computing information are merged, the merging operator is proposed to execute the merging operation.

SDBRNN
Protein primary structure represents a strongly time-ordered conformation, suggesting that it contains adequate information for the protein to fold into its native conformation. To capture more features in the primary structure, we use three types of computing operators in the merging operation. During the initial period, the computing node in the BRNN needs to remember backward, forward and current input information; therefore, the merging operator "concat" is used in the first layer. In the second layer, the computing node no longer needs to remember previous and future information; however, the "sum" operator reinforces bidirectional input aggregation, with bidirectional information again aggregated in the third layer. Therefore, a "weighting sum" operator is used for filtering unnecessary information and extracting key features. The equations associated with these operations are as follows (9)- (11): The recurrent neural network accepts a time-stepping sequence that makes the network extremely deep, with the depth making the network difficult to train because of the exploding or vanishing gradient [46]. Long short-term memory (LSTM) [58], which consists of a variety of gate structures (a forget gate, an input gate and an output gate) and a memory cell are used to address the vanishing gradient problem. In the SDBRNN architecture, the unidirectional computing node is performed by LSTM. The widely-accepted LSTM definition was provided by Graves [59], and the details of LSTM are described in Equation (12), which involves no peep-hole connections, mainly to improve computing performance.
where σ is the sigmoid function; i, f , o and c are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the hidden vector h.
The following BRNN layer represents a multi-layer perceptron (MLP) network that reduces the network scale. The MLP layer with logistic activation ultimately outputs the prediction, with logistic regression used for fitting the relative surface-area value. The SDBRNN architecture is shown in Figure 5. The cross-entropy loss function is used for model training: whereŷ i are the predicted probabilities of secondary structure labels, y i represent ground-truth labels of the secondary structure and N is the number of input residues.ŷ i is a vector of output activations prior to normalization with the logistic function. These derivatives are then fed back through the network using back propagation through time to determine the weight gradient. In our model, rASA prediction is simulated as a regressive problem. With the predicted rASA values, the classifications are carried out in two states (buried and exposed). We have tried seven thresholds of 5, 10, 20, 25, 30, 40 and 50% in the two-state classification. Figure 5. SDBRNN architecture. The merging operator "concat" is used in the first BRNN layer. The "sum" and "weighting sum" operator are used in the second and third layer. Two multi-perception networks are connected to BRNN.

Model Hyperparameters
SDBRNNs have three BRNN layers and two MLP layers. The first BRNN layer has 300 nodes, with 500 nodes in the other two layers. The first MLP layer has 128 nodes, and the second has 60 nodes. The Adam optimizing function was used for training the network using default settings, with the default learning rate set to 0.0008. This was reduced by 50%, whereas the validation accuracy decreased by more than 10-times. The threshold of the learning rate was set to 0.0001.
A natural stopping policy was accepted while the validation stopped decreasing. To balance model performance between fitting and generalization, a regularization or penalty term was attached to the loss function (e.g., L1 or L2). Unlike other models, regularization was not attached, and our experiment displayed dropout efficiency [60]. Except for the classification layer, each layer was regularized according to a dropout (p = 0.5) to avoid overfitting. Our model was implemented in Keras, which is a public deep learning software based on Theano [61]. Weights in the SDBRNN were initialized with default values in Keras. The network was trained using a single NVIDIA Tesla K40 GPU with 12 GB memory. Training the model requires about 16.5 min per epoch, and it takes about 0.04 seconds to test one sequence.
Proteins shorter than 600 AA were padded with all-zero features. The outputs corresponding to padded inputs are labeled as 0.5 according to the logistic function. The advantage of padding proteins is that it enables training the model on a GPU in batches.

Conclusions
Accessible RSA is often used as an important measure in proteomics study for describing protein properties. Compared with traditional machine learning techniques, deep learning exhibits wider generalization and is applied in our study to predict RSA area. Deep neural networks are stacked by various complicated neural networks, and more generalized representation capacity can be obtained. However, the deep neural networks own enormous parameters and need more samples and computing resources to train models. Use of the merging operator was proposed for merging computations in the BRNN hidden layer. We redesigned BLSTM merging using three types of merging operators ("concat", "sum", and "weighting sum") in SDBRNN. Compared with the commonly-used BLSTM method using a single merging operator, as well as other predictors, SDBRNN captured more protein features and was more generalizable. Our results on test datasets verified this.
Relative solvent-accessible areas are greatly affected by maximum solvent-accessible areas, as more maximum solvent-accessible areas will lead to lower relative solvent-accessible areas and lower MAEs.
However, there are no standard maximum solvent accessible areas. In this work, a novel deep learning method (SDBRNN) was presented for the prediction of RSA area, as well as for binary state classification. Our methods can be applied to a broad range of problems within and outside of computational biology.
Author Contributions: Q.L. conceived of the study. B.Z. performed the experiments, analyzed the data and initially drafted the manuscript. L.L. collected the features. All authors contributed to the revision and approved the final manuscript.