Prediction of Protein Secondary Structure Based on WS-BiLSTM Model

Abstract: Protein secondary structure prediction is an important topic in bioinformatics. This paper proposes a novel model named WS-BiLSTM, which combines the wavelet scattering convolutional network and the long-short-term memory network for the first time to predict protein secondary structure. The model captures nonlocal interactions between amino acid sequences and remembers long-range interactions between amino acids. In our WS-BiLSTM model, the wavelet scattering convolutional network is used to extract protein features from PSSM sliding windows; the extracted features are combined with the original PSSM data as the input features of the long-short-term memory network to predict protein secondary structure. It is worth noting that the wavelet used by the scattering convolutional network, a member of the continuous wavelet family, is asymmetric. The Q3 accuracy on the test sets CASP9, CASP10, CASP11, CASP12, CB513, and PDB25 reached 85.26%, 85.84%, 84.91%, 85.13%, 86.10%, and 85.52%, which is 2.15%, 2.16%, 3.5%, 3.19%, 4.22%, and 2.75% higher, respectively, than using the long-short-term memory network alone. Comparing our results with state-of-the-art methods shows that the proposed model achieved better results on the CB513 and CASP12 data sets. The experimental results show that the features extracted by the wavelet scattering convolutional network can effectively improve the accuracy of protein secondary structure prediction.


Introduction
Proteins are essential components of organisms and carry out functions such as immunity and cellular signal transmission. The in-depth study of proteins benefits targeted disease research and drug development. With the completion of human genome sequencing, protein sequence databases have proliferated, and protein research has become an important task. Protein structure can be divided into primary, secondary, tertiary, and quaternary structures; among them, the tertiary structure is biologically active. Inspired by the great success in the fields of computer vision [1], speech recognition [2], and emotion classification [3], methods based on deep learning have been widely used in many biological research fields [4,5]. Examples include protein contact maps [6], drug-target binding affinity [7,8], chromatin accessibility [9], protein function [10,11], and using Support Vector Machines (SVM) to solve the problem of protein structure prediction [12]. The main advantage of deep learning methods is that they can automatically represent the original sequence and learn hidden patterns through nonlinear transformations [13].
With the continuous development of deep learning, various neural network structures have been applied to predict protein secondary structures.
Sønderby and Winther [14] used Long-Short-Term Memory networks (LSTM) to predict protein secondary structure on the CB513 data set, and Conover et al. [15] also used the LSTM model for protein model evaluation. McGuffin et al. [16] established the PSIPRED protein structure prediction server.

Protein Secondary Structure Prediction Model
The protein sequence is represented by the Position-Specific Scoring Matrix (PSSM) [23] produced by PSI-BLAST. Protein secondary structure prediction is completed by predicting the structure type corresponding to each amino acid residue of the protein sequence. Kabsch and Sander developed the Define Secondary Structure of Proteins (DSSP) algorithm, which classifies secondary structure into eight types: H (α-helix), G (3₁₀-helix), I (π-helix), E (β-strand), B (isolated β-bridge), T (turn), S (bend), and C (others) [24]. Protein secondary structure can be calculated with DSSP for proteins whose structures are already known, and it is usually applied to training data sets to establish class labels. In this experiment, G, H, and I are replaced by H; B and E are replaced by E; and the remaining types are assigned to C. It is worth noting that the test sets used in this article are not included in the training set.
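The eight-to-three state reduction just described can be sketched as a simple lookup (a minimal illustration; the function name and dictionary are ours, not from the paper):

```python
# Reduce DSSP 8-state labels to the 3-state scheme used in the paper:
# G, H, I -> H (helix); B, E -> E (strand); T, S, C (others) -> C (coil).
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C", "C": "C",
}

def reduce_to_q3(dssp_labels: str) -> str:
    """Map a string of 8-state DSSP labels to 3-state labels."""
    return "".join(EIGHT_TO_THREE.get(s, "C") for s in dssp_labels)
```

For example, `reduce_to_q3("GHIEBTSC")` yields `"HHHEECCC"`.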
This paper proposes the WS-BiLSTM model to predict the secondary structure of proteins. First, the data are preprocessed, and the PSSM matrix is divided using sliding windows of sizes 13 and 19, respectively. The windowed PSSM matrix is sent to the WS-BiLSTM model, where the wavelet scattering convolution network extracts protein features. The features obtained are combined with the PSSM sequence without sliding-window processing and sent to the long-short-term memory network in sequence mode for prediction. The prediction model is shown in Figure 1.
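The sliding-window division of the PSSM can be sketched as follows (an illustrative sketch only; the paper does not specify its padding scheme, so zero padding at the sequence ends is our assumption):

```python
def pssm_windows(pssm, window=13):
    """Split an L x 20 PSSM (list of 20-element rows) into one window
    per residue. Each window is centred on a residue and zero-padded
    at the sequence ends, giving a window x 20 block per position.
    Illustrative sketch; the padding scheme is an assumption."""
    half = window // 2
    length = len(pssm)
    zero_row = [0.0] * 20
    padded = [zero_row] * half + list(pssm) + [zero_row] * half
    return [padded[i:i + window] for i in range(length)]
```

With window sizes 13 and 19, a sequence of L residues yields L windows of shape 13 × 20 or 19 × 20, respectively.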

Data Set
In this paper, the CullPDB [25] data set is used as the training set of the model. The CullPDB data set contains 15,125 proteins with less than 25% sequence identity. We removed proteins that also appear in the test sets, leaving 14,199 proteins. The test sets include the CASP data sets CASP9, CASP10, CASP11, and CASP12 [26][27][28]. In addition, the CB513 [29] and PDB25 [30] data sets are also used as test sets; no protein sequence is shared between the test sets and the training set. The number of protein sequences in each test set is shown in Table 1.


Wavelet Scattering Convolution Network
In 2012, Mallat proposed a scattering operator based on the wavelet transform [22,31], using a wavelet scattering convolution network for texture segmentation and image feature extraction and achieving good results. The wavelet scattering convolutional network is a scattering decomposition network composed of complex-valued two-dimensional Morlet [32] filter banks and an isotropic, scale-invariant averaging. The Morlet wavelet is a continuous wavelet with time attributes and two essential parameters: scale and displacement. Therefore, projecting protein data into the wavelet scattering convolution network can extract protein features, at different scales and displacements, that include time attributes.

The scattering wavelet obtains a two-dimensional directional wavelet function by scaling and rotating a one-dimensional band-pass filter function µ. The multiresolution wavelet family is obtained by dyadic scaling and rotation of µ:

µ_α(x) = 2^(−2i) µ(2^(−i) r^(−1) x),  α = 2^i r ∈ Λ = 2^Z × D,

where i ∈ Z and r ∈ D (D is a discrete, finite rotation group); i determines the scale of µ(x) and r determines its direction. The wavelet transform that extracts the high-frequency information of a signal θ(x) then takes the form

W_α θ = θ * µ_α,

and the wavelet transform modulus is

|W_α θ| = |θ * µ_α|.

The wavelet extracts the high-frequency information of θ(x), and convolving with the scaling function ϕ_q(x) yields the low-frequency information W_q θ = θ * ϕ_q.

Thus, the modulus operator of the wavelet transform is

Uθ = {W_q θ, |W_α θ|}_{α ∈ Λ} = {θ * ϕ_q, |θ * µ_α|}_{α ∈ Λ}.

Applying the wavelet transform modulus operator to protein data loses some high-frequency information, reducing the amount of high-frequency protein information converted into low-frequency information; the extracted protein features are then incomplete, which affects the prediction results.

When we iterate W_α, the high-frequency information of the protein is dispersed along different paths e = {α_n}_{1≤n≤|e|}, giving the scattering propagator

U[e]θ = | ⋯ ||θ * µ_{α_1}| * µ_{α_2}| ⋯ * µ_{α_|e|} |.

The scattering operator is defined from W_q θ = θ * ϕ_q and W_α θ = |θ * µ_α|. Defining the sequence e = {α_n = (i_n, r_n)}_{1≤n≤|e|} (with i_n < q) as the wavelet transform path, the scattering operator on the set of all such paths is

S_q[e]θ = U[e]θ * ϕ_q.

When e = ∅ (the empty path), U[∅]θ = θ, and thus S_q[∅]θ = θ * ϕ_q.
The scattering operator can also be written as the collection over all paths, S_q θ = {S_q[e]θ}_e, which gathers the coefficients of every order. The wavelet scattering convolution network is a multilevel convolution network; the network flow is shown in Figure 2. The coefficients of ordinary convolution networks must be obtained by continuous learning, and the outputs appear only in the last layer. The wavelet scattering convolution network differs in that it has a fixed wavelet filter bank whose parameters do not need to be learned, and its output coefficients are distributed across every layer.
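As a toy illustration of this fixed-filter cascade, a single 1-D scattering level with Haar-like filters (our own minimal stand-in for the two-dimensional Morlet filter bank used in the paper) can be written as:

```python
def conv_valid(x, h):
    """Valid-mode 1-D convolution (correlation) of signal x with filter h."""
    n = len(h)
    return [sum(x[i + k] * h[k] for k in range(n)) for i in range(len(x) - n + 1)]

def scattering_level1(x):
    """One scattering level with toy Haar-like filters:
    S0 = x * phi (low-pass average) and S1 = |x * mu| * phi,
    where mu is a band-pass difference filter. The filters are fixed,
    not learned, mirroring the wavelet scattering network."""
    phi = [0.5, 0.5]   # low-pass (averaging) filter
    mu = [0.5, -0.5]   # band-pass (difference) filter
    s0 = conv_valid(x, phi)                   # zeroth-order coefficients
    u1 = [abs(v) for v in conv_valid(x, mu)]  # wavelet modulus |x * mu|
    s1 = conv_valid(u1, phi)                  # first-order coefficients
    return s0, s1
```

The modulus turns the oscillating band-pass response into a non-negative envelope, which the final averaging converts into stable low-frequency coefficients; deeper levels repeat the same modulus-and-average step along each path.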
To extract protein features using a wavelet scattering convolution network, we need to divide the PSSM into sliding windows of size 20 × N and adjust the scattering wavelet coefficients of different scales and directions to obtain PSSM features. Because the energy of the features extracted by the wavelet scattering convolution network is mainly distributed in the first three layers of the scattering network, we use the characteristics of the scattering operator to transform the high-frequency information of the first three layers into low-frequency information, collect and integrate the features of all directions and scales, and fuse them with the original PSSM features, which provides the input data for the long-short-term memory network.

Long-Short-Term Memory Network
Bidirectional Long-Short-Term Memory networks (Bi-LSTM) gradually developed from Recurrent Neural Network (RNN) and LSTM [33]. LSTM can remember the state of protein long-sequence data and solve possible problems such as gradient disappearance and gradient explosion, which RNN cannot solve. The long-short-term memory network structure is shown in Figure 3.


Bi-LSTM can be seen as a forward LSTM combined with a reverse LSTM, which take the beginning and the end of the amino acid sequence as input, respectively. After passing through the Bi-LSTM network, the amino acid features obtained are integrated, spliced, and then sent to the classifier. Bi-LSTM trains the amino acid features extracted by the wavelet scattering convolution network in sequence mode. Bi-LSTM can fully consider the information of amino acids and their interactions and improve prediction accuracy. The schematic diagram of Bi-LSTM training is shown in Figure 4.
LSTM adds three types of gate structures to the RNN: the forgetting gate, the input gate, and the output gate. This design allows amino acid information in the cell state to be selectively forgotten, new amino acid information to be selectively recorded in the cell state, and previous amino acid information to be preserved in the hidden layer. The schematic diagram of the LSTM unit is shown in Figure 5.
The output d_t of the forgetting gate is obtained by activating the hidden state C_{t−1} of the previous step together with the amino acid data X_t of the current step. The input gate consists of two parts: the first uses the sigmoid activation function, with output i_t; the second uses the tanh activation function, with output e_t, and the product of the two updates the cell state. Both the forgetting gate and the input gate act on the cell state A_t to update the LSTM amino acid cell state. The hidden state C_t of this step is computed from the output gate state b_t and the cell state A_t. The mathematical expression of the LSTM unit follows:

d_t = ρ(Z_d X_t + Y_d C_{t−1} + m_d) (9)
i_t = ρ(Z_i X_t + Y_i C_{t−1} + m_i) (10)
e_t = tanh(Z_e X_t + Y_e C_{t−1} + m_e) (11)
b_t = ρ(Z_b X_t + Y_b C_{t−1} + m_b) (12)
A_t = d_t ⊙ A_{t−1} + i_t ⊙ e_t (13)
C_t = b_t ⊙ tanh(A_t) (14)

where Z is the input weight, Y is the recursive weight, m is the bias, d represents the forgetting gate, e represents the candidate gate, i represents the input gate, b represents the output gate, ρ represents the sigmoid activation function, and ⊙ denotes element-wise multiplication. Formula (9) gives the output of the forgetting gate, and Formula (10) gives the input gate, which decides which values to update. The results of Formulas (10) and (11) are multiplied and then added to the scaled previous cell state to update A_t, as shown in Formula (13). Finally, the output gate determines the final output, which yields the hidden state of the current unit C_t, as shown in Formula (14).
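Formulas (9)-(14) can be sketched for a single unit as follows (an illustrative sketch with scalar weights for readability; the real model uses weight matrices over 1000 hidden units, and the function and parameter names are ours):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, c_prev, a_prev, w):
    """One LSTM step following Formulas (9)-(14). w maps each gate
    g in {d, i, e, b} to its (Z_g, Y_g, m_g) triple; notation follows
    the paper: d forget gate, i input gate, e candidate, b output
    gate, A cell state, C hidden state."""
    def gate(name, act):
        z, y, m = w[name]
        return act(z * x_t + y * c_prev + m)
    d_t = gate("d", sigmoid)        # (9)  forget gate
    i_t = gate("i", sigmoid)        # (10) input gate
    e_t = gate("e", math.tanh)      # (11) candidate values
    b_t = gate("b", sigmoid)        # (12) output gate
    a_t = d_t * a_prev + i_t * e_t  # (13) new cell state
    c_t = b_t * math.tanh(a_t)      # (14) new hidden state
    return c_t, a_t
```

Running the step over a sequence in both directions and concatenating the two hidden states gives the Bi-LSTM representation described above.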
Finally, the neurons in the fully connected layer of the Bi-LSTM, connected to the neurons of the previous layer, are classified by the Softmax classifier, which calculates the probability that each amino acid belongs to the C, E, or H structure to complete the classification.

Result Evaluation
Protein secondary structure prediction is usually evaluated by the Q3 accuracy [17], which is simply the percentage of residues in the protein sequence whose secondary structure state is correctly predicted. The Segment Overlap score (SOV) [34] is also widely used to evaluate protein secondary structure prediction. Following DSSP, G, H, and I are transformed into H; E and B into E; and other structures into C. The calculation formula follows:

Q3 = (S_C + S_E + S_H) / S × 100%,

where S_C represents the number of class C residues accurately predicted, S_E represents the number of class E residues accurately predicted, S_H represents the number of class H residues accurately predicted, and S represents the total number of amino acids. The accuracy of each secondary structure class is calculated as

Q_i = S_i / A_i × 100%,

where Q_i represents the proportion of amino acids accurately predicted in state i and A_i represents the total number of amino acids in state i. SOV is a measure based on the ratio of overlapping segments between the predicted and observed results. Denote an observed segment by S_m and a predicted segment by S_n, and let S_mn be the set of segment pairs (S_m, S_n) in the same state that overlap by at least one residue. The length of S_m is written len(S_m); max(S_m, S_n) denotes the length of the union of each segment pair, and min(S_m, S_n) the length of their intersection. Based on this, SOV is calculated as

SOV = 100 × (1/S) Σ_{(S_m, S_n) ∈ S_mn} [ (min(S_m, S_n) + ρ(S_m, S_n)) / max(S_m, S_n) ] × len(S_m),

where the term ρ allows variation at the edges of protein structure segments, and ρ(S_m, S_n) conforms to the following definition:

ρ(S_m, S_n) = min{ max(S_m, S_n) − min(S_m, S_n); min(S_m, S_n); ⌊len(S_m)/2⌋; ⌊len(S_n)/2⌋ }.
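Read literally, the Q3 formula amounts to a per-residue accuracy over the three states (a minimal sketch; the function name is ours):

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Q3: percentage of residues whose 3-state label (C/E/H) is
    predicted correctly, i.e. (S_C + S_E + S_H) / S * 100."""
    assert len(predicted) == len(observed) and observed
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)
```

For example, `q3_accuracy("HHEC", "HHEE")` gives 75.0, since three of the four residues match.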

Hyperparameters of Wavelet Scattering Convolutional Network
In order to evaluate the accuracy of the proposed model and verify the effectiveness of the wavelet scattering convolutional network for feature extraction, two separate experiments were set up to predict protein secondary structure: the first used the WS-BiLSTM model, and the second used the Bi-LSTM model alone. The sliding window sizes in this article were 13 and 19, respectively. For the hyperparameters of the wavelet scattering convolutional network, we set up a comparative test on its image invariance scale, hoping to find the scale that yields the best experimental results. The experimental results are shown in Tables 2 and 3. It can be seen from Tables 2 and 3 that when the sliding window size is 19, the accuracy rate is higher, because the larger the sliding window, the more protein feature information can be obtained. When the sliding window is 19, the accuracy rate trends upward as the image invariance scale increases, reaching its maximum when the scale increases to 19. In the process of extracting features with the wavelet scattering convolutional network, the protein feature data obtains 19 × 19 spatial support through the scaling filter specified by the scale of 19, thereby obtaining the theoretically most extensive protein features and the highest accuracy.
For the hyperparameters of Bi-LSTM, we adjusted the number of Bi-LSTM layers, hidden units, model learning rate, regularization coefficient, and dropout value. The experiment tuning process uses CASP10 as the test set.

Network Layers and Hidden Units
In Bi-LSTM, the number of layers and the number of hidden units affect the experimental results. An unreasonable number of hidden units leads to overfitting, and too many Bi-LSTM layers lead to network bloat and excessive time consumption. In the network structure design, we want reasonable complexity and good generalization under an appropriate training time; therefore, we adjust the number of network layers and hidden units. The experimental results are shown in Tables 4 and 5. Comparing Tables 4 and 5, we find that Table 5 shows a higher accuracy rate than Table 4, so a higher number of layers yields higher accuracy. However, Table 4 shows that the number of hidden units still has a significant influence on the accuracy, so the network takes the output of each time step of the first layer as the input of the corresponding time step of the next layer for deep-level feature learning. The experimental results show that the optimum is reached with 1000 hidden units in the first layer and 1000 hidden units in the second layer. Considering the feasibility of the experiment and the complexity of the model, the final network structure is determined as two Bi-LSTM layers with 1000 hidden units each.

Learning Rate
The Adam optimization algorithm [17] is used in the network model. Compared with traditional stochastic gradient descent, Adam updates variables according to filtered averages of the historical gradients, damping their oscillation, which makes it suitable for optimization problems with large-scale data and parameters. Because the initial value of the learning rate significantly impacts the results, this group of experiments determines an appropriate initial learning rate: as shown in Table 6, a rough adjustment is first carried out in 10-fold steps, and a fine adjustment is then carried out at the appropriate order of magnitude. The final learning rate is determined to be 0.0004, with which the result reaches 84.04%.
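A single Adam parameter update can be sketched as follows, with the paper's final learning rate of 0.0004 as the default; the remaining hyperparameters are the common Adam defaults, assumed here since the paper does not list them:

```python
import math

def adam_step(w, grad, m, v, t, lr=0.0004, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single parameter w at step t (t >= 1).
    m and v are exponentially filtered averages of the gradient and
    squared gradient, which smooth out oscillation in the history."""
    m = b1 * m + (1 - b1) * grad       # filtered gradient (first moment)
    v = b2 * v + (1 - b2) * grad ** 2  # filtered squared gradient (second moment)
    m_hat = m / (1 - b1 ** t)          # bias correction for the warm-up phase
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```

The division by the filtered squared gradient gives each parameter its own effective step size, which is what makes the initial learning rate the dominant hyperparameter to tune.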

Dropout
Due to the two-layer Bi-LSTM design adopted in this paper, many hidden units were added. In order to avoid overfitting caused by excessive training, we introduced a dropout layer after each Bi-LSTM layer to partially inactivate neurons. We set the two dropout layers to the same value for synchronous adjustment. The experimental results are shown in Table 7; the model works well when both dropout values are set to 0.2.

Regularization
In the long-short-term memory network, L2 regularization can further prevent the model from overfitting. L2 regularization adds a penalty term to the loss function that is proportional to the sum of the squares of the elements of the weight vector (the squared L2 norm). First, a rough adjustment is made in 10-fold steps, and then a fine adjustment is carried out after the range is determined. The experimental results are shown in Table 8. L2 regularization dramatically improves the experimental results: when the L2 coefficient is 0.000006, the result reaches 85.84% on CASP10.
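The penalty can be sketched in one line (an illustrative sketch; the function name is ours, `lam` defaults to the paper's best value of 6 × 10⁻⁶, and the actual base loss is the network's classification loss):

```python
def l2_regularized_loss(base_loss, weights, lam=0.000006):
    """Add an L2 penalty, lam * sum of squared weights, to the loss.
    Larger lam shrinks the weights harder and reduces overfitting."""
    return base_loss + lam * sum(w * w for w in weights)
```

During training, the gradient of this penalty simply adds `2 * lam * w` to each weight's gradient, which is why L2 regularization is also known as weight decay.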

Discussion
From Tables 2-8, the parameter settings in WS-BiLSTM have a significant impact on the experimental results. It can be seen from Tables 4 and 5 that as the number of hidden units increases, the Q3 accuracy also increases, by about 3%. At the same time, the two-layer network is better than the single-layer network, improving accuracy by 0.8%. In order to balance the complexity and performance of the model, we did not further increase the number of hidden units or network layers. Compared with Tables 4 and 5, the improvement in Q3 accuracy in Table 6 is also significant: based on the optimized number of hidden units and model layers, tuning the learning rate adds 1.56%. Different dropout values are set in Table 7, but they yield almost no improvement over the former. The optimization of the regularization in Table 8 increased the results again, by 1.42%.
Last but not least, the adjustment of the parameters of the wavelet scattering convolutional network is also essential. It can be seen from Tables 2 and 3 that as the sliding window increases, the experimental results generally increase by about 1%. This is because the field of view over which the wavelet scattering convolutional network extracts local interactions between amino acids is increased. At the same time, when the image invariance scale is increased to 19, the Q3 accuracy takes another leap, and the experimental results increase by about 1% again. A larger image invariance scale captures more local correlation between amino acid residues. Table 9 shows the results of three-fold cross-validation on the CullPDB data set: the data set is divided into three parts, two used for training and one for testing. The average accuracy of the three-fold cross-validation reaches 81.84%. It is worth noting that the optimized parameters derived in Sections 3.1-3.5 are used to perform the cross-validation. When we use only a long-short-term memory network to predict protein structure, the results are shown in Table 10. Table 11 lists the Q3 accuracy of the model on CASP9, CASP10, CASP11, CASP12, CB513, and PDB25. Comparing Tables 10 and 11, we find that the WS-BiLSTM model, which integrates the wavelet scattering convolutional network and the long-short-term memory network, is much better than using the long-short-term memory network alone. The Q3 accuracy on the test sets CASP9, CASP10, CASP11, CASP12, CB513, and PDB25 reached 85.26%, 85.84%, 84.91%, 85.13%, 86.10%, and 85.52%, which is 2.15%, 2.16%, 3.5%, 3.19%, 4.22%, and 2.75% higher, respectively, than using the long-short-term memory network alone. The reason may be that the Bi-LSTM model alone cannot capture the most comprehensive distance-dependent information.
It can be seen from Tables 4 and 5 that when the Bi-LSTM network is double-layered and the number of hidden units is greater or less than 1000, Bi-LSTM cannot capture more information about the residues in the protein sequence. Moreover, our WS-BiLSTM model, with the robust feature extraction ability of the wavelet scattering convolutional network, can extract more local correlations between amino acid residues and combine with Bi-LSTM to extract richer amino acid characteristics. This paper selects five prediction methods, RaptorX-SS8 [35], PSIpred [16], Jpred [36], DeepCNF [19], and MUFOLD-SS [18], which are compared with this model on CASP10, CASP11, CASP12, and CB513. RaptorX-SS8 uses conditional neural fields, PSIpred uses two layers of feedforward neural networks, Jpred uses two layers of artificial neural networks from the SNNS neural network package, DeepCNF is a combination of deep neural networks and conditional neural fields, and MUFOLD-SS uses a deep neural network. The comparison results are shown in Table 12 and Figure 6, which show the performance comparison between WS-BiLSTM and the five methods. The comparison on the same test sets shows the competitiveness of WS-BiLSTM. Since the five methods did not report results on all test sets, some results are unknown. For a fair comparison, the same CullPDB data set was used to train and test WS-BiLSTM in the same way as for existing methods reported in previous publications. From Table 12, it can be found that WS-BiLSTM performs better than RaptorX-SS8, PSIpred, and Jpred in all cases, and by a large gap. Therefore, we focus on the differences between WS-BiLSTM, DeepCNF, and MUFOLD-SS. DeepCNF is a deep learning extension of Conditional Neural Fields (CNF), which integrates Conditional Random Fields (CRF) and external neural networks.
DeepCNF can model complex sequence-structure relationships through a deep hierarchical architecture and interdependencies between adjacent labels. In contrast, WS-BiLSTM integrates a wavelet scattering convolutional network and a long-short-term memory network: the wavelet scattering convolutional network extracts features while modeling the interdependence between adjacent labels, and the long-short-term memory network constructs a complex sequence-structure relationship similar to DeepCNF. Nevertheless, as seen from the comparison results, WS-BiLSTM better handles the relationship between the two. MUFOLD-SS uses a smaller convolution window than popular neural networks; however, through the stacking of deep convolution blocks, the network can also represent local and global contexts while maintaining efficient computation. Although WS-BiLSTM is slightly behind on CASP10 and CASP11, because WS-BiLSTM has a stronger ability to extract high-level features, the nonlocal interaction of residues gives WS-BiLSTM a significant lead on CASP12.

Conclusions
Compared with using only the long-short-term memory network, our model shows a significant improvement in Q3 accuracy on the CASP9, CASP10, CASP11, CASP12, CB513, and PDB25 data sets. Compared with RaptorX-SS8, PSIpred, Jpred, and DeepCNF, it improved the results on the CASP10, CASP11, CASP12, and CB513 data sets; compared with MUFOLD-SS, it also obtained good results on the CASP12 data set. Mallat [14] and others characterized three properties that deep learning architectures possess for extracting valuable features from data: multiscale contractions, linearization of hierarchical symmetries, and sparse representations. The wavelet scattering convolutional network exhibits all of these properties. It linearizes small deformations between amino acid features, such as dilation, by separating variations at different scales. Combining the characteristics of wavelets and deep learning makes the features extracted by the wavelet scattering convolutional network more competitive in subsequent classification. In the WS-BiLSTM model designed in this paper, data features flow from partial amino acid sequences to the complete sequence, which allows us to obtain both the local characteristic information of the protein and the general information of the protein sequence and makes it more likely to improve prediction accuracy. At the same time, the experimental results show that the model proposed in this paper is effective in predicting protein secondary structure.
In future work, we will apply WS-BiLSTM to predict other protein structure-related properties, such as backbone torsion angles, solvent accessibility, and protein order/disorder region. These predicted features are helpful for protein tertiary structure prediction and protein model quality assessment.
Author Contributions: Y.G., Y.Z. and Y.L. designed the network model structure; Y.G. carried out the experiment; Y.G. and Y.M. wrote the paper. All authors have read and agreed to the published version of the manuscript.