A Model Stacking Framework for Identifying DNA Binding Proteins by Orchestrating Multi-View Features and Classifiers

Nowadays, various machine learning-based approaches using sequence information alone have been proposed for identifying DNA-binding proteins, which are crucial to many cellular processes, such as DNA replication, DNA repair and DNA modification. Among these methods, building a meaningful feature representation of the sequences and choosing an appropriate classifier are the two most critical tasks. Disclosing the significance and contributions of different feature spaces and classifiers to the final prediction is of the utmost importance, not only for the prediction performance, but also for the practical clues it offers to the design of biological experiments. In this study, we propose a model stacking framework that orchestrates multi-view features and classifiers (MSFBinder) to investigate how to integrate and evaluate loosely-coupled models for predicting DNA-binding proteins. The framework integrates four multi-view features, Local_DPP, 188D, Position-Specific Scoring Matrix (PSSM)_DWT and the autocross-covariance of secondary structures (AC_Struc), which were extracted from evolutionary information, sequence composition, physiochemical properties and predicted structural information, respectively. These features are fed into loosely-coupled classifiers such as SVM and Random Forest. A logistic regression model is then applied to evaluate the contributions of the individual classifiers and to make the final prediction. On the training dataset PDB1075, the proposed method achieves an accuracy of 83.53%. On the independent dataset PDB186, it achieves an accuracy of 81.72%, outperforming many existing methods. These results suggest that the framework is able to orchestrate various prediction models flexibly with good performance.


Introduction
DNA-binding proteins play a vital role in many biological processes, for instance DNA replication, transcriptional regulation, DNA repair and DNA modification [1]. Developing computational methods to identify these proteins is highly desirable, because experimental techniques such as filter binding assays [2] and X-ray crystallography [3] are time-consuming and costly. So far, many machine learning-based computational methods have been proposed to identify these proteins [4]. Building a meaningful feature set and choosing an appropriate classification algorithm are the two most critical tasks [5].
Building a feature set usually involves two steps: feature extraction and feature transformation. However, there is no clear boundary between these two terms, so to eliminate confusion, we refer to both simply as feature extraction in this work. In this study, we stack SVM, Random Forest (RF) and Naive Bayes (NB) base models. After plotting and T-test analysis of the LR coefficients on both the training and independent test datasets, we find that RF prefers the PSSM_DWT features, SVM prefers the Local_DPP features, and both kinds of features play significant roles in the prediction of DNA-binding proteins. It is noteworthy that the NB classifier is more stable across all the features, although its coefficients are not very high. Meanwhile, all three predictors outperform the single classifiers, and there is no big performance difference between homogeneous and heterogeneous base models. Finally, we show that the overall performance of MSFBinder exceeds that of the state-of-the-art existing methods on the independent testing dataset.

Materials and Methods
The framework of the presented method is illustrated in Figure 1. It consists of two stages: training and predicting phases. In the training phase, the training samples are transformed into feature vectors by four feature extraction methods. Then, base models are trained on the four feature spaces independently. The predicted probabilities are fed into the LR classifier to make the final prediction. In the prediction phase, the testing samples are transformed into feature vectors using the same feature extraction methods. Then, the models previously trained are used to predict DNA-binding proteins.
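As a rough sketch (not the paper's actual code), the two phases can be expressed with scikit-learn as follows; the four feature views, data and model settings here are synthetic stand-ins for Local_DPP, 188D, PSSM_DWT and AC_Struc:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for the four feature views and the labels.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 100)
views = [rng.normal(size=(100, d)) + y[:, None] * 0.5 for d in (20, 30, 40, 10)]

# Training phase: fit one base model per feature view and collect its
# predicted probabilities as meta-features.
base_models = [SVC(probability=True, random_state=0).fit(X, y) for X in views]
meta_X = np.column_stack(
    [m.predict_proba(X)[:, 1] for m, X in zip(base_models, views)]
)

# A logistic regression meta-learner combines the base-model outputs.
meta = LogisticRegression().fit(meta_X, y)

# Prediction phase: apply the same feature extraction and the trained models.
test_meta = np.column_stack(
    [m.predict_proba(X[:5])[:, 1] for m, X in zip(base_models, views)]
)
pred = meta.predict(test_meta)
```

A production stacking setup would usually build the meta-features from out-of-fold predictions rather than in-sample probabilities, to keep the meta-learner from overfitting.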

Datasets
Two benchmark datasets PDB1075 [18] and PDB186 [21] were used to evaluate the performance of the proposed method. PDB1075 contains 525 DNA-binding proteins and 550 non-DNA-binding proteins, and PDB186 contains 93 DNA-binding proteins and 93 non-DNA-binding proteins. The details of the two datasets are shown in Table 1. The protein sequences in the benchmark datasets were derived from the Protein Data Bank (http://www.rcsb.org/pdb/home/home.do). We used the PDB1075 dataset for training the models and the PDB186 dataset for the independent test. Protein sequences in the benchmark datasets were selected according to the following 3 criteria: (1) the length of protein sequences is not less than 50; (2) the protein sequence cannot contain unknown residues such as X; (3) the sequence similarity between any two proteins is less than 25%.

Feature Extraction
Four kinds of feature extraction methods were used. They were Local_DPP, PSSM_DWT, AC_struct and 188D. Their details are explained below.

Features Based on Evolutionary Information
The sequence evolutionary information can be constructed from the Position-Specific Scoring Matrix (PSSM). It has been applied in many similar works, such as protein fold recognition [34], ATP binding site prediction [6], bacteriophage virion identification [35] and aptamer-protein interacting pair prediction [33]. Local_DPP extracts local evolutionary information from the PSSM matrix by dividing the matrix into n parts, and PSSM_DWT gains discriminatory information from the PSSM matrix by compressing the matrix using the Discrete Wavelet Transform (DWT) method.
Suppose that the number of amino acids in a given protein sequence is L; the PSSM is then the L × 20 matrix

P = (p_{i,j}), i = 1, …, L; j = 1, …, 20,

where the rows correspond to the positions in the sequence and the columns correspond to the twenty types of amino acids. The values in the i-th row represent the probabilities of the i-th position being each of the 20 amino acid types. The PSSM matrix is normalized by the following formula:

p′_{i,j} = (p_{i,j} − p̄_j) / S_j,

where p_{i,j} is the original value in the PSSM and p̄_j and S_j are the mean and standard deviation of the j-th column. The PSSM was obtained from the PSI-BLAST [36] program with 3 iterations and an E-value of 0.001.
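The column-wise z-score normalization can be sketched in NumPy (the helper name and toy matrix are ours; the raw PSSM itself still comes from PSI-BLAST):

```python
import numpy as np

def normalize_pssm(pssm):
    """Subtract each column's mean and divide by its standard deviation
    (each of the 20 columns corresponds to one amino acid type)."""
    pssm = np.asarray(pssm, dtype=float)
    mean = pssm.mean(axis=0)
    std = pssm.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    return (pssm - mean) / std

# Toy 5-residue x 20-amino-acid matrix; real values come from PSI-BLAST.
norm = normalize_pssm(np.arange(100, dtype=float).reshape(5, 20))
```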

Local_DPP
Local_DPP [22] extracts local evolutionary features from the PSSM matrix. Firstly, it partitions the PSSM matrix into n segments. The length of each of the first (n − 1) segments is ⌊L/n⌋, and the length of the n-th segment is L − (n − 1) × ⌊L/n⌋. The following formulas are applied to extract features for each segment:

F_j(k) = (1/length(k)) · Σ_{i=1}^{length(k)} p_{i,j}, j = 1, …, 20 (Part 1)

φ_{ξ,j}(k) = (1/(length(K) − ξ)) · Σ_{i=1}^{length(K)−ξ} (p_{i,j} − p_{i+ξ,j})², ξ = 1, …, λ (Part 2)
where k represents the k-th segment, length(k) represents the length of the k-th segment, p_{i,j} represents the normalized value in the PSSM matrix and length(K) denotes the minimum length of the segments. In the k-th segment, F_j(k) represents the average probability of the j-th type of amino acid occurring at each position of the segment during the evolutionary process, and φ_{ξ,j}(k) denotes the average correlation for the j-th type of amino acid between two residues separated by ξ. Part 1 contains the local evolutionary information, and Part 2 contains the sequence order information.
Hence, for each segment, we can obtain 20D features from Part 1 and 20 × λ D features from Part 2. With n segments in total, we gain 20 × (1 + λ) × n D local features.
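Assuming the squared-difference form of the correlation factor (our reading of the description above) and, for simplicity, averaging Part 2 over each segment's own length rather than the minimum segment length, the feature count can be checked with a short sketch:

```python
import numpy as np

def local_dpp(pssm_norm, n=2, lam=2):
    """Sketch of Local_DPP: split the normalized L x 20 PSSM into n segments,
    then collect per-segment column averages (Part 1) and lag-xi
    squared-difference correlations (Part 2, an assumed form)."""
    L = pssm_norm.shape[0]
    seg_len = L // n
    feats = []
    for k in range(n):
        start = k * seg_len
        end = start + seg_len if k < n - 1 else L
        seg = pssm_norm[start:end]
        feats.extend(seg.mean(axis=0))        # Part 1: 20 values per segment
        for xi in range(1, lam + 1):          # Part 2: 20 * lam values
            diff = seg[:-xi] - seg[xi:]
            feats.extend((diff ** 2).mean(axis=0))
    return np.array(feats)

toy = np.random.default_rng(1).normal(size=(60, 20))
vec = local_dpp(toy, n=2, lam=2)  # expected size: 20 * (1 + 2) * 2 = 120
```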
PSSM_DWT

PSSM_DWT [32] extracts features from the given sequences by DWT [37]. DWT discretely samples the wavelets and grasps both the frequency and location information [38]. When applied to the given protein sequence, DWT decomposes it into a list of coefficients at different dilations and then removes noise from the sequence profiles. According to Nanni et al. [39,40], DWT can be defined as:

y_{j,low}[n] = Σ_{k=1}^{N} x[k] · g[2n − k]

y_{j,high}[n] = Σ_{k=1}^{N} x[k] · h[2n − k]

where N denotes the length of the discrete signal, x[n] denotes the discrete signal, g denotes the low-pass filter and h the high-pass filter. y_{j,low}[n] represents the approximation coefficients of the low-frequency part of the signal, and y_{j,high}[n] the detail coefficients of the high-frequency part. As the decomposition process is repeated, the frequency resolution and the approximation coefficients can be further refined. In this study, we employed 4 levels of DWT; the schematic diagram is shown in Figure 2. As shown in Figure 2, at each level of DWT, the high-frequency band and the low-frequency band of the discrete signal are separated, and the maximum, minimum, average and standard deviation of each band are calculated. For a given amino acid, 13 × 4 = 52D features are obtained after the DWT transformation. Finally, we gain 52 × 20 = 1040D features for the 20 types of amino acids.
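To illustrate the band-splitting idea only, here is a hand-rolled one-level Haar DWT applied recursively; this toy decomposition yields fewer bands than the 13 used above and is not the paper's exact scheme:

```python
import numpy as np

def haar_dwt(signal):
    """One Haar DWT level: pairwise averages -> low band, differences -> high band."""
    s = np.asarray(signal, dtype=float)
    if len(s) % 2:
        s = np.append(s, s[-1])  # pad odd-length signals
    low = (s[0::2] + s[1::2]) / np.sqrt(2)
    high = (s[0::2] - s[1::2]) / np.sqrt(2)
    return low, high

def band_stats(band):
    """Maximum, minimum, average and standard deviation of one band."""
    return [band.max(), band.min(), band.mean(), band.std()]

# 4-level decomposition of one PSSM column (synthetic signal).
signal = np.sin(np.linspace(0.0, 8.0, 64))
feats = []
for _ in range(4):
    signal, high = haar_dwt(signal)  # signal becomes the next approximation
    feats += band_stats(high)        # detail-band statistics at this level
feats += band_stats(signal)          # statistics of the final approximation
```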

Features Based on Sequence Composition and Physiochemical Properties
The 188D [31] method extracts sequence features according to the Composition (C), Distribution (D), bivalent Frequency (F) and physiochemical properties of amino acids and uses a 188-dimensional vector to encode the raw sequence. The first 20 dimensions were obtained by calculating the occurrence frequency of each amino acid. Subsequently, the amino acids were divided into three categories for each of the eight physiochemical properties of proteins; see Table 2 [41-44]. For each property, three dimensions encoded the occurrence frequencies of the three categories, and three dimensions encoded the occurrence frequencies of the three bivalent classes in which the two amino acids came from different categories. The next fifteen dimensions were obtained by calculating, for each of the three categories, the distribution positions of the first occurrence and of the occurrences at 25%, 50%, 75% and 100% along the sequence. For all eight properties, there were (3 + 3 + 15) × 8 = 168 features. Finally, a 168 + 20 = 188-dimensional feature vector was used to represent a protein sequence.
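The first 20 dimensions (amino acid composition) are straightforward to compute; this sketch covers only that step, not the full 188D encoder:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition_20d(seq):
    """First 20 dimensions of the 188D vector: per-amino-acid frequency."""
    counts = Counter(seq)
    return [counts.get(a, 0) / len(seq) for a in AMINO_ACIDS]

vec = composition_20d("ACDEACDG")  # toy sequence, far below the 50-residue minimum
```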

AC_struct
The predicted secondary structural information of the protein sequences was derived from PSIPRED [45], a state-of-the-art secondary structure prediction algorithm. For each residue in a given sequence, PSIPRED provides the probability profile of three secondary structure states (helix, strand and coil).
For a given sequence of length L, the L × 3 probability matrix is defined as S = (s_{i,j}), with i = 1, …, L and j = 1, 2, 3,
where s_{i,j} denotes the probability that the i-th residue belongs to the j-th type of secondary structure. In order to obtain a fixed-length vector for each protein sequence from the above matrix, we employ the auto-covariance transformation [46].
The Auto-Covariance (AC) is defined as:

AC(j, λ) = (1/(Lmin − λ)) · Σ_{i=1}^{Lmin−λ} (s_{i,j} − s̄_j)(s_{i+λ,j} − s̄_j),

where s̄_j denotes the average probability of the j-th type of secondary structure, λ denotes the distance between two amino acids in the given sequence and Lmin denotes the minimum length of the protein sequences, which is set to 50 in this work. The auto-covariances within 50 amino acids of a protein sequence produced 49 × 3 = 147-dimensional features. The average probabilities and standard deviations of the three types of secondary structures made up the next 6 dimensions. Finally, a total of 153 features were obtained.
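A sketch of the auto-covariance features under the definitions above (the function name and the synthetic probability matrix are ours):

```python
import numpy as np

def ac_struct(probs, l_min=50):
    """Auto-covariance features from an L x 3 secondary-structure probability
    matrix: lags 1..l_min-1 per state, plus each state's mean and std."""
    S = np.asarray(probs, dtype=float)[:l_min]
    mean = S.mean(axis=0)
    centered = S - mean
    feats = []
    for lag in range(1, l_min):
        # Average product of centered values separated by `lag` positions.
        feats.extend((centered[:-lag] * centered[lag:]).mean(axis=0))
    feats.extend(mean)          # 3 average probabilities
    feats.extend(S.std(axis=0)) # 3 standard deviations
    return np.array(feats)

toy = np.random.default_rng(2).dirichlet(np.ones(3), size=60)  # rows sum to 1
vec = ac_struct(toy)  # 49 lags x 3 states + 3 means + 3 stds = 153 features
```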
The numbers of features produced by the four extraction methods are summarized in Table 3.

Support Vector Machine
SVM [47] constructs a hyper-plane in a high-dimensional or infinite-dimensional space and can be employed for classification, regression and other tasks. SVM was originally used to solve two-class problems and was later extended to multi-class problems. For two-class problems, SVM maps the data to a high-dimensional feature space to find the best hyper-plane. In this study, we select the Radial Basis Function (RBF) as the kernel function, since it has proven very effective in many applications such as protein-ATP binding site prediction [6]. The grid search approach was applied to find the optimal capacity parameter C and kernel width g.
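A grid search over C and the RBF width with scikit-learn might look like this; the dataset and the grid values are illustrative, not those used in the paper:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data; in the paper the inputs are the per-view protein features.
X, y = make_classification(n_samples=120, n_features=20, random_state=0)

# Grid over the capacity parameter C and the RBF kernel width ("g"/gamma).
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    cv=5,
)
grid.fit(X, y)
```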

Random Forest
Random Forest, a type of ensemble machine learning algorithm, performs majority voting over multiple decision trees, each of which is independently constructed from a bootstrap sample of the training set drawn along both the feature and instance axes. It corrects for decision trees' tendency to overfit their training set. Because of its outstanding prediction performance, RF has been adopted in many scientific research fields, especially in bioinformatics [48]. RF has numerous parameters to be tuned, such as the number of trees to build before taking the majority vote, the maximum number of features used to train a tree and the minimum leaf size of an individual tree. Again, grid search is used to tune these parameters.
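Analogously, the three RF parameters named above can be tuned with a grid search; the data and grid values here are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=120, n_features=20, random_state=0)

# Grid over the number of trees, max features per tree and minimum leaf size.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={
        "n_estimators": [50, 100],
        "max_features": ["sqrt", 0.5],
        "min_samples_leaf": [1, 3],
    },
    cv=3,
)
grid.fit(X, y)
```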

Naive Bayes
Naive Bayes is a probabilistic classifier based on applying Bayes' theorem with the strong assumption of independence between features. Although this assumption does not hold in most cases, NB performs soundly in many applications [49,50].
For a given input X = (x_1, …, x_n), NB assigns it a class label c by maximizing the posterior:

ĉ = argmax_c P(c) · Π_{i=1}^{n} P(x_i | c).

In this work, we use the Gaussian distribution to estimate the class-conditional densities P(x_i | c).
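A minimal Gaussian NB illustration on synthetic two-class data (scikit-learn's GaussianNB fits one Gaussian per feature and class, matching the independence assumption above):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(3)
# Two well-separated Gaussian classes in 4 dimensions.
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(3, 1, (50, 4))])
y = np.array([0] * 50 + [1] * 50)

nb = GaussianNB().fit(X, y)
# Posterior probabilities for one point near each class center.
probs = nb.predict_proba([[0, 0, 0, 0], [3, 3, 3, 3]])
```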

Model Stacking
In contrast to an ordinary classifier, which tries to learn one hypothesis from the training data, ensemble learning constructs a set of hypotheses and combines them. An ensemble contains a number of learners, usually called base learners. The generalization ability of an ensemble is usually much stronger than that of its base learners, which is why base learners are often called weak learners, although using not-so-weak base learners often results in better performance. Bagging, boosting, stacking and voting are some typical ensemble strategies [51]. Bagging trains a number of base learners, each on a different bootstrap sample, and combines them by majority voting, predicting the most-voted class. Boosting uses subsets of the training set to produce a series of averagely performing models and then "boosts" their performance by combining them, for example by a (weighted) majority vote. Majority voting assigns the final class label of each sample as the one predicted by the (weighted) majority of base classifiers. Stacking utilizes another classifier to combine the results of the base classifiers. Because the stacking strategy offers inherent parallelization, flexible combination and mathematical evaluation of the base classifiers [52], we use it as the base framework, combining the base classifiers and making the final prediction with logistic regression.
Logistic regression is a mathematical method that models the relationship between categories and numerical feature vectors [53]. It uses the sigmoid function to solve two-class problems and the softmax function for multi-class problems. For two-class problems, given a protein sequence and the corresponding feature vector F = (f_1, …, f_N), the relationship between the two class labels and the feature vector is defined as:

P(y = 1 | F) = 1 / (1 + exp(−(θ_0 + Σ_{i=1}^{N} θ_i · f_i))),

where θ_0 to θ_N are the unknown parameters, which can be obtained by maximum likelihood estimation.
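For the two-class case, the prediction reduces to a sigmoid of a weighted sum of the base-model probabilities; a tiny sketch with made-up coefficients:

```python
import math

def sigmoid_prob(features, theta0, thetas):
    """P(y = 1 | F) = 1 / (1 + exp(-(theta0 + sum_i theta_i * f_i)))."""
    z = theta0 + sum(t * f for t, f in zip(thetas, features))
    return 1.0 / (1.0 + math.exp(-z))

# Made-up coefficients over four base-model probabilities (illustrative only).
p = sigmoid_prob([0.9, 0.8, 0.7, 0.6], theta0=-1.5, thetas=[1.0, 1.0, 1.0, 1.0])
```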

Performance Measures
We employed leave-one-out cross-validation and five-fold cross-validation to evaluate the predictors' capability. In the five-fold cross-validation, the training dataset was randomly divided into five equal parts; each part in turn was used for testing while the remaining four were used for training. In the leave-one-out cross-validation, each sample in the training dataset was used once for testing, with the remaining samples used for training.
Four evaluation metrics, Sensitivity (SN), Specificity (SP), Accuracy (ACC) and Matthew's Correlation Coefficient (MCC), were used as the performance indicators. Their formulas are as follows:

SN = TP / (TP + FN)
SP = TN / (TN + FP)
ACC = (TP + TN) / (TP + TN + FP + FN)
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

where TP (TN) denotes the number of positive (negative) samples correctly predicted and FP (FN) denotes the number of negative (positive) samples incorrectly predicted. All the source code and data are available on GitHub at https://github.com/gongxjtju/MSFBinder.
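The four metrics follow directly from the confusion-matrix counts:

```python
import math

def metrics(tp, tn, fp, fn):
    """Sensitivity, Specificity, Accuracy and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, acc, mcc

# Toy counts, not results from the paper.
sn, sp, acc, mcc = metrics(tp=80, tn=70, fp=20, fn=10)
```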

Performance Comparisons of Different Feature Representations Only Using SVM as the Training Model
In order to evaluate the prediction capability of the different feature extraction methods, five-fold cross-validation was applied using SVM as the training model. As shown in Table 4, most of the feature spaces achieved acceptable performance. Local_DPP beat the other three with an accuracy of 78.32%, an MCC of 0.5681 and an SP of 75.63%. These results suggest that the sequence order and local evolutionary information contribute significantly to the prediction of DNA-binding proteins. The performance of AC_struct was inferior to the others, which might have been caused by false predictions of the secondary structures by the PSIPRED program.

Performance Comparisons Using Different Base Classifiers
In this section, we build three types of predictors within the MSFBinder framework: MSFBinder (SVM), MSFBinder (SVM, RF) and MSFBinder (SVM, RF, NB). They were trained on the four feature spaces using SVM, SVM + RF and SVM + RF + NB as the base classifiers, yielding 4, 8 and 12 base models, respectively. These models were then fed into the LR to make the final predictions. For each predictor, we compared the performances under two parameter setups of Local_DPP. Leave-one-out cross-validation was applied to evaluate the performances. The results on PDB1075 and PDB186 are shown in Tables 5 and 6. Table 5 shows that the second predictor with n = 3 and λ = 1 achieved the best performance with ACC = 84.84%, MCC = 0.6969, SN = 85.52% and SP = 84.18%. The first predictor with n = 3 and λ = 1 performed the worst with ACC = 83.35%, MCC = 0.6670, SN = 83.62% and SP = 83.09%. The gaps in ACC, MCC, SN and SP between the best and the worst were 1.49%, 0.0299, 1.9% and 1.09%, respectively. For the test dataset shown in Table 6, the first predictor with n = 2 and λ = 2 achieved the best performance, while the second predictor with n = 3 and λ = 1 performed the worst. The best one beat the worst in terms of ACC, MCC, SN and SP by 2.69%, 0.0389, −3.22% and 8.6%, respectively. Although the first predictor performed slightly worse on the training dataset, it worked best on the testing dataset, suggesting that it has good generalization. All three predictors outperformed the model using only a single SVM.

Significance Analysis of Different Base Models for Three Predictors
To further demonstrate the contributions of the different base classifiers, combined with their corresponding feature representations, to the final prediction, the LR coefficients of the base models over the five-fold cross-validations are drawn in Figures 3-5, in which the parameters of Local_DPP are fixed at n = 2 and λ = 2.
In the first predictor (Figure 3), the PSSM_DWT features contributed the most, followed by the Local_DPP features; both were stable across all five folds of the cross-validation, while the other two feature sets depended heavily on the datasets. Figure 4 shows that the RF model preferred the PSSM_DWT features, and their combination made the biggest contribution across all five folds. SVM combined with the Local_DPP features ranked second. Both types of features worked well with the SVM and RF classifiers. The 188D features were not sensitive to the datasets for SVM and RF, while the contribution of the AC_struct features was smaller than the others for all folds.
For the third predictor (Figure 5), the Local_DPP and PSSM_DWT features played similar roles to those in the second predictor for SVM and RF. Both of them had significant effects on the predictions. The 188D features were sensitive to the datasets for RF and SVM, and a similar case applied to the AC_struct and SVM models. It should be noted that the NB classifiers were more stable across all of the features, but had lower contributions.

T-Test Analysis of Different Base Models for Three Predictors and Performance Measures
To further show the statistical significance of base models for the three predictors, we performed T-test analysis on the LR base models of each predictor for the five-fold cross-validations. The results are shown in Figures 6-8.
These results demonstrate that there was no big difference in the p-values of the base models across all folds, except for the ones involving the AC_struct features. SVM performed the best with all feature spaces.
Figure 6. T-test for the first predictor. The y-axis denotes the distance between the p-value and the threshold; the larger the value, the greater the distance. The x-axis denotes the combination of different classifiers and feature extraction methods.

Performance Comparisons with Single Classifiers
To verify the effectiveness of the presented method, we compared its performance to that of the single classifiers by leave-one-out cross-validation. Three classifiers, SVM, RF and LR, were used. The results are shown in Table 7. MSFBinder with n = 2, λ = 2 achieved better performance than any of the others, and MSFBinder with n = 3, λ = 1 performed slightly worse than the first one. Among the single classifiers, the SVM model with n = 2, λ = 2 worked best with ACC = 82.60%, MCC = 0.6527, SN = 84.19% and SP = 81.09%. Though the SN value of MSFBinder was lower than that of the SVM model, its ACC, MCC and SP values were higher than those of the other single classifiers.

Performance Comparisons with Majority Voting-Based Methods
The majority voting strategy is another popular model ensembling method. It assigns the final class label of each sample using the one predicted by the (weighted) majority of base classifiers. We ran SVM, RF and LR on the four feature spaces, respectively; then simple majority voting was applied to make the final predictions; see Table 8 for the results. The ACC, MCC and SN values of MSFBinder were 1.93%, 0.0383 and 1.73% higher than the best of the majority voting-based methods, respectively. MSFBinder with either type of parameters was better than the majority voting methods. Compared to Table 4, both kinds of model ensembling performed better than the models using only single classifiers.
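Unweighted majority voting over the base classifiers' label outputs can be sketched as (a toy example, not the paper's exact setup):

```python
import numpy as np

def majority_vote(predictions):
    """Unweighted majority vote over a (n_classifiers, n_samples) 0/1 array."""
    predictions = np.asarray(predictions)
    votes = predictions.sum(axis=0)
    return (2 * votes > predictions.shape[0]).astype(int)

# Three classifiers voting on three samples (toy labels).
final = majority_vote([[1, 0, 0], [1, 1, 0], [0, 1, 0]])
```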

Stability Comparisons with the Single Classifiers and Majority Voting Methods
The stability of a learning algorithm refers to the changes in the output of the system when the training dataset is changed. A learning algorithm is said to be stable if the learned model does not change much when the training dataset is modified. There are several ways to modify the training set, such as choosing different subsets, using different feature sets and adding noise to the training set. Mathematically speaking, there are many ways of determining the stability of a learning algorithm; common ones include hypothesis stability, error stability and leave-one-out cross-validation stability.
Here, we tested the stability using different subsets by 5 × 5-fold cross-validation methods. By running the five-fold cross-validation five times on the training set using the four feature sets, the means and standard deviations of four evaluation metrics were calculated; see Table 9.
The results show that MSFBinder achieved the best mean performance with ACC = 83.70%, MCC = 0.6744, SN = 84.47% and SP = 83.19%. It beat RF in terms of ACC, MCC, SN and SP by 2.34%, 0.0474, 3.37% and 1.59%, respectively. Meanwhile, RF achieved the minimal standard deviations for all the measures, suggesting that it was the most stable method. The standard deviations of the above four measures for MSFBinder ranked 5th, 5th, 2nd and 4th among the seven methods, respectively.
Performance Comparisons with Existing Methods
On the training set, iDNAProt-ES ranked first for all the evaluation metrics, and MSFBinder ranked second except for the SP metric. The superior performance of iDNAProt-ES might lie in its feature set, which incorporates various transformations of both the sequence evolutionary information and the predicted secondary structure information, while our method only used the auto-covariance features for the predicted secondary structure information. On the testing set, MSFBinder (n = 2, λ = 2) ranked first in terms of ACC (81.72%), and its MCC (0.6417) and SN (93.55%) values were almost the same as the best values (0.647 and 94.62%) of the others.
It is noteworthy that the ACC, SN and SP values of iDNAProt-ES were 6.65%, 6.57% and 6.73% higher than ours on the training dataset, while on the testing dataset the ACC, MCC and SN values of our method were 1.08%, 0.0287 and 7.94% higher than those of iDNAProt-ES. Thus, MSFBinder has better generalization ability than existing methods.

Conclusions
Orchestrating features and classification algorithms is one of the most tedious tasks in predicting the spatial structures or functions of biological sequences. The stacking model provides a way to combine and evaluate loosely-coupled base models. In this paper, we propose a model stacking framework that integrates multi-view features and classifiers to investigate how to combine and validate these base models for predicting DNA-binding proteins. Comprehensive experiments demonstrated that MSFBinder achieves state-of-the-art performance on both the training and testing datasets and outperforms most of the existing methods. We also explored how the base classifiers match the corresponding feature spaces. For instance, RF prefers the PSSM_DWT features, SVM prefers the Local_DPP features, and both kinds of features play significant roles in the prediction of DNA-binding proteins. It is noteworthy that the NB classifier is more stable across all the features, although its coefficients are not very high. Meanwhile, all three predictors outperform the single classifiers, and there is no big performance difference between homogeneous and heterogeneous base models. These findings might yield important clues for designing new feature extraction and classification algorithms for future proteomics studies.