FKRR-MVSF: A Fuzzy Kernel Ridge Regression Model for Identifying DNA-Binding Proteins by Multi-View Sequence Features via Chou’s Five-Step Rule

DNA-binding proteins play an important role in cell metabolism. In biological laboratories, DNA-binding proteins are detected by yeast one-hybrid, bacterial one-hybrid and X-ray crystallography methods, among others, but these methods require considerable labor, material and time. In recent years, many computational approaches have been proposed to detect DNA-binding proteins. In this paper, a machine learning-based method, called the Fuzzy Kernel Ridge Regression model based on Multi-View Sequence Features (FKRR-MVSF), is proposed to identify DNA-binding proteins. First, multi-view sequence features are extracted from protein sequences. Next, a Multiple Kernel Learning (MKL) algorithm is employed to combine the multiple features. Finally, a Fuzzy Kernel Ridge Regression (FKRR) model is built to detect DNA-binding proteins. Compared with other methods, our model achieves good results: it obtains accuracies of 83.26% and 81.72% on two benchmark datasets (PDB1075 and PDB186), respectively.


Introduction
Interactions between DNA and proteins occur in various tissues of the living body. For example, DNA-protein interactions take place during many activities such as DNA replication, DNA repair, DNA packaging, DNA modification, and viral infection. The study of DNA-binding residues in DNA-protein interactions facilitates a comprehensive understanding of the mechanisms of chromatin recombination and gene-regulated expression. DNA-binding proteins are mainly detected by biochemical and physicochemical methods. However, such wet-lab experimental methods are both time-consuming and costly.

Data Sets
In our study, two benchmark datasets (PDB1075 and PDB186) are used to test our predictive model of DNA-binding proteins. Both were collected from the Protein Data Bank (PDB) [41]. Liu et al. [26] randomly extracted non-DNA-binding and DNA-binding proteins from the PDB database. The similarity of any two sequences does not exceed 25%. A total of 525 DNA-binding proteins and 550 non-DNA-binding proteins form the PDB1075 dataset. The PDB186 dataset [42] contains 93 DNA-binding and 93 non-DNA-binding proteins. Table 1 lists the information of the two benchmark datasets.

Measurements
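Accuracy (ACC), Sensitivity (SN), Specificity (SP) and Matthews Correlation Coefficient (MCC) are used to evaluate the performance of the predictive model. These coefficients are calculated as follows:

$$ \mathrm{SN} = 1 - \frac{N^{+}_{-}}{N^{+}}, \qquad \mathrm{SP} = 1 - \frac{N^{-}_{+}}{N^{-}}, \qquad \mathrm{ACC} = 1 - \frac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}} $$

$$ \mathrm{MCC} = \frac{1 - \left( \frac{N^{+}_{-}}{N^{+}} + \frac{N^{-}_{+}}{N^{-}} \right)}{\sqrt{\left( 1 + \frac{N^{-}_{+} - N^{+}_{-}}{N^{+}} \right) \left( 1 + \frac{N^{+}_{-} - N^{-}_{+}}{N^{-}} \right)}} $$

where $N^{+}$ and $N^{-}$ are the total numbers of positive and negative samples, respectively, and $N^{-}_{+}$ and $N^{+}_{-}$ are the numbers of false positives and false negatives, respectively. The Area Under the ROC Curve (AUC) is also an effective evaluation measure for binary classification.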
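The four threshold-based measures can be computed directly from true and predicted labels. The following is a minimal Python sketch, assuming labels in {+1, −1} and that both classes are present; the function name `binary_metrics` is illustrative and not part of the released code.

```python
import math

def binary_metrics(y_true, y_pred):
    """Return ACC, SN, SP and MCC for labels in {+1, -1}.

    Assumes at least one positive and one negative sample in y_true.
    """
    n_pos = sum(1 for t in y_true if t == 1)     # N+
    n_neg = sum(1 for t in y_true if t == -1)    # N-
    # False negatives: positives predicted as negative
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    # False positives: negatives predicted as positive
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)

    sn = 1 - fn / n_pos
    sp = 1 - fp / n_neg
    acc = 1 - (fn + fp) / (n_pos + n_neg)
    denom = math.sqrt((1 + (fp - fn) / n_pos) * (1 + (fn - fp) / n_neg))
    # MCC is undefined when all predictions fall in one class; return 0 then
    mcc = (1 - (fn / n_pos + fp / n_neg)) / denom if denom > 0 else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc}
```

This form of MCC is algebraically the same as the usual confusion-matrix definition, so the result can be cross-checked against any standard implementation.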

Performance Analysis of Different Features on the PDB1075 Data Set
A single type of feature cannot fully describe the properties of a protein, so we build the predictive model with multi-view sequence features. We test these features (kernels) on the PDB1075 dataset using the Jackknife test, as shown in Table 2. The PSSM-based features (PSSM-AB and PsePSSM) achieve better performance than the non-PSSM features (MCD and NMBAC) when used alone. The MCC values of the MCD, NMBAC, PSSM-AB and PsePSSM features are 0.4139, 0.4564, 0.5113 and 0.5886, respectively. In addition, the mean weighted kernel (KRR), which combines the four kernels (features) with equal weights, obtains better performance (MCC: 0.6398) than any single feature. Compared with mean weights (KRR), MKL (KRR) achieves a higher MCC (0.6439). FKRR weights the training samples by fuzzy membership, which can filter outliers. Consequently, mean weights (FKRR) (MCC: 0.6554) and MKL (FKRR) (MCC: 0.6664) both outperform their KRR counterparts, and MKL (FKRR), which uses both multiple kernel information and fuzzy membership, achieves the best MCC of 0.6664.

Performance on an Independent DataSet of PDB186
To evaluate the generalization performance of the predictive models, FKRR-MVSF and other methods are also tested on the independent dataset PDB186 (the training set is PDB1075). The results are shown in Table 4.

Discussion
To improve the performance of predicting DNA-binding proteins, we employ an MKL algorithm to integrate different features and a fuzzy-based model to handle outliers. There are many ways in machine learning to avoid overfitting and skewed models caused by outliers, e.g., adjusting the cost value in SVM. The cost parameter should differ between training samples, because different samples contribute differently to the model. In Table 2, the performance (MCC: 0.6664) of the fuzzy-based model (FKRR with MKL) is better than that of the non-fuzzy model (KRR with MKL, MCC: 0.6439).
Compared to the other single kernels, the PsePSSM-based kernel achieves the highest weight and the highest MCC (0.5886). MKL can integrate multiple sources of sequence information. Our method (KRR with MKL) also achieves a better MCC (0.6439) than any single kernel model on the PDB1075 dataset. In addition, the performance of KRR with MKL (MCC: 0.6439) is better than that of KRR with mean weights (MCC: 0.6398) on the PDB1075 dataset.
On the independent test dataset, our method (FKRR with MKL) also achieves a better MCC (0.676). MSFBinder (SVM) [48] is a two-layer model with SVM that also employs several features to build a predictive model. The generalization performance of FKRR (with MKL) is better than that of MSFBinder (MCC: 0.640) on the independent test set (PDB186). The two models are similar; the main reason for the different results is that the parameter C of FKRR is different for each training sample. Fuzzy membership may reduce the effect of noisy samples on the model.

Materials and Methods
The prediction of DNA-binding proteins can be regarded as a binary classification task. Each protein is represented by feature vectors. DNA-binding proteins and non-DNA-binding proteins are labeled as +1 (positive samples) and −1 (negative samples), respectively. We construct a Fuzzy Kernel Ridge Regression model based on Multi-View Sequence Features (FKRR-MVSF) to determine whether a protein binds to DNA. We employ Normalized Moreau-Broto Auto Correlation (NMBAC) [49,50], PSSM based Average Blocks (PSSM-AB) [51], Multiple-scale Continuous and Discontinuous descriptor (MCD) [52] and PsePSSM algorithms to extract four types of sequence features. A Radial Basis Function (RBF) is used to build four kernels from these four kinds of features. The MKL algorithm is then employed to calculate the weights of the kernels and to combine them. Next, a membership score is estimated for each training sample. Finally, a fuzzy kernel ridge regression model for identifying DNA-binding proteins is constructed from the membership scores and the combined kernel. The framework of the proposed method is shown in Figure 3. In the literature [13,33], researchers have made good use of flowcharts to describe the main framework of their methods. In our work, we employ Figure 4 to describe the flow of our model. First, we extract four types of features from a sequence. Then, the RBF is used to build four kernels. These kernels are combined by MKL. Finally, the combined kernel and training labels are employed to construct the FKRR model and predict new samples.

Feature Extraction
Extracting features from proteins is a challenge for identifying DNA-binding proteins. A suitable feature extraction algorithm can adequately represent the properties of the protein. We use four types of features to describe a protein. In the MCD algorithm [52], the sequence was split into 10 local regions, which described multiple overlapping continuous and discontinuous interaction patterns. Composition (C), Transition (T) and Distribution (D) were calculated in each local region. A detailed description of the MCD algorithm can be found in You's work [52]. The MCD feature was an 882-dimensional vector.

NMBAC Feature
Normalized Moreau-Broto Auto Correlation (NMBAC) [49,50] was proposed for extracting the sequence feature of membrane proteins. A protein sequence (string) can be represented as a discrete numerical sequence via six physicochemical properties of Amino Acids (AA): Hydrophobicity (H), Net Charge Index of Side Chains (NCISC), Solvent-Accessible Surface Area (SASA), Volumes of Side Chains of amino acids (VSC), Polarity (P1) and Polarizability (P2). The six physicochemical properties of amino acids are listed in Table 5. For a protein X of length n, the NMBAC feature is calculated by the following equation:

$$ \mathrm{NMBAC}(lag, j) = \frac{1}{n - lag} \sum_{i=1}^{n - lag} X_{i,j} \, X_{i+lag,j} $$

where $X_{i,j}$ is the value of the j-th physicochemical property at position i, i denotes the position in the sequence (i = 1, 2, ..., n − lag), j is the type of physicochemical property (j = 1, 2, ..., 6), lag ∈ [1, lg] is the gap between amino acids, and lg is a parameter giving the maximum distance.
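The auto-correlation described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the released implementation: the input `props` is assumed to be an n × 6 array in which row i already holds the (normalized) property values of the residue at position i, since the values of Table 5 are not reproduced here, and the function name `nmbac` and the output ordering (by lag, then by property) are our own choices.

```python
import numpy as np

def nmbac(props, lg):
    """Auto-correlation features for one protein.

    props : array of shape (n, p); props[i, j] is the value of
            property j for the residue at position i (assumed input).
    lg    : maximum lag; must be smaller than the sequence length n.
    Returns a vector of length lg * p, ordered by lag, then property.
    """
    props = np.asarray(props, dtype=float)
    n, _ = props.shape
    if lg >= n:
        raise ValueError("lg must be smaller than the sequence length")
    feats = []
    for lag in range(1, lg + 1):
        # Products of property values that are `lag` positions apart,
        # averaged over the n - lag available position pairs
        prod = props[: n - lag] * props[lag:]
        feats.append(prod.sum(axis=0) / (n - lag))
    return np.concatenate(feats)
```

With six properties, the resulting vector has 6 × lg entries, so lg directly controls the dimensionality of the NMBAC feature.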

Multiple Kernel Learning
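RBF is employed to construct four types of kernels from the above features (MCD, NMBAC, PSSM-AB and PsePSSM):

$$ K_{ij} = K(x_i, x_j) = \exp\left( -\gamma \| x_i - x_j \|^2 \right), \quad i, j = 1, 2, ..., N $$

where γ is the Gaussian kernel bandwidth, N is the number of samples, and $x_i$ and $x_j$ are the feature vectors of samples i and j. The four types of features can be represented as a kernel set:

$$ \mathbf{K} = \{ K_{MCD}, K_{NMBAC}, K_{PSSM\text{-}AB}, K_{PsePSSM} \} $$

The MKL algorithm combines multi-view features from different sources. Some kernels may be biased in the learning process, and MKL can reduce their influence by assigning them low weights. The optimal kernel $K^{*}_{train}$ is obtained as follows:

$$ K^{*}_{train} = \sum_{h=1}^{H} \omega_h K^{h}_{train} $$

where H denotes the number of basic kernels and $K^{h}_{train}$ is the h-th basic training kernel. The MKL algorithm [54] estimates the optimal weights of the kernels by minimizing the distance between the ideal kernel $K_{ideal}$ and the optimal kernel $K^{*}_{train}$. The ideal kernel $K_{ideal} = y_{train} y_{train}^{T} \in R^{N \times N}$ encodes the information of the label space, where $y_{train} \in R^{N \times 1}$ is the vector of training labels. We want the optimal kernel $K^{*}_{train}$ to be close to $K_{ideal}$:

$$ \min_{\omega} \; \| K^{*}_{train} - K_{ideal} \|_F^2 + \lambda \| \omega \|^2 $$

where $\| X \|_F^2 = \mathrm{Trace}(X X^{T})$, λ is a regularization parameter, and $\omega = [\omega_1, \omega_2, ..., \omega_H]^{T}$ is the vector of kernel weights.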
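The kernel construction and weighting can be illustrated with a short Python sketch. It is a simplified illustration rather than the solver of [54]: we assume the weights minimize the unconstrained objective ‖Σ_h ω_h K_h − y yᵀ‖²_F + λ‖ω‖², which has a closed-form solution, and we impose no non-negativity or sum-to-one constraint on ω. The function names `rbf_kernel`, `mkl_weights` and `combine_kernels` are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negative round-off

def mkl_weights(kernels, y_train, lam):
    """Closed-form weights for the assumed unconstrained objective."""
    y = np.asarray(y_train, dtype=float).reshape(-1, 1)
    k_ideal = y @ y.T  # ideal kernel built from the training labels
    H = len(kernels)
    # Frobenius inner products between basic kernels, and with the ideal kernel
    A = np.array([[np.sum(Ka * Kb) for Kb in kernels] for Ka in kernels])
    b = np.array([np.sum(K * k_ideal) for K in kernels])
    # Setting the gradient to zero gives (A + lam * I) w = b
    return np.linalg.solve(A + lam * np.eye(H), b)

def combine_kernels(kernels, weights):
    """Weighted sum of the basic kernels."""
    return sum(w * K for w, K in zip(weights, kernels))
```

The same weights learned on the training kernels would be applied to the corresponding test kernels (test samples against training samples) before prediction. If constrained weights are required, the linear solve would be replaced by a quadratic programming step.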

Fuzzy Kernel Ridge Regression
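Kernel ridge regression is a method from statistics that implements a form of Regularized Least Squares (RLS). Given training samples $(x_i, y_i)$, i = 1, 2, ..., N, where N is the number of samples and $x_i$ and $y_i$ are the feature vector and label of sample i, the RLS aims to find the minimum of the following function:

$$ J(\alpha) = \frac{1}{2} \| \alpha \|^2 + \frac{C}{2} \| K_{train} \alpha - y_{train} \|^2 $$

where $K_{train} \in R^{N \times N}$ is the training kernel and C is the non-negative regularization term. The solution of KRR is:

$$ \alpha = \left( \frac{I}{C} + K_{train}^{T} K_{train} \right)^{-1} K_{train}^{T} y_{train} $$

In this paper, we present a Fuzzy Kernel Ridge Regression (FKRR) for classification. We need to minimize the sum of errors ($\| K_{train} \alpha - y_{train} \|^2$). The contribution of sample $x_i$ to the decision boundary should be proportional to its fuzzy membership value. The objective function is:

$$ J(\alpha) = \frac{1}{2} \| \alpha \|^2 + \frac{C}{2} (K_{train} \alpha - y_{train})^{T} D (K_{train} \alpha - y_{train}) $$

where $D \in R^{N \times N}$ is a diagonal matrix whose element $D_{ii}$ ($0 \le D_{ii} \le 1$) represents the fuzzy membership value of sample $x_i$. We set $\partial J / \partial \alpha = 0$ and the solution for α is obtained as follows:

$$ \alpha = \left( \frac{I}{C} + K_{train}^{T} D K_{train} \right)^{-1} K_{train}^{T} D \, y_{train} $$

where $I \in R^{N \times N}$ is the identity matrix. The decision function is:

$$ y_{test} = K_{test} \alpha $$

where $y_{test} \in R^{M \times 1}$ contains the predicted labels, $K_{test} \in R^{M \times N}$ denotes the kernel of the testing samples, and M is the number of testing samples.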
To compute the fuzzy membership values of the training samples, a score is computed for each training point from the optimal training kernel $K^{*}_{train}$, where $score_t$ denotes the score of training point t. A sample with a larger score may contribute more to the model. The scores are then normalized into fuzzy membership values in the range 0-1 for t = 1, 2, ..., N.
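The weighted fit and the prediction step can be sketched as follows. This is a hedged illustration, not the released code. It assumes the objective J(α) = ½‖α‖² + (C/2)(Kα − y)ᵀD(Kα − y), whose stationary point is α = (I/C + KᵀDK)⁻¹KᵀDy. It also assumes a min-max rescaling of the raw scores to [0, 1], which is one common choice and may differ from the normalization used in the paper. The scoring function itself is not reproduced, so raw scores are taken as given. All function names are hypothetical.

```python
import numpy as np

def minmax_membership(scores):
    """Rescale raw scores linearly to [0, 1] (assumed normalization)."""
    s = np.asarray(scores, dtype=float)
    span = s.max() - s.min()
    if span == 0:
        return np.ones_like(s)  # all samples weighted equally
    return (s - s.min()) / span

def fkrr_fit(K_train, y_train, membership, C):
    """Solve (I/C + K^T D K) alpha = K^T D y with D = diag(membership)."""
    K = np.asarray(K_train, dtype=float)
    y = np.asarray(y_train, dtype=float)
    d = np.asarray(membership, dtype=float)
    N = K.shape[0]
    KtD = K.T * d  # same as K.T @ np.diag(d), without forming D
    return np.linalg.solve(np.eye(N) / C + KtD @ K, KtD @ y)

def fkrr_predict(K_test, alpha):
    """Real-valued outputs K_test @ alpha; their sign gives the class."""
    return np.asarray(K_test, dtype=float) @ alpha
```

A protein would be predicted as DNA-binding when its output is positive. Setting every membership value to 1 reduces `fkrr_fit` to the unweighted ridge solution of the same objective, which gives a simple way to compare the fuzzy and non-fuzzy models on the same kernel.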

Conclusions
FKRR-MVSF achieves better results on the independent dataset (MCC: 0.676). Eliminating noise points can improve the predictive performance of the model. In the future, we aim to use other fuzzy membership functions to build fuzzy models for filtering the noise points. Following PseAAC-based methods [13,33,39,40,55–60], we will establish a web-server for our model. The related code and datasets can be downloaded from: https://figshare.com/s/e80f1a96b7b7bbf8062b.

Acknowledgments: The authors would like to thank all the guest editors and anonymous reviewers for their constructive advice.

Conflicts of Interest:
The authors declare no conflict of interest.